Sentinel: Runtime Data Management on Heterogeneous Main Memory Systems
  for Deep Learning by Ren, Jie et al.
Sentinel: Runtime Data Management on Heterogeneous Main
Memory Systems for Deep Learning
Jie Ren
University of California, Merced
jren6@ucmerced.edu
Jiaolin Luo
University of California, Merced
jluo38@ucmerced.edu
Kai Wu
University of California, Merced
kwu42@ucmerced.edu
Minjia Zhang
Microsoft Research
minjiaz@microsoft.com
Dong Li
University of California, Merced
dli35@ucmerced.edu
ABSTRACT
Software-managed heterogeneous memory (HM) provides
a promising solution to increase memory capacity and cost
efficiency. However, to unlock the performance potential of
HM, we face a problem of data management. Given an appli-
cation with various execution phases and each with possibly
distinct working sets, we must move data between memory
components of HM to optimize performance. The deep neural
network (DNN), a common workload in data centers,
poses great challenges for data management on HM. This
workload often employs a task dataflow execution model
and features a large number of small data objects and
fine-grained operations (tasks). This execution model makes
memory profiling and efficient data migration difficult.
We present Sentinel, a runtime system that automatically
optimizes data migration (i.e., data management) on HM
to achieve performance similar to that on the fast memory-
only system with a much smaller capacity of fast memory.
To achieve this, Sentinel exploits domain knowledge about
deep learning to adopt a custom approach for data manage-
ment. Sentinel leverages workload repeatability to break the
dilemma between profiling accuracy and overhead. It enables
profiling and data migration at the granularity of data objects
(not pages), by controlling memory allocation. This method
bridges the semantic gap between operating system and ap-
plications. By associating data objects with the DNN topology,
Sentinel avoids unnecessary data movement and proactively
triggers data movement. Using only 20% of peak memory
consumption of DNN models as fast memory size, Sentinel
achieves the same or comparable performance (at most 8% per-
formance difference) to that of the fast memory-only system
on common DNN models; Sentinel also consistently outper-
forms a state-of-the-art solution by 18%.
1 INTRODUCTION
Heterogeneous memory (HM) is an emerging memory ar-
chitecture, complementary to the existing heterogeneity in
processing units (e.g., GPU and FPGA). Within HM, multiple
memory components with different technologies are combined
to construct main memory within and across compute
nodes. HM offers a promising way to increase memory
capacity, avoid the limitations of existing memory technologies,
and increase energy efficiency. With emerging technologies
such as non-volatile memory (NVM) and high-bandwidth
memory (HBM) [33], HM is expected to become more common.
A typical HM system consists of multiple types of memory
components with varying properties (e.g., bandwidth, latency,
and capacity). As a result, HM raises a problem of data man-
agement and migration. Given an application with various
execution phases and each with possibly distinct working
sets, we must move data between memory components to
optimize performance. Ideally, hot memory pages that are
accessed by the running execution phase should be placed
in the fastest memory component with the best latency or
bandwidth, while other memory pages are filled into other
memory components.
The data management problem becomes more complicated when
we consider memory size. In a public cloud, the user is charged
not only for processors, but also for memory
size. Fast memory in HM tends to be more expensive (e.g., in
Google Cloud, regular DDR is 24x more expensive than fast
SSD). The cost of using fast memory accumulates throughout
application execution, making the cloud service less
affordable for time-consuming applications. Also, the total cost
of ownership (TCO) for fast memory increases quickly (e.g.,
the average price of DRAM DDR4 (a common fast memory)
increased by 2.3x between 2016 and early 2019 [46, 58]), which
motivates the data center to reduce the usage of fast memory
as much as possible [24]. In general, reducing fast memory
size without performance loss becomes a critical optimization
target to reduce operation cost [13, 24].
Our work focuses on data management on HM for training
the deep neural network (DNN). DNN is a common workload
on data centers. Given the success of DNN in many appli-
cations and growing trend of training high-quality models
specific to individual businesses through automated machine
learning, training DNN efficiently is increasingly important to
reduce business cost and improve utilization of data centers.
Since modern HM typically serves CPU, we study CPU
with HM for DNN training. Using CPU for training is commonly
supported by hardware vendors (e.g., Intel MKL-DNN [3]
and ARM Compute Library [12]), and has the following four
benefits. First, the trend of democratizing DNN [54] makes
CPU an appealing solution. Compared with GPU, CPU is
more approachable and affordable, especially for personal
users or small-sized enterprises. Second, some data centers
do not have GPU and simply use CPU for training. Such examples
include Cori [39] at Lawrence Berkeley National
Lab and Stampede2 [16] at TACC for scientific machine learning
[37, 38, 47, 64]. Third, for those DNN models that lack
thread-level parallelism, GPU can perform worse than CPU.
For example, when training some CNN models (e.g., the wide-and-deep
model [21]) and some deep reinforcement learning models
(e.g., DQN [29]), CPU performs faster than GPU. In our evaluation,
using an 8-core Intel i7-7700K CPU and an NVIDIA Titan XP
GPU to train the wide-and-deep model, the training throughput
on CPU is about 4x that on GPU (763 and 196 global steps per
second for CPU and GPU respectively). Fourth, on a public
cloud, CPU is cheaper than GPU. For example, on Google
Cloud, one vCPU costs only 1/46 and 1/78 as much per hour as an
NVIDIA P100 and V100 GPU, respectively. When the training
throughputs on CPU and GPU are comparable, using CPU
can easily be more cost-effective.
arXiv:1909.05182v1 [cs.PF] 11 Sep 2019
The workload characteristics of DNN training impose
unique challenges on data management on HM. First, DNN
training typically involves a large number of data objects
smaller than a memory page (4KB). Profiling memory accesses
to those data objects to decide data placement on
HM is challenging, because those small data objects can share
memory pages, making it difficult to track memory accesses
for individual data objects.
Second, training DNN often employs a task dataflow execution
model, where the whole computation is decomposed
into a large number (thousands or even millions) of operations
in a single training step. Those operations (e.g., matrix multiplication
and 2D convolution) represent various execution
phases with diverse memory access patterns. Many of those
operations can run in parallel and have short execution times.
Placing data objects on HM for those operations requires
prompt data placement decisions and prompt data migration,
in order to make the best use of fast
memory and enable high performance.
Third, the semantic gap between operating system (OS)
and application (DNN in our case) can make data migration
less efficient. From the OS point of view, the memory page
is the primitive granularity for memory profiling and data
migration. However, from the application point of view, the
data object (or the tensor in the language of DNN) is the
primitive granularity for operations to access and compute.
Ideally, we want to migrate data at the granularity of data
objects to enable high performance of operations. However,
this conflicts with the abstraction (i.e., the memory page) that the OS
uses to manage data.
Unfortunately, the existing data management methods cannot
address the above challenges well. Many methods [10, 30,
36, 70, 71] use a sampling-based approach to profile memory
pages to avoid expensive profiling overhead, which can miss
memory accesses to small data objects and lead to incorrect
data migration decisions. Furthermore, the application semantics
are often ignored, leaving many opportunities to improve
performance on the table.
In this paper, we present Sentinel, a runtime system that au-
tomatically optimizes data migration (i.e., data management)
on HM to achieve performance similar to that on the fast
memory-only system with a smaller size of fast memory. To
achieve this, Sentinel exploits domain knowledge about deep
learning to adopt a custom approach for data management.
Sentinel leverages the fact that a DNN training workload
is highly repeatable and hence highly predictable. Such
a workload comprises millions of training steps, each of
which goes through exactly the same computation graph and
operations. As a result, Sentinel needs only a few training steps
for profiling. In addition, Sentinel enables profiling
and data migration at the granularity of data objects
(not pages), by controlling memory allocation. This method
bridges the semantic gap between OS and application. More
importantly, it allows the runtime system to use limited
domain information for runtime data management: by
associating data objects with the DNN topology, Sentinel
avoids unnecessary data movement and proactively triggers
data movement.
Data migration faces a fundamental tradeoff between migration
frequency and performance benefit. To save fast memory
capacity, we want to move data frequently between
fast and slow memory, so that unused data leaves fast memory
promptly and to-be-used data arrives in time.
However, frequent data movement can be exposed on the critical
path and cause performance loss. Hence, choosing an
appropriate migration interval is critical to reduce memory
capacity and avoid performance loss. Guided by the profiling
results, Sentinel explores the optimal migration interval for
best performance.
The key contributions of our work are as follows.
• Performance characterization. We use a data object-
centric (instead of page-centric) approach to system-
atically analyze the performance of DNN, a typical
task dataflow workload. We identify and leverage the
unique characteristics of such a workload to direct data
management at runtime on HM.
• Runtime system. We propose and evaluate a runtime
system for optimizing data placement and migration.
We determine the optimal migration interval based on
theoretical analysis, dynamic profiling, and DNN do-
main knowledge.
• Evaluation. We evaluate Sentinel using TensorFlow. The
evaluation results show that using only 20% of peak
memory consumption of DNN models as the fast mem-
ory size, Sentinel achieves the same or comparable per-
formance (at most 8% performance difference) to that of
the fast memory-only system on common DNN models.
Sentinel also consistently outperforms a state-of-the-art
solution by 18%.
2 BACKGROUND
We provide a brief background on the training process of
DNN models and HM.
2.1 Training Deep Learning Models
A typical DNN model comprises a stack of layers, each of
which is a group of neurons. Each neuron in a layer computes
a non-linear function of the outputs of neurons in the preceding
layer, using a set of weights. Training DNN often involves
a large number of iterative training steps. In each step, a batch
of training samples is fed into the DNN. Performance of each
step (e.g., execution time and memory access pattern) remains
stable across steps [43, 44, 63]. The above characteristics allow
us to use dynamic profiling of the first few training steps to
improve performance of the following steps.
Training DNN often uses a machine learning framework,
such as TensorFlow [8], PyTorch [5], and MXNet [4]. These
frameworks use a dataflow execution model where the whole
workload of DNN is modeled as a directed graph composed of
a set of nodes. Operations, such as 2D convolution, matrix mul-
tiplication, and array concatenation, are implemented by the
frameworks as primitives. Those operations are represented as
nodes in the dataflow graph. Within the graph, edges between
nodes capture dependencies between nodes.
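The dataflow model described above can be illustrated with a minimal sketch. The operation names and the `graph` dictionary are hypothetical and do not come from any framework; the sketch only shows how dependency edges constrain the order in which operations may run.

```python
from graphlib import TopologicalSorter

# Hypothetical toy graph: each key is an operation (node); each value is the
# set of nodes whose outputs it consumes (its incoming edges).
graph = {
    "conv2d": {"input", "weights"},
    "relu": {"conv2d"},
    "matmul": {"relu", "fc_weights"},
    "loss": {"matmul", "labels"},
}

def execution_order(deps):
    """Return one valid order in which the operations can run."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(graph)
print(order)
```

Operations with no path between them (for example `weights` and `labels`) may appear in either order, which is where operation-level parallelism comes from.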
2.2 Data Management on Heterogeneous
Memory
The online data management on HM typically involves three
fundamental steps: (1) memory profiling, (2) decision making
for data migration, and (3) data migration.
The memory profiling step collects memory access information
for pages or data objects. The decision-making step uses
performance models or caching algorithms to decide which
pages to migrate for best performance. The data migration
step performs the migration while minimizing its
overhead.
Recent research efforts. The three steps create major opti-
mization targets in the existing research efforts. We list the
targets as follows. Our work shares the same optimization
targets as the existing efforts.
• Memory profiling must be accurate while having a negligible
impact on application performance;
• The decision-making process must capture hot data
for migration in a timely manner without exceeding the capacity of
fast memory;
• Data migration must not hurt performance.
The existing efforts explore hardware techniques to profile
memory accesses and facilitate data movement among memory devices [9,
14, 55–57, 66, 72, 75]. Software-based techniques
use sampling-based approaches for memory profiling
to reduce profiling overhead [10, 30, 36, 70, 71]. They commonly
use a caching algorithm, such as multi-queue [30,
57, 77], FIFO [74], or LRU [36]. However, they often trade
memory profiling accuracy for low profiling overhead, and
hence can lose track of small data objects. Also, the process
of detecting hot pages may not trigger data migration in time.
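To see why sampling can lose track of rarely accessed data, consider the following sketch. It is a toy model, not any cited profiler: the trace, the page names, and the sampling period of 100 are all assumptions chosen for illustration.

```python
from collections import Counter

def sampled_counts(trace, period):
    """Count only every `period`-th access, as a sampling profiler would."""
    return Counter(trace[period - 1::period])

# Hypothetical trace: one hot page accessed 1000 times, interleaved with
# 50 cold pages that are each touched exactly once.
trace = []
for i in range(1000):
    trace.append("hot")
    if i % 20 == 0:
        trace.append(f"cold{i // 20}")

exact = Counter(trace)
sampled = sampled_counts(trace, period=100)
missed = sorted(page for page in exact if page not in sampled)
print(f"{len(missed)} of {len(exact)} pages never appear in the samples")
```

The hot page is still detected, but almost all of the once-touched pages are invisible to the sampler, which mirrors the accuracy loss described above.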
We study a fundamentally new method for data management
on HM. Inspired by the trend of using domain-specific
knowledge for hardware (e.g., AI accelerators [20, 35, 52, 60]
and Anton for molecular dynamics simulation [61]) and software
(e.g., the domain-specific languages Halide [2] and Liszt [22]),
we propose to use the domain knowledge of DNN to direct
data placement. This method provides lightweight profiling,
accurately captures data hotness, triggers data migration in
time, and effectively hides data migration overhead.
3 ANALYSIS AND CHARACTERIZATION
OF MEMORY ACCESSES IN DNN
We analyze and characterize memory accesses in DNN and
use the analysis results to drive our design.
3.1 Profiling Framework
We build a profiling framework for our study. The profiling
framework collects the following information: the number of
main memory accesses per data object (tensor), data object
size and lifetime.
To collect the above information, the profiling framework
includes the support at both OS and application levels. At the
OS level, Sentinel collects the number of memory accesses at
the page level. This is implemented by a software-only solution.
In particular, when a page is tracked for access counting,
Sentinel sets a reserved bit (bit 51) in its PTE (i.e., poisoning
the PTE) and then flushes the PTE from the TLB. When the page is
accessed, a TLB miss occurs and triggers a protection fault.
Sentinel uses a customized fault handler to count this page access,
poison the PTE, and flush it from the TLB again to track the next
page access. PTE poisoning happens only during profiling;
afterwards, no PTE is poisoned and no TLB flush is issued.
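The poison-and-count loop can be modeled in user space as follows. This is a toy simulation under stated assumptions, not kernel code and not Sentinel's implementation: the class `SimulatedMMU` and its methods are hypothetical names, and the page table and TLB are plain Python containers.

```python
RESERVED_BIT = 1 << 51  # the reserved PTE bit mentioned in the text

class SimulatedMMU:
    """Toy model of poison-and-count profiling (assumption: not kernel code)."""

    def __init__(self):
        self.pte = {}     # page number -> PTE flag bits
        self.tlb = set()  # pages that currently have a cached translation
        self.counts = {}  # page number -> number of faults observed

    def track(self, page):
        """Poison the PTE and flush the TLB entry so the next access faults."""
        self.pte[page] = self.pte.get(page, 0) | RESERVED_BIT
        self.tlb.discard(page)
        self.counts.setdefault(page, 0)

    def access(self, page):
        if page in self.tlb:
            return  # TLB hit: no page walk, so nothing is counted
        if self.pte.get(page, 0) & RESERVED_BIT:
            self.counts[page] += 1  # fault handler counts the access
            self.tlb.discard(page)  # PTE stays poisoned, TLB stays flushed
        else:
            self.tlb.add(page)      # normal miss: cache the translation

    def stop_profiling(self):
        """Clear the poison bit everywhere; later accesses run at full speed."""
        for page in self.pte:
            self.pte[page] &= ~RESERVED_BIT

mmu = SimulatedMMU()
mmu.track(7)
for _ in range(3):
    mmu.access(7)
mmu.stop_profiling()
mmu.access(7)
print(mmu.counts[7])
```

Every access during profiling takes the fault path, which is why the method is exact but expensive, and why it is confined to the profiling phase.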
To bridge the semantic gap between OS and application,
Sentinel allocates memory such that each memory page holds
only one data object (a data object can span more than one
page). Using this method, page-level
profiling becomes data object-level profiling. Such memory
allocation does not change memory access patterns captured
by the hardware caching mechanism in the cache hierarchy,
hence providing reliable estimation on memory accesses in
main memory. Such memory allocation increases memory
footprint. But it happens during the profiling phase of Sen-
tinel on slow memory. After the profiling phase, data objects
are re-organized to reduce memory footprint and improve
performance. Data reorganization happens during memory al-
location (see Section 4.2), and hence does not stop the training
process and does not impact performance. Also, the profiling
method does not increase the consumption of fast memory.
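The footprint cost of "one data object per page" follows from rounding every object up to whole pages. The sketch below illustrates this with hypothetical object names and sizes; `profiling_layout` is an illustrative helper, not part of Sentinel.

```python
PAGE_SIZE = 4096  # bytes; the 4KB page size used in the text

def pages_for(size_bytes):
    """Whole pages needed when an object does not share pages with others."""
    return max(1, -(-size_bytes // PAGE_SIZE))  # ceiling division

def profiling_layout(object_sizes):
    """Give each object a page-aligned offset so that no page is shared.

    Returns (layout, footprint), where layout maps a name to
    (byte offset, number of pages).
    """
    layout, next_page = {}, 0
    for name, size in object_sizes.items():
        layout[name] = (next_page * PAGE_SIZE, pages_for(size))
        next_page += pages_for(size)
    return layout, next_page * PAGE_SIZE

# Hypothetical objects: two small tensors and one that spans three pages.
sizes = {"bias": 256, "scale": 64, "activations": 10000}
layout, footprint = profiling_layout(sizes)
print(layout, footprint, sum(sizes.values()))
```

Here 10,320 bytes of data occupy 20,480 bytes during profiling, the same kind of inflation that small objects cause in the profiling step.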
At the application level, Sentinel leverages memory allocation
and deallocation to get the size and lifetime of data
objects. Furthermore, Sentinel introduces an API that allows the
user to annotate a DNN to indicate the end of each layer.
Based on the above infrastructure, Sentinel is able to associate
a data object with the DNN model topology (i.e., we
know in which layer(s) a data object is alive). Setting up the
association helps direct data migration (Section 4.4).
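A minimal sketch of how allocation hooks plus an end-of-layer annotation could yield lifetimes in units of layers is shown below. The class and method names (`LifetimeTracker`, `on_alloc`, `on_free`, `end_layer`) are hypothetical; the paper does not give the actual API.

```python
class LifetimeTracker:
    """Record, per data object, the number of layers in which it is alive."""

    def __init__(self):
        self.layer = 0      # index of the layer currently executing
        self.born = {}      # live object -> layer in which it was allocated
        self.size = {}      # object -> size in bytes
        self.lifetime = {}  # freed object -> lifetime in layers

    def on_alloc(self, obj, size):
        self.born[obj] = self.layer
        self.size[obj] = size

    def on_free(self, obj):
        # An object allocated and freed inside one layer has lifetime 1.
        self.lifetime[obj] = self.layer - self.born.pop(obj) + 1

    def end_layer(self):
        """User annotation marking the end of the current layer."""
        self.layer += 1

tracker = LifetimeTracker()
tracker.on_alloc("tmp", 128)
tracker.on_alloc("weights", 1 << 20)
tracker.on_free("tmp")
tracker.end_layer()
tracker.end_layer()
tracker.on_free("weights")
print(tracker.lifetime)
```

With this bookkeeping, an object whose lifetime is 1 can be classified as short-lived and all others as long-lived.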
Our profiling method uses only one training step for profiling.
During the profiling, Sentinel captures each page read
and write by repeatedly poisoning the page. This is expensive
because of system calls and TLB misses. However, it does
not lose profiling accuracy. Also, considering that a typical
DNN training involves millions of training steps, the profiling
overhead is easily amortized.
[Figure 1 (bar chart). x-axis: lifetime in layers (1, (1,8], [9,16],
[17,24], [25,32], [33,40], [41,48], [49,56], [57,64], >64); left y-axis:
number of data objects (large and small); right y-axis: total size (MB).]
Figure 1: Distribution of lifetime of data objects and their sizes
for ResNet_v1-32. Each small data object is smaller than 4KB; Each
large data object is no smaller than 4KB. “>64” means that the data
object survives more than one forward and backward pass (one
training step).
[Figure 2 (bar chart with cumulative line). x-axis: number of main
memory accesses per data object; left y-axis: number of data objects;
right y-axis: cumulative % of data objects. Objects per bin (total
size, cumulative %): [1,10): 20320 (907MB, 52%); [10,20): 6942 (213MB,
70%); [20,30): 1891 (87MB, 75%); [30,40): 2003 (83MB, 80%); [40,50):
1680 (38MB, 85%); [50,60): 991 (54MB, 87%); [60,70): 572 (30MB, 89%);
[70,80): 597 (21MB, 90%); [80,90): 533 (15MB, 92%); [90,100): 341
(39MB, 92%); 100+: 2953 (4MB, 100%).]
Figure 2: Distribution of the number of main memory accesses at
the data object level.
The traditional profiling methods face a fundamental dilemma
between profiling overhead and accuracy. In particular, fre-
quently collecting memory access information brings high
profiling accuracy at the cost of large runtime overhead, and
vice versa [10, 30, 36, 70, 71]. Leveraging the repetitiveness of
DNN training, Sentinel breaks the dilemma, and enables both
high profiling accuracy and low profiling overhead.
3.2 Profiling Results and Analysis
We use the profiling framework to study data objects and their
access patterns in DNN. We report profiling results for one
training step in this section.
Figure 1 shows the distribution of lifetime of data objects
and their accumulated sizes for ResNet_v1-32 (the configura-
tion of training is in Table 3). ResNet_v1-32 has 64 layers (in a
forward and backward pass). A data object is alive after it is
allocated and before it is freed. The lifetime of a data object is
defined in terms of number of layers where the data object is
alive. Figure 1 shows that 92% of data objects have lifetime no
longer than one layer. Among those short-lived data objects,
98% are small data objects (smaller than 4KB).
Observation 1: There are a large number of small data ob-
jects with short lifetime in DNN workloads.
In the rest of the paper, we define short-lived data objects
as those with lifetime no longer than one layer.
[Figure 3 (bar chart with cumulative line). x-axis: number of main
memory accesses per data object; left y-axis: number of data objects
smaller than 4KB; right y-axis: cumulative % of data objects. Objects
per bin (total size): [1,10): 19575 (3.9MB); [10,20): 6921 (16KB);
[20,30): 1887 (5KB); [30,40): 1968 (2KB); [40,50): 1677 (3KB); [50,60):
823 (3KB); [60,70): 527 (4KB); [70,80): 593 (7KB); [80,90): 529 (8KB);
[90,100): 231 (8KB); 100+: 2953 (3.8MB). Cumulative %: 52, 70, 75, 80,
85, 87, 89, 90, 92, 92, 100.]
Figure 3: Distribution of the number of main memory accesses at
the data object level for small data objects (each is smaller than
4KB).
Figure 2 shows the distribution of the number of main
memory accesses at the data object level. The figure shows
that a large number of data objects (52.3% of data objects, using
907 MB, which is 54% of total memory pages) are accessed fewer
than 10 times. Among them, 98% are small (smaller than
4KB) and use only 3.9 MB in total, as shown in Figure 3. On the
other hand, some data objects are frequently accessed (having
>100 accesses), taking only 4 MB (0.2% of total memory pages).
They are the candidates to be placed into fast memory, and
their size is a small portion of total memory pages.
Observation 2: The uneven distribution of hot and cold
data objects in DNN provides opportunities for data manage-
ment.
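The cumulative percentages in Figure 2 can be reproduced from the per-bin object counts printed in that figure. The short calculation below uses only those counts; the variable names are illustrative.

```python
from itertools import accumulate

# Per-bin object counts as printed in Figure 2 (bins by number of accesses).
bins = ["[1,10)", "[10,20)", "[20,30)", "[30,40)", "[40,50)", "[50,60)",
        "[60,70)", "[70,80)", "[80,90)", "[90,100)", "100+"]
counts = [20320, 6942, 1891, 2003, 1680, 991, 572, 597, 533, 341, 2953]

total = sum(counts)
cumulative_pct = [100 * running / total for running in accumulate(counts)]

for label, pct in zip(bins, cumulative_pct):
    print(f"{label:>9}: {pct:5.1f}%")
```

The first bin works out to 20320 / 38823, or about 52.3%, which matches the share of rarely accessed data objects quoted in the text.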
Table 1: Memory consumption (in one training step) in the original
execution and using “one data object per page” in the profiling
step. “prof.” stands for “profiling”.
Memory consumption               In prof.    Orig. exe.
All data objects                 1.97 GB     1.57 GB
Data objects smaller than 4KB    152 MB      0.45 MB
Table 1 shows memory consumption for two cases: (1) the
original execution and (2) using “one data object per page” in
the profiling step. In the original execution, small data objects
take only 0.45 MB, but using one data object per page, they
take 152 MB. This indicates that small data objects commonly
share pages with other data objects.
Figure 4 shows the distribution of the number of main mem-
ory accesses at different levels, including at the data object
level already shown in Figures 2 and 3, and page level in the
original execution. The figure shows that for less frequently
accessed data objects (having 1-10 accesses), the total size of
data objects (907 MB) is larger than the total page size (763
MB) in the original execution.
This result is interesting, because if live data objects fell
into the same pages, the size of the data objects should be
smaller than or equal to the size of the pages. Our result contradicts
this reasoning, which suggests that some data objects
actually do not fall into those 763 MB of pages in the original
execution. This means that in the original execution, those data objects
fall into other pages that are counted as more frequently
accessed. In other words, those data objects share pages
with other data objects that may have different preferences for
data placement. We refer to this effect as page-level false
sharing in the rest of the paper.
Observation 3: Page-level false sharing exists in DNN. The
page-level profiling (not data object-level) for data manage-
ment can be misleading because of page-level false sharing.
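Page-level false sharing can be made concrete with a small sketch. The object names, offsets, and access counts are hypothetical, and charging an object's accesses to every page it touches is a simplification that is exact only for objects within one page.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes

def page_level_counts(objects):
    """objects maps name -> (offset, size, accesses); returns page -> accesses.

    Simplification: an object's accesses are charged to every page it touches.
    """
    pages = defaultdict(int)
    for offset, size, accesses in objects.values():
        first = offset // PAGE_SIZE
        last = (offset + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            pages[page] += accesses
    return dict(pages)

# Hypothetical layout: a hot and a cold object share page 0.
objects = {
    "hot": (0, 512, 500),
    "cold": (512, 256, 2),
    "cold2": (4096, 128, 3),
}
counts = page_level_counts(objects)
print(counts)
```

A page-level profiler sees page 0 as hot, so `cold` would be kept in fast memory alongside `hot` even though it is accessed only twice.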
[Figure 4 (grouped bar chart). x-axis: number of main memory accesses,
in bins [1,10), [10,20), ..., [90,100), >100; y-axis: memory size (MB).
Total page size per bin: 763, 299, 146, 119, 89, 85, 71, 52, 43, 59, 53.
Total size of all data objects per bin: 907, 213, 87, 83, 38, 54, 30,
21, 15, 39, 4.]
Figure 4: Distribution of the number of main memory accesses at
the levels of pages (in the original execution), data objects, and
small data objects.
4 DESIGN
4.1 Overview
Sentinel consists of multiple components, shown in Figure 5.
The dynamic profiling component collects memory access
information at the data object level, and determines the lifetime
of data objects based on customized memory allocation and
limited user annotation. The dynamic profiling only uses one
training step to collect the information. After that, Sentinel
re-organizes memory allocation for short-lived data objects to
facilitate data management and avoid page-level false sharing.
Driven by the profiling results, we treat short-lived and
long-lived data objects separately. Short-lived data objects are
allocated in a contiguous memory space in fast memory, and
are not involved in data movement between fast and slow
memories. This method avoids inefficient data movement due
to short liveness.
To handle long-lived data objects, Sentinel uses an adaptive
migration algorithm. The algorithm partitions the training
process in a training step into migration intervals, based on
the DNN model topology. In a migration interval, Sentinel
migrates data objects needed for the next interval, overlap-
ping application execution with data migration. During data
migration, Sentinel must determine an appropriate migration
interval, such that the data objects can be timely migrated from
slow to fast memory before they are needed by application
execution. We formulate the problem and determine the optimal
migration interval. We also use a test-and-trial algorithm
to determine, when migration cannot finish in time, whether
continuing the migration leads to better performance.
In general, Sentinel uses the following domain knowledge
to enable high performance of DNN training.
• Repetitiveness of DNN training for profiling and predicting
memory access patterns;
• The liveness of data objects (tensors) within and across
layers to decide data migration;
• The DNN model topology (i.e., layers) and its depth to
decide the optimal migration interval and trigger data
migration.
Figure 5: Overview of Sentinel. “DO” stands for “data object”. The
white and shaded boxes represent functionality and mechanisms,
respectively.
4.2 Dynamic Profiling and Data
Reorganization
Sentinel integrates the profiling framework in Section 3.1 into
the TensorFlow runtime system. We favor dynamic profiling
over static profiling, although the static dataflow graph is
known before the training starts, because thread-level parallelism
within an operation and across operations cannot be
captured by static profiling. Such parallelism has a significant
impact on data locality.
Based on the profiling results, Sentinel uses a customized
memory allocation for the remaining training steps. In particular,
short-lived data objects that have similar memory
access patterns (including the number of accesses and the memory
allocation and deallocation times) are allocated into the same
page, in order to avoid page-level false sharing and reduce
TLB misses. This is implemented by associating a bit string
with each data object. The bit string indicates in which layers the
data object is accessed. Data objects that have the same bit
string are grouped. Data objects falling into the same group
are sorted by number of memory accesses. The data
objects in a group are allocated and packed into the same set
of pages, in increasing order.
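The grouping and packing steps can be sketched as follows. This is one possible reading of the description, with hypothetical names (`layer_bits`, `pack_short_lived`) and the added assumption that each object fits within a single 4KB page.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes

def layer_bits(layers, num_layers):
    """Bit string with a '1' at every layer in which the object is accessed."""
    return "".join("1" if i in layers else "0" for i in range(num_layers))

def pack_short_lived(objects, num_layers):
    """objects maps name -> (size, accesses, layers).

    Returns bit string -> list of pages, each page a list of object names.
    Assumption: every object is no larger than one page.
    """
    groups = defaultdict(list)
    for name, (size, accesses, layers) in objects.items():
        groups[layer_bits(layers, num_layers)].append((accesses, name, size))

    packed = {}
    for bits, members in groups.items():
        pages, used = [[]], 0
        for accesses, name, size in sorted(members):  # increasing access count
            if used + size > PAGE_SIZE and pages[-1]:
                pages.append([])  # current page is full: start a new one
                used = 0
            pages[-1].append(name)
            used += size
        packed[bits] = pages
    return packed

objects = {
    "a": (1024, 3, {0}),
    "b": (2048, 1, {0}),
    "c": (2048, 5, {0}),
    "d": (512, 2, {1}),
}
packed = pack_short_lived(objects, num_layers=2)
print(packed)
```

Objects `a`, `b`, and `c` share the bit string `10` and are packed together in order of access count, while `d` is kept on a separate page because it is accessed in a different layer.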
Furthermore, Sentinel preallocates a memory pool to meet
the memory allocation requests for short-lived data objects.
Since those data objects are frequently allocated and freed, us-
ing the memory pool can avoid repeatedly returning memory
to the system, mitigating unnecessary overhead.
4.3 Handling Short-Lived Data Objects
During DNN training, a single short-lived data object is not
accessed many times (e.g., fewer than 10 times in ResNet) in
main memory, compared to many long-lived data objects.
Hence, the data placement of a specific short-lived data object
has a negligible impact on the performance of DNN training.
However, our profiling results show that there are a
large number of short-lived data objects throughout the whole
training process, and they share the same memory access characteristics
(i.e., short liveness, small size, and a small number
of accesses in main memory). We must therefore use a general policy to
manage them.
We use the following algorithm to manage short-lived data
objects. We allocate a contiguous memory space in fast memory
for short-lived data objects. Data objects in this space are
never considered for migration. This space is reused for short-lived
data objects, as they are allocated and freed throughout
the training steps. The space is allocated at the beginning of
each migration interval to accommodate short-lived data objects
in the interval. Doing this, Sentinel guarantees that there
is always memory space for short-lived data objects (i.e., no
competition from long-lived data objects), because the placement
of short-lived data objects is critical for performance.
Within a migration interval, the space is dynamically shrunk
to free space for long-lived tensors when a memory page in
the space is freed. We collect short-lived data objects in this
memory space, such that short-lived data objects allocated
and accessed at similar times can be placed into the
same page to avoid page-level false sharing.
The above method addresses the limitations of the existing
methods that use a caching algorithm [30, 36, 57, 74, 77]
or count the number of memory accesses within a time
window [10, 71]. They move short-lived data objects to slow
memory, even though they are not accessed any more. This
has two problems: (1) Unnecessary data movement causes
performance loss and wastes memory bandwidth; (2) Short-
lived data objects unnecessarily stay longer in fast memory,
wasting valuable fast memory space. This is because making
the decision on the movement of short-lived data objects takes
some time, due to the necessity of collecting memory access in-
formation to run the caching algorithm. In addition, counting
the number of memory accesses for individual data objects
can be inaccurate, because they can share memory pages and
the number of memory accesses to each data object is small.
Using our algorithm based on the DNN domain knowledge,
we do not have the above limitation.
In our design, fast memory is always large enough to host
short-lived data objects. If not, short-lived data objects will be
frequently moved between fast and slow memories. This data
movement is highly inefficient in terms of both performance
and energy efficiency, especially for data objects with a short
lifetime. Hence, we assume that the fast memory size is
larger than the peak memory consumption of those short-lived
data objects. We have a discussion on the fast memory size in
Section 4.5.
Since short-lived data objects are frequently allocated and
freed and we reuse the same memory space to host them, the
size of memory space for short-lived data objects is small, and
typically bounded by a few GB.
4.4 Adaptive Data Migration
We migrate data for those long-lived data objects. The data
migration is controlled by the migration interval. The migration
interval determines how frequently we migrate data between
fast and slow memories. A training step is partitioned into
many equal-sized migration intervals. Figure 6 generally de-
picts data migration.
[Figure 6 (timeline diagram). A sequence of layers grouped into
Migration Interval A (n layers) and Migration Interval B (n layers).
Annotations: trigger migration for B (from S to F); trigger migration
for after-B (from S to F); trigger migration to save space (from F to S).]
Figure 6: Data migration based on the migration interval. “S” and
“F” stand for slow and fast memories respectively.
Data migration from slow to fast memory is triggered at the
beginning of each interval, aiming to prefetch data objects
needed by the next interval into fast memory before the next
interval starts. The data migration proceeds during the
interval, in order to overlap data migration with DNN
training as much as possible, such that the overhead of data
migration is removed from the critical path.
Data migration from fast to slow memory is triggered and
happens in the middle of the interval, when the long-lived
data object is not accessed in the interval. Such data migration
is used to save the space of fast memory as much as possible,
in order to accommodate upcoming data migration.
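The two migration directions can be combined into a per-interval plan, sketched below. The function `plan_migration` and its inputs are hypothetical. The sketch also assumes that the first interval's data already resides in fast memory, and that an object needed by the next interval is not evicted; both are simplifications not stated in the text.

```python
def plan_migration(layer_uses, interval):
    """Plan prefetches and evictions for long-lived objects.

    layer_uses[i] is the set of objects accessed in layer i; `interval` is
    the migration interval in layers. Returns one dict per interval with the
    objects to prefetch (slow to fast) and to evict (fast to slow).
    """
    chunks = [set().union(*layer_uses[i:i + interval])
              for i in range(0, len(layer_uses), interval)]
    resident = set(chunks[0])  # assumption: first interval starts in fast memory
    plan = []
    for k, needed in enumerate(chunks):
        upcoming = chunks[k + 1] if k + 1 < len(chunks) else set()
        # Evict what this interval does not access, unless it is needed next.
        evict = resident - needed - upcoming
        # Prefetch what the next interval needs and fast memory lacks.
        prefetch = upcoming - resident
        resident = (resident - evict) | prefetch
        plan.append({"prefetch": prefetch, "evict": evict})
    return plan

layer_uses = [{"w1"}, {"w1", "w2"}, {"w2"}, {"w3"}, {"w4"}, {"w4"}]
plan = plan_migration(layer_uses, interval=2)
for k, step in enumerate(plan):
    print(k, step)
```

Because the plan is expressed in layers rather than in time, each prefetch has a whole interval of computation to overlap with.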
We define the migration interval in terms of layers in DNN,
not in terms of execution time, for the following three
reasons. First, the layer-based migration interval naturally
guarantees the completion of operations at the end of the
interval, because no operation runs across layers. The time-
based migration interval cannot guarantee that, which brings
inevitable synchronization between application execution and
data migration, causing performance loss. Using the DNN do-
main knowledge (i.e., the layers), we avoid the above problem.
Second, each layer is associated with a computation phase
that shows a memory access pattern. The layer-based migra-
tion interval allows us to easily leverage the memory access
patterns collected at the profiling phase to guide data migra-
tion. Third, the time-based migration imposes challenges on
deciding which operations are in which migration interval,
because of operation-level parallelism.
Determining an appropriate migration interval is chal-
lenging. If the migration interval is either too large or too
small, we cannot achieve the best performance. Figure 7 shows
the performance when we use different migration intervals
for training ResNet_v1-32 with 1GB fast memory. The figure
reveals that the performance is very sensitive to the migration
interval. There is 21% performance variance when we change
it from 5 to 11. When the migration interval is 8, we achieve
the best performance. Hence, determining an appropriate mi-
gration interval is critical for performance.
We analyze the trade-off between large and small migration
intervals as follows. If the migration interval is large, then
the data to migrate for this interval is large. The migration
interval cannot be too large; otherwise, the data to migrate
can be larger than the available space in fast memory. This
constraint on the migration interval is the space constraint,
formulated in Equation 1.
[Figure 7 plot: training throughput (img/sec) and normalized training throughput for MI=5 through MI=11. The normalized throughput labels read, in order, 0.76, 0.82, 0.91, 0.92, 0.89, 0.77, 0.72; "SP" marks the sweet spot.]
Figure 7: Performance (training throughput) variance as we change
the migration interval (MI). “SP” stands for sweet spot (the optimal
migration interval).
If the migration interval is small, then the available execu-
tion time to overlap data migration with application execution
is short. The migration interval cannot be too short; otherwise,
the data to migrate cannot be migrated in time from slow to
fast memory before the next migration interval starts. This
constraint on the migration interval is the time constraint, for-
mulated in Equation 2.
In Equations 1 and 2, RS is the fast memory space for short-
lived data objects, S is the fast memory size, and MI stands
for the migration interval. RS is a function of the migration
interval (different migration intervals have different RS). In
Equation 1, Data is the size of data for migration in a migration
interval. In Equation 2, BW is the migration bandwidth from
slow to fast memory, and T is the DNN training time in a
migration interval. Data and T are functions of the migration
interval (different migration intervals have different Data and
T).
Space constraint: Data(MI) < S − RS(MI)    (1)
Time constraint: T(MI) > (S − RS(MI)) / BW    (2)
RS is relatively stable, according to our profiling results.
There is a small variance as we change MI. Hence S − RS(MI)
is near constant. Data(MI) and T(MI) are monotonically in-
creasing functions of MI (i.e., a larger MI indicates larger Data
and T, and vice versa). Hence, Equation 1 establishes the
upper bound and Equation 2 the lower bound on the migration interval.
The two equations, although revealing the inherent trade-
off between small and large migration intervals, cannot reveal
the optimal one, because they do not capture the data move-
ment from fast memory to slow memory. Such data movement
increases the available fast memory space. Because of such
data movement, those migration intervals that meet the two
constraints can perform differently.
We use the following method to determine the optimal
migration interval at runtime. After collecting the profiling
results, we use Equations 1 and 2 to prune the search space
of the migration interval and keep those that meet the con-
straints. Then we run a few more training steps, each of which
employs one of the remaining candidate migration intervals.
We measure their performance, and choose the migration
interval that leads to the best performance.
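The prune-then-measure procedure above can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the function names, the shape of the `profile` dictionary, and the `measure_throughput` callback are hypothetical and do not come from Sentinel's code. The two checks mirror Equations 1 and 2 as written.

```python
def feasible_intervals(profile, fast_size, bandwidth):
    """Keep the migration intervals that satisfy Equations 1 and 2.

    profile maps MI -> (data_bytes, time_sec, rs_bytes), i.e. Data(MI),
    T(MI) and RS(MI) as collected in the profiling step (hypothetical layout).
    """
    keep = []
    for mi, (data, t, rs) in sorted(profile.items()):
        free = fast_size - rs             # S - RS(MI)
        space_ok = data < free            # Equation 1: space constraint
        time_ok = t > free / bandwidth    # Equation 2: time constraint
        if space_ok and time_ok:
            keep.append(mi)
    return keep


def pick_interval(candidates, measure_throughput):
    """Run one trial per candidate and keep the best-performing interval.

    measure_throughput(mi) is assumed to run one training step with
    migration interval mi and return its throughput.
    """
    return max(candidates, key=measure_throughput)
```

Because the constraints only bound the search space, the final choice still relies on measured throughput, which is what lets the procedure account for fast-to-slow data movement that the equations do not capture.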
We encounter three possible data migration cases at the
end of a migration interval. We discuss them as follows. As-
sume that we have two intervals, A and B, and B is right after
A. Sentinel migrates data at the beginning of A for B. At the
end of A, we have three cases.
• Case 1: All data migration has been finished;
• Case 2: Data migration cannot be finished, because fast
memory cannot offer enough free space;
• Case 3: Data migration cannot be finished, because there
is not enough time for migration (there is still space in
fast memory).
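The three cases can be told apart with a simple check at the end of an interval. The sketch below is an assumption-laden illustration in Python: the names `MigrationCase`, `classify`, and `count_cases` are hypothetical, and the rule "the next pending object does not fit in the free fast memory" is one possible way to distinguish a lack of space from a lack of time.

```python
from collections import Counter
from enum import Enum


class MigrationCase(Enum):
    FINISHED = 1       # Case 1: all data migration has been finished
    OUT_OF_SPACE = 2   # Case 2: fast memory cannot offer enough free space
    OUT_OF_TIME = 3    # Case 3: not enough time, space is still available


def classify(pending_sizes, free_fast_bytes):
    """Classify the state at the end of an interval.

    pending_sizes lists the sizes (bytes) of objects still waiting to be
    migrated to fast memory, in migration order.
    """
    if not pending_sizes:
        return MigrationCase.FINISHED
    if pending_sizes[0] > free_fast_bytes:
        return MigrationCase.OUT_OF_SPACE
    return MigrationCase.OUT_OF_TIME


def count_cases(interval_states):
    """Count how often each case occurs over the intervals of a training step."""
    return Counter(classify(p, f) for p, f in interval_states)
```

Counting the cases per training step in this way gives the kind of per-interval statistics that can be compared across migration intervals.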
In Case 1, once B starts, all of the migrated data objects are
in fast memory, which is the ideal case. We must avoid Cases 2
and 3. The migration interval affects how
often the three cases happen. Given a specific fast memory
size, a small interval can create more Case 3, and a large inter-
val can create more Case 2. Figure 8 shows how many times
each case happens, when we use different migration intervals
for training ResNet_v1-32 with 1GB fast memory. When the
migration interval decreases from 11 to 5, Case 3 increases
from 0 to 13; when the migration interval increases from 5 to
11, Case 2 increases from 0 to 4. This result is consistent with
our analysis.
[Figure 8 plot: number of occurrences of Case 1, Case 2, and Case 3 ("Number of cases" axis) and training throughput (img/sec) for MI=5 through MI=11; "SP" marks the sweet spot.]
Figure 8: The occurrences of the three data migration cases in a
training step of ResNet-32. “MI” stands for “migration interval”.
“SP” stands for sweet spot (the optimal migration interval).
To avoid Case 2, long-lived tensors are immediately moved
out of fast memory in the middle of A, once the remaining
operations in A do not need them. This saves fast memory
space. However, avoiding Case 3 is difficult, because it is
created by the limited memory bandwidth and/or latency. In
Case 3, we can either continue migrating data and let B wait
for the completion of data migration, or leave data in slow
memory. Continuing the data migration exposes it on the
critical path, but the execution of B then uses data
in fast memory. In contrast, leaving data in slow memory
avoids the data migration overhead, but B then accesses
the data in slow memory. This is a classic trade-off between
data locality and data movement. To determine which method
leads to the best performance, we use a test-and-trial algorithm.
In particular, whenever Case 3 happens at the end of an
interval, we use one training step to try the continuation of
data migration, and use another training step to try no-data-
migration. We measure the performance of the two methods
and use the best method in the remaining training steps. Note
that in order to compare the performance of the two training
steps, we must ensure that data placement in the two training
steps is the same when Case 3 happens. The same data place-
ment can be easily guaranteed, given the repetitive execution
pattern in DNN training.
The above test-and-trial algorithm does not cause large
overhead, because Case 3 does not happen often and hence
does not need a large number of training steps for test and
trial. The number of training steps used in test and trial is
usually less than 10 (see Table 3).
4.5 Discussions
The lower bound of fast memory size. Although fast memory
can be smaller with Sentinel, there is a lower bound on the fast
memory size below which a large performance loss occurs. This
lower bound is the peak memory consumption of short-lived data
objects in any migration interval plus the size of the largest
long-lived data object. With fast memory smaller than this lower
bound, the runtime system either has to frequently migrate
short-lived data objects or has no space to accommodate
long-lived data objects, which usually
causes performance loss larger than 10%.
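As a worked illustration of this bound, the following Python sketch computes it from profiling data. The function name and the input lists are hypothetical, and the numbers in the usage note are made up for the example.

```python
def fast_memory_lower_bound(short_lived_peak_per_interval, long_lived_sizes):
    """Lower bound on the fast memory size, in the same unit as the inputs.

    short_lived_peak_per_interval: peak consumption of short-lived data
        objects in each migration interval.
    long_lived_sizes: sizes of all long-lived data objects.
    """
    return max(short_lived_peak_per_interval) + max(long_lived_sizes)
```

For example, with per-interval short-lived peaks of 300, 450, and 380 MB and long-lived objects of 64, 512, and 128 MB, the bound is 450 + 512 = 962 MB.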
Handling dynamic graphs. Some machine learning frame-
works, such as PyTorch and TensorFlow 2.0, support dynamic
graphs. Depending on the size of input within a mini-batch,
these frameworks generate a different graph with the right
shape to accommodate the mini-batch. With dynamic graphs,
mini-batches are not identical. Hence, there could be multiple
dataflow graphs.
To handle dynamic graphs, the existing solution pads zeros
at the end of the input [27], such that mini-batches have the same
structure. This transforms a dynamic graph into a static one,
but at the cost of a larger memory footprint and unnecessary
computation. We use a solution similar to the one in [63] that
uses bucketed profiling. In particular, Sentinel bucketizes the
input sizes into a small number of buckets (at most 10 in Sentinel),
and the inputs in each bucket have similar graphs. Sentinel profiles
each bucket to collect memory access information and decide data
migration.
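Bucketed profiling can be sketched as follows. This is a minimal Python sketch under assumptions: the helper names `make_buckets` and `bucket_for`, and the choice of evenly spaced boundaries over the observed input sizes, are illustrative and are not taken from Sentinel or from [63].

```python
import bisect


def make_buckets(sizes, max_buckets=10):
    """Return sorted upper bounds that split observed input sizes into buckets."""
    uniq = sorted(set(sizes))
    if len(uniq) <= max_buckets:
        return uniq
    n = len(uniq)
    # Evenly spaced boundaries; the last boundary is always the largest size.
    return [uniq[(i + 1) * n // max_buckets - 1] for i in range(max_buckets)]


def bucket_for(size, bounds):
    """Map an input size to the upper bound of the bucket that contains it."""
    idx = bisect.bisect_left(bounds, size)
    if idx == len(bounds):
        raise ValueError("input size exceeds the largest bucket")
    return bounds[idx]
```

A runtime could then keep one profiling result per bucket bound and reuse it for every mini-batch whose input size maps to that bound.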
Handling control dependencies. A static graph can have
control flow. Depending on the value of input in a mini-batch,
the graph can have different dataflow, causing different mem-
ory access patterns. Sentinel handles this case by tracking
dataflow. Whenever a new dataflow is encountered, Sentinel
triggers profiling and makes the decision of data migration
again.
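One way to track dataflow is to fingerprint the sequence of executed operations and profile only when the fingerprint is new. The Python sketch below is an assumption: the `ProfileCache` class, the use of a SHA-256 hash over operation names, and the reuse of earlier decisions for a dataflow seen before are illustrative choices, not details given in the text.

```python
import hashlib


class ProfileCache:
    """Re-profile only when a previously unseen dataflow is encountered."""

    def __init__(self, profile_fn):
        # profile_fn(op_sequence) is assumed to profile one training step
        # and return the resulting migration decisions.
        self._profile_fn = profile_fn
        self._plans = {}

    @staticmethod
    def signature(op_sequence):
        """Fingerprint a dataflow by the names of its operations, in order."""
        joined = "\n".join(op_sequence).encode("utf-8")
        return hashlib.sha256(joined).hexdigest()

    def plan_for(self, op_sequence):
        key = self.signature(op_sequence)
        if key not in self._plans:
            # New dataflow: trigger profiling and decide migration again.
            self._plans[key] = self._profile_fn(op_sequence)
        return self._plans[key]
```

Caching by fingerprint keeps the profiling cost proportional to the number of distinct dataflows rather than to the number of mini-batches.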
5 IMPLEMENTATION
We implement Sentinel in Linux v4.9 and TensorFlow v1.14.
We change the Linux kernel for memory profiling, and we change
the TensorFlow runtime system for page migration. The statistics
of the kernel modification given by git diff are 17 files
changed, 587 insertions(+), 18 deletions(-); the statistics of the
TensorFlow modification given by git diff are 33 files changed,
2425 insertions(+), 37 deletions(-).
Figure 9: Overview of Sentinel implementation. “DO” in the figure
stands for “data objects”.
Sentinel introduces three APIs to trigger/stop memory
profiling and identify layers: start_profile(),
end_profile(), and add_layer(). start_profile()
triggers a system call to enable tracking of main memory
accesses, and enables tracking of memory allocation/deallocation
to record lifetime information for data objects. add_layer(),
placed at the end of each layer, informs the runtime system
of where each layer is, in order to determine the migration interval.
Adding start_profile() and end_profile() requires only
two lines of changes to the DNN model. Adding add_layer()
requires about 10-100 lines, depending on how many layers there
are in the DNN model. Adding those APIs does not impact
the execution correctness of DNN training.
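The placement of the three calls in a model definition can be sketched as below. Only the API names come from the text; the Python stubs, their empty signatures, and the `build_model` function are assumptions that merely record the call order, since the real calls reach into the modified kernel and TensorFlow runtime.

```python
# Stub versions of the three APIs (assumption: no arguments). They only
# record the call order so that the placement can be inspected.
events = []


def start_profile():
    events.append("start_profile")


def end_profile():
    events.append("end_profile")


def add_layer():
    events.append("add_layer")


def build_model(num_layers):
    """Hypothetical model definition annotated with the three calls."""
    start_profile()                    # one added line: begin profiling
    for i in range(num_layers):
        events.append(f"layer_{i}")    # stands in for defining one DNN layer
        add_layer()                    # one added line per layer, at its end
    end_profile()                      # one added line: stop profiling
```

The sketch shows why the annotation cost grows with the number of layers: the profiling calls are added once, while add_layer() is added once per layer.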
Figure 9 shows some implementation details. After collecting
memory access information from the OS and lifetime information
from the TensorFlow runtime, Sentinel launches three helper
threads: one for information analysis to determine the migration
interval and make migration decisions, one for data migration
from fast to slow memory, and one for the migration in
the opposite direction. The two migration threads work in parallel
to accelerate migration. Sentinel uses the Linux system call
move_pages() to migrate pages.
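The three-thread structure can be sketched with standard Python threads and queues. This is only an illustration under assumptions: `run_helpers`, the queue-based hand-off, and the `None` sentinel are hypothetical, and the migration threads append to a list where a real system would issue the move_pages() system call.

```python
import queue
import threading


def run_helpers(decisions):
    """Run one analysis thread and two migration threads.

    decisions is an iterable of (object_name, direction) pairs, where
    direction is "to_fast" or "to_slow".
    """
    to_fast, to_slow = queue.Queue(), queue.Queue()
    moved = {"to_fast": [], "to_slow": []}

    def analyze():
        # Analysis thread: hand each decision to the matching migration thread.
        for name, direction in decisions:
            target = to_fast if direction == "to_fast" else to_slow
            target.put(name)
        to_fast.put(None)   # sentinel: no more work
        to_slow.put(None)

    def migrate(work, direction):
        # Migration thread: one per direction, so both directions overlap.
        while True:
            name = work.get()
            if name is None:
                break
            moved[direction].append(name)  # placeholder for the page move

    threads = [
        threading.Thread(target=analyze),
        threading.Thread(target=migrate, args=(to_fast, "to_fast")),
        threading.Thread(target=migrate, args=(to_slow, "to_slow")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return moved
```

Using one queue per direction keeps the two migration threads independent, which matches the goal of running both directions in parallel.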
6 EXPERIMENTAL RESULTS
6.1 Methodology
We study HM in a machine with two memory nodes. We use
one as fast local memory and one as slow remote memory.
Table 2 summarizes the hardware we use.
We evaluate five DNN models. Table 3 shows model details.
For ResNet_v2-152, LSTM, and MobileNet models in our eval-
uation, we use the implementations from TensorFlow [7]; For
ResNet_v1-32 and DCGAN, we use [6] and [1] respectively. To
use TensorFlow, the intra-op parallelism (i.e., the number of
threads to run an operation) and inter-op parallelism (i.e., the
maximum number of operations to co-run) are set as 24, which
is the number of physical cores in a socket in our platform.
Table 2: Hardware overview of experimental system.
CPU                2-socket Intel(R) Xeon(R) CPU E5-2670 v3
DRAM               DDR4 - 2133MHz
Fast memory        BW: 34 GB/s, Latency: 87 ns
Slow memory        BW: 19 GB/s, Latency: 182.7 ns
Cross-socket BW    19 GB/s
[Figure 10 plot: training throughput (examples/sec) of FastMem, Sentinel, IAL, and SlowMem for RN(v1), RN(v2), DCGAN, LSTM, and MN; higher is better.]
Figure 10: Performance with Sentinel, IAL and two configura-
tions. “RN(v1)”, “RN(v2)”, and “MN” stand for “ResNet_v1-32”,
“ResNet_v2-152”, and “MobileNet”, respectively.
We compare Sentinel with a state-of-the-art page migration
system from Yan et al. [74]. They introduce a page migration
algorithm based on an existing page replacement mechanism
in the Linux kernel (i.e., the FIFO-based active list [74]). In [74],
they improve the performance of the page migration mech-
anism by using four threads for parallel page copying and
eight threads for concurrent page migration, and they opti-
mize page locations every five seconds. We use the same con-
figuration in our evaluation. Sentinel does not use the page
migration mechanism in [74]. Unless otherwise indicated, the
size of fast memory in our evaluation is equal to 20% of peak
memory consumption in DNN models.
Table 3: DNN for evaluation. “p, m & t” stands for profiling, deter-
mining optimal migration interval, and test-and-trial.
Model            Data set    Batch size    # of training steps for “p, m & t”
ResNet_v1-32     CIFAR-10    128           8
ResNet_v2-152    CIFAR-10    32            5
LSTM             PTB         20            2
DCGAN            MNIST       64            4
MobileNet        CIFAR-10    64            3
6.2 Results
Overall performance. Figure 10 shows the performance of Sen-
tinel and compares it with the improved active list (IAL) for
HM in [74] (a state of the art). The figure shows that the per-
formance difference between Sentinel and the fast memory-
only system is very small (no difference in two models and
at most 8% difference in ResNet_v1-32), while IAL has 17%
performance difference on average (up to 32%). Sentinel
outperforms IAL by 18% on average (up to 37%).
Table 4 shows the number of migrations in Sentinel and
IAL. Compared with IAL, Sentinel has more migrations (88%
more on average). Frequent migrations allow Sentinel to make
best use of fast memory for performance. Also, those migra-
tions are successfully overlapped with DNN training to avoid
performance loss.
Table 4: Number of page migrations in one epoch. “RN(v1)”,
“RN(v2)”, and “MN” stand for “ResNet_v1-32”, “ResNet_v2-152”,
and “MobileNet” respectively.
            RN(v1)     RN(v2)     DCGAN     LSTM      MN
IAL         807308     3432254    211684    194933    144882
Sentinel    2097152    4898697    444846    353500    249290
Table 5: Peak memory consumption with and without Sentinel.
                 w/o Sentinel    w/ Sentinel
ResNet_v1_32     6144 MB         6176 MB
ResNet_v2_152    25600 MB        25856 MB
LSTM             2048 MB         2080 MB
DCGAN            3072 MB         3136 MB
MobileNet        4096 MB         4228 MB
Table 5 shows peak memory consumption before and af-
ter using Sentinel. Although our profiling method increases
memory consumption, it does not increase much (by 2.1% at
most). This is because data objects larger than 4KB dominate
total memory consumption. During the profiling, we do not
significantly increase their memory consumption.
[Figure 11 plot: normalized performance for RN(v1), RN(v2), DCGAN, LSTM, and MN under four strategies: Sentinel, Having false sharing, No space reservation, and No t&t.]
Figure 11: Performance with different strategies for data manage-
ment in Sentinel. “RN” and “MN” stand for “ResNet” and “Mo-
bileNet” respectively. Performance is normalized by the perfor-
mance of the full-featured Sentinel.
Performance breakdown. We apply different strategies for
data management, in order to study the impact of various
techniques. Figure 11 shows the results. In the figure, we show
four strategies: Sentinel without handling page-level false
sharing (labeled “Having false sharing”), Sentinel without
reserving fast memory space for short-lived data objects (la-
beled “No space reservation”), Sentinel without test-and-trial
(labeled “No t&t”), and Sentinel with all techniques.
The figure reveals that among the three (handling page-
level false sharing, reserving fast memory, and test-and-trial),
reserving fast memory space for short-lived data objects is the
most effective one. We easily have 17% - 23% performance
loss without it (compared with the full-featured Sentinel).
Furthermore, because of the pervasiveness of page-level false
sharing, handling false sharing improves performance by 8% -
18%.
[Figure 12 plot: normalized performance for RN(v1), RN(v2), DCGAN, LSTM, and MN with fast memory sized at 10%, 20%, 40%, and 60% of the memory consumption of DNN.]
Figure 12: Performance with Sentinel under various sizes of fast
memory. The fast memory size is shown as the percentage of peak
memory consumption of DNN models. Performance is normalized
by that of the fast memory-only.
Sensitivity study. We change the fast memory size and mea-
sure performance. Figure 12 shows the results. In general,
larger fast memory gives better performance. When the fast
memory size is 60% of peak memory consumption, none of the
DNN models with Sentinel on HM shows any performance
difference from the fast memory-only system. Also,
with Sentinel, performance is not sensitive to fast memory
size: there is at most 8% performance variance when the
fast memory size is changed from 20% to 40% of peak memory
consumption of DNN. This result demonstrates how
Sentinel effectively uses data movement to make best use of
fast memory.
Saving fast memory size. Figure 10 shows that using 20%
of peak memory consumption of DNN models as fast memory
size, Sentinel on HM has almost the same performance (8%
difference at most) as the fast memory-only. This brings 80%
saving in fast memory size. Figure 12 shows that using 60% of
peak memory consumption of DNN models as fast memory
size, there is no performance loss, which comes with 40%
saving in fast memory size.
To further study Sentinel’s effectiveness, we use various
ResNets with various topologies. Different ResNets come with
different peak memory consumption. We report the minimum
fast memory size with which Sentinel performs the same as
the fast memory-only. Figure 13 shows peak memory con-
sumption and fast memory size for all ResNet variants. The
figure shows that although peak memory consumption in-
creases quickly as ResNet becomes more complicated, the fast
memory size increases at a much slower rate. This demon-
strates the effectiveness of using Sentinel to save fast memory
size.
[Figure 13 plot: peak memory consumption (GB) for ResNet-32, ResNet-56, ResNet-110, and ResNet-152. The labels read 6, 9, 26, and 35 for peak memory consumption in ResNet, and 1.2, 2, 6.25, and 7.5 for peak memory consumption of fast memory with Sentinel.]
Figure 13: Comparison between peak memory consumption of
DNN models and fast memory size for ResNet variants.
7 RELATED WORK
Heterogeneous main memory. Many memory technologies [17,
28, 31, 40] have been proposed to build HM. Intel Optane DC
persistent memory plus traditional DDR is an example [11, 32];
High bandwidth memory (HBM) plus DDR in Intel Knights
Landing is another example [17]. Recent research studies data
management on HM using hardware-based approaches [9, 14,
55–57, 66, 72, 75] or OS/software-based approaches [23, 26,
42, 51, 53, 62, 69–71, 74, 76]. The common goal of these studies
is to achieve a high service rate by leveraging fast memory as
much as possible.
Page placement policies and mechanisms. Existing pro-
posals [10, 30, 36, 70, 71, 74] explore various page placement
policies. They commonly profile memory access to determine
page placement. Some work [10, 30, 36] tracks hot pages by
setting and resetting PTE as Sentinel does, but this tracking
mechanism can result in very high runtime overhead. To re-
duce runtime overhead, the existing work commonly limits
the number of pages to profile, which can compromise profil-
ing accuracy. For example, Thermostat [10] only profiles 0.5%
of total memory pages; if each page is profiled, there could be
4x slowdown. Unimem [71] and Tahoe [70] use a hardware
counter-based approach to periodically count main memory
accesses. This method, although being lightweight, can incor-
rectly count the number of memory accesses for short-lived
data objects because of the sampling nature. Unlike the above
work, Sentinel leverages domain knowledge, and hence only
profiles a small portion of total execution (one training step)
without paying large runtime overhead and losing accuracy.
Also, Sentinel associates page-level profiling results with data
objects, making profiling results more meaningful for data
migration.
Yan et al. [74] guide page placement based on an existing
Linux page replacement mechanism. Like Linux, this work
uses two FIFO queues (active list and inactive list) to make
page migration decisions. However, using this design to de-
cide page migration for common short-lived data objects in
DNN can be slow and lacks a global view, which wastes valu-
able fast memory space and causes unnecessary data move-
ment.
In terms of page migration mechanism, Yan et al. [74] use
multi-threaded migration for single pages and concurrent mi-
gration for multiple pages. Bock et al. [15] allow the application to
execute without waiting for the completion of page migration,
by buffering application writes to migrated pages in a hard-
ware buffer. Wang et al. [67] and Seshadri et al. [59] enable fast
page migration by enhancing DRAM architecture. Sentinel
focuses on page migration policy (not mechanism), but the
existing efforts are useful to improve the performance of Sentinel.
Performance optimization for dataflow-based machine learn-
ing frameworks. Performance optimization for such frame-
works has attracted a lot of research effort recently [19, 25, 41, 43–
45, 48–50, 63, 65, 68, 73]. Some of them [18, 34, 43, 44, 63, 73]
leverage the predictability of DNN workloads to guide oper-
ation scheduling and compiler-based performance optimiza-
tion. Our work also leverages the unique characteristics of
DNN, but focuses on data management on HM.
8 CONCLUSIONS
Runtime data management on HM often uses an application-
agnostic approach. It can suffer from high overhead for mem-
ory profiling or low accuracy, cause unnecessary data migra-
tion, and/or have difficulty hiding data migration overhead.
In this paper, we use a new angle to examine the data man-
agement problem. By introducing limited domain knowledge,
we are able to break the fundamental tradeoff between pro-
filing overhead and accuracy, and effectively prefetch data to
fast memory for computation. We also reveal the conflict be-
tween OS and application when handling data migration. By
resolving the conflict, we avoid unnecessary data migration.
We focus on a specific and influential domain, DNN, in our
study, given its importance in modern data centers. Using
Sentinel, DNN training on HM with a small fast memory size
can perform similarly to the fast memory-only system. Also,
Sentinel consistently outperforms a state-of-the-art solution
by 18% on average.
REFERENCES
[1] A TensorFlow Implementation of Deep Convolutional Generative Adver-
sarial Networks. https://github.com/carpedm20/DCGAN-tensorflow.
[2] Halide. https://halide-lang.org/.
[3] Intel Math Kernel Library For Deep Neural Networks (Intel MKL-DNN).
https://github.com/intel/mkl-dnn.
[4] MXNet. https://mxnet.apache.org/.
[5] Pytorch. https://pytorch.org/.
[6] ResNet in TensorFlow. https://github.com/wenxinxu/resnet-in-tensorflow.
[7] TensorFlow models. https://github.com/tensorflow/models.
[8] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu
Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving,
Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath
Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore,
Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner,
Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Va-
sudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale
machine learning on heterogeneous systems, 2015.
[9] Neha Agarwal, David Nellans, Mark Stephenson, Mike O’Connor, and
Stephen W. Keckler. Page Placement Strategies for GPUs within Heteroge-
neous Memory Systems. In International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), 2015.
[10] Neha Agarwal and Thomas F. Wenisch. Thermostat: Application-
transparent page management for two-tiered main memory. In Proceedings
of the Twenty-Second International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS 2017, Xi’an, China,
April 8-12, 2017, pages 631–644, 2017.
[11] M. Arafa, B. Fahim, S. Kottapalli, A. Kumar, L. P. Looi, S. Mandava, A. Rud-
off, I. M. Steiner, B. Valentine, G. Vedaraman, and S. Vora. Cascade lake:
Next generation intel xeon scalable processor. IEEE Micro, 39(2):29–36,
March 2019.
[12] ARM. ARM Compute Library. https://github.com/ARM-software/ComputeLibrary.
[13] Joy Arulraj, Andy Pavlo, and Krishna Teja Malladi. Multi-tier buffer
management and storage system design for non-volatile memory. CoRR,
abs/1901.10938, 2019.
[14] Alan Bivens, Parijat Dube, Michele Franceschini, John Karidis, Luis Lastras,
and Mickey Tsao. Architectural design for next generation heterogeneous
memory systems. In 2010 IEEE Int. Memory Workshop, pages 1–4, 2010.
[15] Santiago Bock, Bruce R. Childers, Rami Melhem, and Daniel Mossé. Con-
current page migration for mobile systems with os-managed hybrid mem-
ory. In Proceedings of the 11th ACM Conference on Computing Frontiers, CF
’14, pages 31:1–31:10, New York, NY, USA, 2014. ACM.
[16] Texas Advanced Computing Center. Stampede2.
https://www.tacc.utexas.edu/systems/stampede2.
[17] D. W. Chang, G. Byun, H. Kim, M. Ahn, S. Ryu, N. S. Kim, and M. Schulte.
Reevaluating the latency claims of 3d stacked memories. In 2013 18th Asia
and South Pacific Design Automation Conference (ASP-DAC), pages 657–662,
Jan 2013.
[18] K. Chen and Q. Huo. Scalable training of deep learning machines by incre-
mental block training with intra-block parallel optimization and blockwise
model-update filtering. In 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 5880–5884, March 2016.
[19] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan,
Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Car-
los Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end
optimizing compiler for deep learning. In 13th USENIX Symposium on
Operating Systems Design and Implementation (OSDI 18), pages 578–594,
Carlsbad, CA, October 2018. USENIX Association.
[20] Y. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An Energy-Efficient
Reconfigurable Accelerator for Deep Convolutional Neural Networks.
IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
[21] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar
Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai,
Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xi-
aobing Liu, and Hemal Shah. Wide & Deep Learning for Recommender
Systems. CoRR, abs/1606.07792, 2016.
[22] Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley,
Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken,
Karthik Duraisamy, Eric Darve, Juan Alonso, and Pat Hanrahan. Liszt: A
Domain Specific Language for Building Portable Mesh-based PDE Solvers.
In International Conference for High Performance Computing, Networking, Stor-
age and Analysis, 2011.
[23] Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan
Sundaram, Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten
Schwan. Data tiering in heterogeneous memory systems. In Proc. 11th
European Conf. Computer Systems (EuroSys ’16), 2016.
[24] Assaf Eisenman, Darryl Gardner, Islam AbdelRahman, Jens Axboe, Siying
Dong, Kim Hazelwood, Chris Petersen, Asaf Cidon, and Sachin Katti.
Reducing DRAM Footprint with NVM in Facebook. In Proceedings of the
Thirteenth EuroSys Conference, 2018.
[25] Yuanxiang Gao, Li Chen, and Baochun Li. Spotlight: Optimizing device
placement for training deep neural networks. In Jennifer Dy and Andreas
Krause, editors, Proceedings of the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Machine Learning Research, pages
1676–1684, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
[26] Michael Giardino, Kshitij Doshi, and Bonnie Ferri. Soft2LM: Applica-
tion guided heterogeneous memory management. In 2016 IEEE Int. Conf.
Networking, Architecture, and Storage (NAS), 2016.
[27] Google. Tensorflow Bucketing. https://www.tensorflow.org/versions/r0.12/.
[28] F. T. Hady, A. Foong, B. Veal, and D. Williams. Platform storage perfor-
mance with 3d xpoint technology. Proceedings of the IEEE, 105(9):1822–1833,
Sep. 2017.
[29] Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement
Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI
Conference on Artificial Intelligence, 2016.
[30] Takahiro Hirofuchi and Ryousei Takano. Raminate: Hypervisor-based
virtualization for hybrid main memory systems. In Proceedings of the
Seventh ACM Symposium on Cloud Computing, SoCC ’16, pages 112–125,
New York, NY, USA, 2016. ACM.
[31] Intel. Revolutionizing Memory and Storage.
https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html.
[32] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman
Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor,
Jishen Zhao, and Steven Swanson. Basic performance measurements of
the intel optane DC persistent memory module. CoRR, abs/1903.05714,
2019.
[33] JEDEC. JESD79-4A: DDR4 SDRAM Standard.
https://www.jedec.org/sites/default/files/docs/JESD79-4A.pdf.
[34] Hai Jin, Bo Liu, Wenbin Jiang, Yang Ma, Xuanhua Shi, Bingsheng He,
and Shaofeng Zhao. Layer-centric memory reuse and data migration for
extreme-scale deep learning on many-core architectures. ACM Trans. Archit.
Code Optim., 15(3):37:1–37:26, September 2018.
[35] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gau-
rav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden,
Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark,
Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir
Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann,
C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian
Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan,
Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon,
James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan
Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran
Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix,
Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps,
Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gre-
gory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing,
Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay
Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun
Yoon. In-Datacenter Performance Analysis of a Tensor Processing Unit. In
International Symposium on Computer Architecture, 2017.
[36] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan. Heteroos — os
design for heterogeneous memory management in datacenter. In 2017
ACM/IEEE 44th Annual International Symposium on Computer Architecture
(ISCA), pages 521–534, June 2017.
[37] S. Kim, H. Kim, J. Lee, S. Yoon, S. E. Kahou, K. Kashinath, and M. Prab-
hat. Deep-Hurricane-Tracker: Tracking and Forecasting Extreme Climate
Events. In IEEE Winter Conference on Applications of Computer Vision (WACV),
2019.
[38] Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda,
Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack
Deslippe, Massimiliano Fatica, Prabhat, and Michael Houston. Exascale
Deep Learning for Climate Analytics. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage, and Analysis,
2018.
[39] Lawrence Berkeley National Lab. Cori Supercomputer.
https://www.nersc.gov/users/computational-systems/cori/.
[40] Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin
Ipek, Onur Mutlu, and Doug Burger. Phase-Change Technology and the
Future of Main Memory. IEEE Micro, 30(1):143–143, 2010.
[41] Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico San-
tambrogio, Markus Weimer, and Matteo Interlandi. PRETZEL: Opening
the black box of machine learning prediction serving systems. In 13th
USENIX Symposium on Operating Systems Design and Implementation (OSDI
18), pages 611–626, Carlsbad, CA, October 2018. USENIX Association.
[42] Felix Xiaozhu Lin and Xu Liu. memif: Towards Programming Heteroge-
neous Memory Asynchronously. In International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[43] J. Liu, H. Zhao, M. A. Ogleari, D. Li, and J. Zhao. Processing-in-memory
for energy-efficient neural network training: A heterogeneous approach.
In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), pages 655–668, Oct 2018.
[44] Jiawen Liu, Dong Li, Gokcen Kestor, and Jeffrey S. Vetter. Runtime Con-
currency Control and Operation Scheduling for High Performance Neural
Network Training. In International Parallel and Distributed Processing Sympo-
sium, 2019.
[45] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Opti-
mizing CNN model inference on cpus. In 2019 USENIX Annual Technical
Conference (USENIX ATC 19), pages 1025–1040, Renton, WA, July 2019.
USENIX Association.
[46] Jon Martindale. RAM has never been cheaper, but are the historic prices
here to stay? https://www.digitaltrends.com/computing/why-is-ram-
so-cheap/.
[47] Amrita Mathuriya, Deborah Bard, Peter Mendygral, Lawrence Meadows,
James Arnemann, Lei Shao, Siyu He, Tuomas Kärnä, Diana Moise, Simon J.
Pennycook, Kristyn Maschhoff, Jason Sewall, Nalini Kumar, Shirley Ho,
Michael F. Ringenburg, Prabhat, and Victor Lee. CosmoFlow: Using Deep
Learning to Learn the Universe at Scale. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage, and Analysis,
2018.
[48] Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Quoc V. Le,
and Jeff Dean. Hierarchical planning for device placement. 2018.
[49] Azalia Mirhoseini, Hieu Pham, Quoc Le, Mohammad Norouzi, Samy
Bengio, Benoit Steiner, Yuefeng Zhou, Naveen Kumar, Rasmus Larsen, and
Jeff Dean. Device placement optimization with reinforcement learning.
2017.
[50] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov,
Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul,
Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerg-
ing AI applications. In 13th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 18), pages 561–577, Carlsbad, CA, Octo-
ber 2018. USENIX Association.
[51] A. Narayan, T. Zhang, S. Aga, S. Narayanasamy, and A. Coskun. Moca:
Memory object classification and allocation in heterogeneous memory
systems. In 2018 IEEE International Parallel and Distributed Processing Sym-
posium (IPDPS), 2018.
[52] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany,
J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for
compressed-sparse convolutional neural networks. In International Sympo-
sium on Computer Architecture (ISCA), 2017.
[53] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure,
and Stefano Markidis. Rthms: A tool for data placement on hybrid memory
system. In Proceedings of the 2017 ACM SIGPLAN International Symposium
on Memory Management, ISMM 2017, 2017.
[54] Tom Petrocelli. Democratization of AI Development Is Begin-
ning. https://www.cmswire.com/digital-workplace/democratization-
of-ai-development-is-beginning/.
[55] Moinuddin K. Qureshi, Michele Franceschini, Vijayalakshmi Srinivasan,
Luis Lastras, Bulent Abali, and John Karidis. Enhancing Lifetime and
Security of PCM-Based Main Memory with Start-Gap Wear Leveling. In
MICRO, 2009.
[56] Moinuddin K. Qureshi, Viji Srinivasan, and Jude A. Rivers. Scalable
High-Performance Main Memory System Using Phase-Change Memory
Technology. In ISCA, 2009.
[57] Luiz Ramos, Eugene Gorbatov, and Ricardo Bianchini. Page placement in
hybrid memory systems. In Proc. Int. Conf. Supercomputing (ICS ’11), 2011.
[58] Gina Roos. Dram Prices Continue to Climb.
https://epsnews.com/2017/08/18/dram-prices-continue-climb/.
[59] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata
Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu,
Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. RowClone:
Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In
IEEE/ACM International Symposium on Microarchitecture, 2013.
[60] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramo-
nian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Sriku-
mar. ISAAC: A Convolutional Neural Network Accelerator with In-situ
Analog Arithmetic in Crossbars. In International Symposium on Computer
Architecture, 2016.
[61] David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey S. Kuskin,
Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J.
Bowers, Jack C. Chao, Michael P. Eastwood, Joseph Gagliardo, J. P. Gross-
man, C. Richard Ho, Douglas J. Ierardi, István Kolossváry, John L. Klepeis,
Timothy Layman, Christine McLeavey, Mark A. Moraes, Rolf Mueller,
Edward C. Priest, Yibing Shan, Jochen Spengler, Michael Theobald, Brian
Towles, and Stanley C. Wang. Anton, a Special-purpose Machine for
Molecular Dynamics Simulation. Communications of the ACM, 51(7), July
2008.
[62] Du Shen, Xu Liu, and Felix Xiaozhu Lin. Characterizing Emerging Hetero-
geneous Memory. In ACM SIGPLAN International Symposium on Memory
Management (ISMM), 2016.
[63] Muthian Sivathanu, Tapan Chugh, Sanjay S. Singapuram, and Lidong
Zhou. Astra: Exploiting Predictability to Optimize Deep Learning. In
International Conference on Architectural Support for Programming Languages
and Operating Systems, 2019.
[64] Vamsi Sripathi and Vikram Saletore. TensorFlow Performance Optimization on Intel Architecture.
https://www.alcf.anl.gov/files/slides%20vamsi%20sripathi%20vikram%20saletore%20ACLF_ANL_07253018_final.pdf.
[65] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya
Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew
Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic
high-performance machine learning abstractions. CoRR, abs/1802.04730,
2018.
[66] Bin Wang, Bo Wu, Dong Li, Xipeng Shen, Weikuan Yu, Yizheng Jiao, and
Jeffrey S. Vetter. Exploring hybrid memory for GPU energy efficiency
through software-hardware co-design. In Proc. 22nd Int. Conf. Parallel
Architectures and Compilation Techniques (PACT ’13), 2013.
[67] H. Wang, J. Zhang, S. Shridhar, G. Park, M. Jung, and N. S. Kim. Duang:
Fast and lightweight page migration in asymmetric memory systems. In
2016 IEEE International Symposium on High Performance Computer Architec-
ture (HPCA), pages 481–493, March 2016.
[68] Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon
Song, Zenglin Xu, and Tim Kraska. Superneurons: Dynamic gpu memory
management for training deep neural networks. In Proceedings of the 23rd
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
PPoPP ’18, pages 41–53, New York, NY, USA, 2018. ACM.
[69] Shasha Wen, Lucy Cherkasova, Felix Xiaozhu Lin, and Xu Liu. Profdp: A
lightweight profiler to guide data placement in heterogeneous memory sys-
tems. In Proceedings of the 2018 International Conference on Supercomputing,
ICS ’18, 2018.
[70] K. Wu, J. Ren, and D. Li. Runtime data management on non-volatile
memory-based heterogeneous memory for task-parallel programs. In
Proceedings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis, 2018.
[71] Kai Wu, Yingchao Huang, and Dong Li. Unimem: Runtime Data Manage-
ment on Non-volatile Memory-based Heterogeneous Main Memory. In
SC, 2017.
[72] Panruo Wu, Dong Li, Zizhong Chen, Jeffrey Vetter, and Sparsh Mittal.
Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile
Memory. In Proc. 25th ACM Int. Symp. High-Performance Parallel and Dis-
tributed Computing, pages 141–152, Kyoto, Japan, 2016.
[73] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Si-
vathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu
Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective
cluster scheduling for deep learning. In 13th USENIX Symposium on Operat-
ing Systems Design and Implementation (OSDI 18), pages 595–610, Carlsbad,
CA, October 2018. USENIX Association.
[74] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. Nimble
page management for tiered memory systems. In Proceedings of the Twenty-
Fourth International Conference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’19, 2019.
[75] HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding,
and Onur Mutlu. Row buffer locality aware caching policies for hybrid
memories. In Proc. IEEE 2012 30th Int. Conf. Computer Design (ICCD ’12),
2012.
[76] Seongdae Yu, Seongbeom Park, and Woongki Baek. Design and Imple-
mentation of Bandwidth-aware Memory Placement and Migration Policies
for Heterogeneous Memory Systems. In Proceedings of the International
Conference on Supercomputing (ICS), 2017.
[77] W. Zhang and T. Li. Exploring Phase Change Memory and 3D Die-Stacking
for Power/Thermal Friendly, Fast and Durable Memory Architectures. In
International Conference on Parallel Architectures and Compilation Techniques
(PACT), 2009.
