NeuMMU: Architectural Support for Efficient Address Translations in
  Neural Processing Units by Hyun, Bongjoon et al.
NeuMMU: Architectural Support for Efficient
Address Translations in Neural Processing Units
Bongjoon Hyun Youngeun Kwon Yujeong Choi John Kim Minsoo Rhu
School of Electrical Engineering
KAIST
{bongjoon.hyun, yekwon, yjchoi0606, jjk12, mrhu}@kaist.ac.kr
Abstract—To satisfy the compute and memory demands of
deep neural networks, neural processing units (NPUs) are widely
being utilized for accelerating deep learning algorithms. Similar
to how GPUs have evolved from a slave device into a mainstream
processor architecture, it is likely that NPUs will become first-
class citizens in this fast-evolving heterogeneous architecture
space. This paper makes a case for enabling address translation
in NPUs to decouple the virtual and physical memory address
space. Through a careful data-driven application characteriza-
tion study, we root-cause several limitations of prior GPU-centric
address translation schemes and propose a memory management
unit (MMU) that is tailored for NPUs. Compared to an oracular
MMU design point, our proposal incurs only an average 0.06%
performance overhead.
I. INTRODUCTION
The complexity of deep neural network (DNN) based deep
learning (DL) algorithms are scaling up rapidly. To meet the
demands of these computation-hungry algorithms, accelerator-
centric systems based on GPUs or custom-designed ASICs for
DNNs, often referred to as neural processing units (NPUs),
are widely being utilized for accelerating DL. Similar to how
GPUs have evolved into a mainstream processor architecture,
it is expected that NPUs will become first-class citizens in
heterogeneous computing platforms due to the increasing
number of application domains it is expected to accelerate.
Based on these trends, an important challenge that arises is
how NPUs should be exposed to the end-user and how the
memory address space be exposed to the NPUs. Traditionally,
the I/O attached accelerators such as GPUs had separate
memory address space than the CPU, forcing programmers
to manage these distinct address regions through manual
memory allocations, data copies, etc. As GPUs evolved into
having a proper memory management unit (MMU) [1], [2],
[3], programmers are now given the illusion of a unified CPU-
GPU memory address [4], [5] allowing CPU and GPU to
share a globally addressable memory regardless of whether
the physical memory is shared or separate. Other key features
enabled by GPU MMUs include memory oversubscription [6],
[7], NUMA [8], [9], [10], and spatial sharing of a single GPU
substrate while supporting page-granularity protections [11].
Unfortunately, these features are yet to be available for NPUs
because they currently do not have an MMU for decoupling
virtual addresses against physical addresses. Consequently,
the range of applications that can utilize these accelerators
is limited. For instance, NPUs cannot page-fault on missing
MLP
MLP
…Bottom
MLP
Dense
Features
Feature Interaction
MLP
MLP
…Top
MLP
…
Sparse 
Features
Sparse 
Features
…EmbeddingTable0 …
Embedding
Table(N-1)
Fig. 1: High-level structure of DNN-based personalized recommen-
dation system using embeddings [13], [14], [15]. Natural language
processing based on attention modules (e.g., BERT [16], [17]),
and memory-augmented NNs [18] follow a similar topological
structure. That is, the application frontend starts by gathering multiple
embeddings from large embedding lookup tables (i.e., “sparse”
memory accesses), followed by conventional, “dense” DNN layers
(e.g., CNN, RNN, and MLP) as the backend processing step.
pages nor can oversubscribe the NPU memory, so the working
set of a target DNN application must precisely fit within the
physical memory capacity: otherwise, the runtime crashes [12].
Given this landscape, we argue that future NPUs will
need dedicated architectural support for virtual-to-physical
address translation services. Conventional wisdom in DNN’s
memory access characteristics is that they are highly regular
and predictable, allowing the compiler to optimally decide how
much memory to allocate and when/where to read/write data
for NPU processing, obviating the need for fine-grained, page-
level translations or NUMA capabilities. While such property
held true for regular, “dense” DNNs (e.g, convolutional,
recurrent, and multi-layer perceptrons, CNNs/RNNs/MLPs),
emerging DL workloads employing embedding layers exhibit a
highly “sparse”, irregular, and random memory access pattern
over a large embedding lookup table (Figure 1). Recent studies
from Baidu and Facebook [19], [20] state that their production-
level DL workloads using embeddings already reached close
to 100 GBs of memory footprint because of these embedding
tables, even for inference (Section III-A). Because current
NPUs incorporate only tens of GBs of local memory, these
memory-limited DL applications must partition its memory
usage across CPU−NPU (or NPUm−NPUn under multi-
NPU systems [21], [22]), incurring frequent CPU↔NPU (or
ar
X
iv
:1
91
1.
06
85
9v
1 
 [c
s.A
R]
  1
5 N
ov
 20
19
NPU↔NPU) data transfers. Because of the irregular, random
memory access nature of embedding layers, an MMU-less
NPU must rely on the CPU to manually orchestrate data
transfers on its behalf, experiencing significant performance
overheads (Section V). Similar to how demand paging or
NUMA has been a crucial component in CPUs (and now GPUs,
especially under CPU-GPU [8] or multi-GPU [9] systems),
we believe that a robust NPU address translation service will
enable a more diverse range of emerging, memory-hungry
DL applications to be seamlessly executed on NPUs without
falling into the pitfalls and performance overheads of manual
management of NPU physical memory.
To this end, this paper explores the design space of NPU
MMUs and identifies and addresses the unique challenges in
adding architectural support for address translation based on
a data-driven approach. Our study consists of two main parts:
1) Design space exploration of an NPU MMU that enables
robust address translation for conventional dense DNN
layers (Section IV)
2) Highlighting the usefulness of an NPU MMU in ef-
ficiently handling sparse embedding layers via fine-
grained NUMA access or page-migration (Section V)
As such, we start by first employing prior GPU-centric
MMU solutions [1], [3] that utilize I/O memory management
units (IOMMUs) to handle NPU address translations for
conventional, dense DNNs. Interestingly, our analysis shows
that, due to the fundamental architectural differences between
GPUs and NPUs, a naive IOMMU address translation incurs
significant performance overhead even for these dense DNNs.
Concretely, while GPUs commonly use the on-chip SRAM for
register-files and caches, NPUs almost exclusively utilize its
SRAM for software managed scratchpads. The activations and
weights the NPUs operate on are typically multi-dimensional
tensors, mapped to a traditional, linear (1D) memory sub-
system. These tensors are much larger than the scratchpad,
so the DMA unit blocks the activations/weights into tiles
and sequence them across multiple (tile) fetch operations via
double-buffering. As these tiles are also multi-dimensional
tensors, fetching them into the scratchpad involves projecting
the multi-dimensional coordinates into the linear space of
DRAM memory. A single tile is therefore decomposed into
minimum number of linearized memory transactions, which
can be up to several thousands, because a tile is sized at several
MBs to maximally utilize the scratchpad. Consequently, a
single tile fetch invokes significant bursts of page translations
that conventional MMUs fail to effectively capture, leading
to an average 95% performance overhead.
Overall, we observe that the bursty nature of scratchpad-
based NPU address translation traffic renders the translation
throughput of baseline IOMMU’s (multiple) page-table walkers
(PTWs) a key performance bottleneck. As a result, unlike
GPUs which are optimized for translation locality [1], [3],
we argue that NPU MMUs should be designed for high
translation throughput first and locality second. To this respect,
we propose a throughput-centric NPU MMU (NeuMMU) design
that effectively handles high burst of translation requests.
NeuMMU is designed to reduce the address translation overhead
by leveraging the deterministic memory access behavior of
dense DNNs and the inherent translation locality therein.
Concretely, while TLBs are not as effective for NPUs than
CPUs or GPUs, we identify how translation burst locality
exists within a given tile fetch in DNNs. To capture such
intra-tile translation locality, we first propose our novel
pending request merging buffer (PRMB) microarchitecture
as a translation bandwidth filtering mechanism to reduce
the number of distinct page-table walks concurrently in-
flight, boosting effective translation bandwidth. In addition
to the PRMB microarchitecture, the MMU also requires high
concurrent address translations and we evaluate the need for
a larger number of PTWs to maximize translation throughput.
Unfortunately, the large amount of translations can incur
significant power overheads due to the additional memory
accesses involved in the translation process. We make the key
observation that dense DNN layers exhibit a regular dataflow,
rendering its memory access patterns to be highly deterministic
with only a handful of key data structures being manipulated
(i.e., input/output activations, weights). This allows us to
employ a lightweight translation path “register”, unlike the
traditional MMU cache [23], that effectively filters down on the
number of translation-invoked DRAM accesses (a reduction
of average 7.1×). Putting everything together, while a naive
IOMMU causes an average 95% performance overhead, our
NeuMMU effectively closes this gap, incurring only an average
0.06% performance loss (Section IV).
With our robust NeuMMU design in hand, we then demon-
strate the usefulness of MMU’s address translation feature for
handling embedding layers. We show that an MMU-less NPU
suffers from (redundant) manual data copy operations between
CPU and NPUs, leading to an average 71% performance
loss when executing embedding layers. Our NeuMMU again
effectively closes this performance gap using direct NUMA
accesses across remote memory regions, achieving significant
performance improvements than an MMU-less NPU design
(Section V). To summarize our key contributions:
• To our knowledge, this work is the first to explore
architectural support for NPU MMUs, a significant first
step in exploring this emerging design space.
• We conduct a detailed, data-driven analysis on conven-
tional (dense) DNNs and emerging (sparse) embedding
layers, root-causing the limitations of GPU-centric IOM-
MUs in handling the bursty address translations of NPUs.
• We propose our throughput-centric NeuMMU design
based on our novel pending request merging buffer,
a high throughput parallel page-table walker, and a
translation path register, only incurring an average 0.06%
performance loss.
• Using DNN-based recommendation system models as
driving examples, we showcase how efficiently sparse
embedding layers can be handled using fine-grained
NUMA or page-migrations, enabled by NeuMMU.
2
IA
/O
A 
Bu
ff
er
 (S
PM
)
Activation
Psum
Psum
Weights
R
eL
UN
eu
M
M
U
+
 D
M
A
M
ai
n 
M
em
or
y
W Buffer (SPM)
Fig. 2: Baseline NPU architecture using a Google TPU-style systolic-
array [25]. Section II-C details the baseline NPU design.
II. BACKGROUND & METHODOLOGY
A. Address Translation in SPM-centric NPUs
NPUs generally utilize most of its on-chip SRAM as a
scratchpad memory (SPM1) whereas GPUs allocate the bulk
of this space as register-files in order to spawn as many
threads as possible for latency tolerance [24] (Figure 2). In
contrast, NPUs commonly leverage task-level parallelism to
double-buffer the SPM so that the latency to fetch the input
activations (IA) and weight filters (W) is hidden inside the
latency to execute a layer. Because the size of IA and W can be
hundreds to thousands of MBs, the DMA unit blocks the IA
and W into smaller tiles and sequence them in and out of the
SPM (e.g., typically tens of MBs in state-of-the-art NPUs [25],
[26]) across multiple iterations (Figure 3). A key reason why
NPUs prefer a SPM is because of the predictable performance
it delivers: once the data is brought in from memory to the
SPM, the latency to read/write data from/to the SPM is much
more deterministic than caches (i.e., SPM hit rate is 100%).
From an MMU standpoint, address translation is not required
when the processing elements (PEs) access the SPM during
the compute phase for layer computations. In other words, the
PEs need not have to query an MMU for address translations
when accessing the SPM. During the memory phases however,
the DMA unit does require information regarding where inside
the NPU physical address space the IA and W are located, as
detailed below.
B. NPU Programming Model and IOMMUs
Current NPU programming model. NPUs generally
feature a private, physically-addressed memory. Consequently,
the CPU must explicitly copy the necessary data structures
(e.g., IA and W) from the host memory to the NPU (physical)
memory address space. After the CPU→NPU data transfer
is complete, the NPU-side DMA unit is given the target
layer’s IA and W mapping information within the NPU
physical address space. Concretely, the DMA unit is given
the base (base) and the boundary value (bound) of the data
allocated, which is utilized to derive the physical address
of target data elements, obviating the need for a separate
address translation. Such approach is similar to the “old”
1While it is possible for future NPUs to adopt user-transparent caches, we
argue that SPMs are a more natural fit for NPUs. This is because DNNs
exhibit a highly deterministic dataflow (i.e., locality and data reuse information
is statically available), making them more amenable for optimizations using
the software-managed scratchpads for maximal resource utilization.
Time
Tile #0 Tile #1
Tile #0
Convolution/ matrix-multiplication
Load 
IA
VA→PA
(IA)
Load 
W
VA→PA
(W)
Load 
IA
VA→PA
(IA)
Load 
W
VA→PA
(W)
Tile #1
Convolution/ matrix-multiplication
Tile #2
Load 
IA
VA→PA
(IA)
Load 
W
VA→PA
(W)
Fig. 3: Compute phase is defined as the time NPU spends conducting
the necessary computations for a given tile, while memory phase is
defined as the period bringing in the IA/W tiles into SPM. Overlapping
tile(n)’s compute phase with tile(n+1)’s memory phase help
maximize NPU resource utilization. With an NPU MMU, an address
translation of tile IA/W must precede in order to fetch the actual
data.
GPGPU programming model, which suffers from the same
problems users had to face: 1) the working set must fit
within the NPU physical memory, preventing DNNs that
oversubscribe NPU memory from being executed (e.g., large
batch DNNs [12]), and 2) it becomes challenging to support
“pointer-is-a-pointer” semantics, reducing programmability and
complicating situations where the CPU and NPU (or among
multiple NPUs [27]) share data. To tackle these limitations, the
I/O MMU (IOMMU) [28] can be utilized to service accelerator-
side virtual-to-physical address translations and overcome the
aforementioned limitations.
IOMMU hardware/software architecture. The IOMMU
is assigned with the access privilege to walk the CPU’s page-
tables, allowing the CPU and the (GPU/NPU) accelerators to
share a unified global address space. When the accelerator
is not able to locate a proper translation for its virtual
address (VA), a translation request is sent as an ATS (address
translation service) request packet over PCIe to the IOMMU.
The IOMMU can include its own TLB hierarchy (called
IOTLB), which is checked first when an ATS is received.
When IOTLB misses, a hardware page-table walker (PTW)
inside the IOMMU walks the CPU page-tables to retrieve the
translated physical address (PA). Because a single IOMMU
block is designed to be shared by multiple accelerators (e.g.,
GPUs, DSPs, ISPs, and NPUs), current IOMMUs employ
multiple PTWs (typically 8 but can be up to 16), allowing
multiple translations in-flight.
C. Evaluation Methodology
Baseline NPU architecture. Our baseline NPU architecture
assumes a Google TPU-style systolic-array microarchitecture
(Figure 2), which we modeled as a detailed, cycle-level
performance simulator by cross-referencing publicly disclosed
documents and patents from Google [25], [29], [30], [31], [32].
Our performance model is cross-validated against Google
Cloud TPU [33], achieving an average 80% correlation in
terms of effective throughput. The baseline NPU model
employs a SPM based on-chip memory hierarchy and a weight-
stationary dataflow [34], as implemented in the original TPU
(Table I). Similar to discrete NPUs such as Intel-Nervana’s
Neural Network Processor [35] or Habana’s Gaudi [22], our
NPU model utilizes a local, high-bandwidth memory (e.g.,
HBM [36]). Similar to prior work [37], [38], [39], [40],
we modeled the memory system as having fixed latency
3
TABLE I: Baseline NPU configuration.
Processor architecture
Systolic-array dimension 128× 128
Operating frequency of PE 1 GHz
Scratchpad size (activations/weights) 15/10 MB
Memory system
Number of memory channels 8
Memory bandwidth 600 GB/sec
Memory access latency 100 cycles
IOMMU
Number of TLB entries 2048
TLB hit latency 5 cycles
Number of page-table walkers 8
Latency to walk page-tables 100 cycles per level
System Interconnect
NUMA access latency across sytem interconnect 150 cycles
CPU↔NPU Interconnect Bandwidth 16 GB/sec
NPU↔NPU Interconnect Bandwidth 160 GB/sec
and bandwidth rather than employing a cycle-level DRAM
simulator [41], [42] to reduce simulation time. When modeling
IOMMUs, we assume an x86-64 style, hierarchical 4-level
page-tables with key configuration parameters following those
from prior related literature [1], [3], [43]. While the remainder
of this paper assumes a systolic-array based NPU for our
discussions, the effectiveness of our NeuMMU design remains
intact for other NPU designs, such as spatial-array based
microarchitectures [37], [39], [40], [44], [45] as these NPUs
are also based on an SPM-centric memory hierarchy. We
discuss the implication of alternative NPU architectures and
DNN dataflows on our MMU proposal in Section VI-B.
Benchmarks. We study six DL applications as part of our
dense DNN workloads. We chose AlexNet, GoogLeNet, and
ResNet [46], [47], [48] as our CNN application suite (denoted
as CNN-1/CNN-2/CNN-3, respectively) because they cover a
wide range of filter and activation sizes. We also include three
RNNs from DeepBench [49], one regular GEMV (general
matrix-vector multiplication) based RNN (RNN-1) and two
LSTM based RNNs (RNN-2/RNN-3). For these workloads,
we observe an intractable amount of simulation time when
the batch size is larger than 16, so our analysis assumes a
batch size of 1/4/8 (denoted as b01/b04/b08), which is
reasonable for inference scenarios. To accommodate training
scenarios, we experiment a subset of the layers (i.e., a common
layer configuration exhibited in each of our DNN) with large
batch sizes in Section VI-C as a sensitivity study to explore
the implication of address translation on large batch training.
When studying the effectiveness of NeuMMU in handling sparse
DNN layers (e.g., embedding layers) in Section V, we use
two recommendation system models: the neural collaborative
filtering (NCF) based recommendation system [14] from
MLPerf [15] and the recently open-sourced DLRM (deep
learning recommendation model) from Facebook [13].
Page sizes (small vs. large). A key consideration in
designing a virtual memory system is its page size. Compared
to baseline 4KB pages, large (2MB) pages can potentially
reduce the translation invoked stalls by increasing TLB reach
and reducing TLB misses. As we discuss in Section VI-A,
we find that 2 MB large pages do in fact decrease the
performance overhead of address translations for conventional,
dense CNNs/RNNs which exhibits highly regular dataflows.
User1 User2 … Usern
Embedding lookup table0 (User)
User0 User3 Item2 … Itemn
Embedding lookup table1 (Item)
Item0 Item3Item1
Embedding for User0 Embedding for User3 Embedding for Item1 Embedding for Itemn
MLP Layer2
MLP Layern
MLP Layer1
Element-wise product
Concatenation
MLP Layer
Concatenation
Score
[Phase#2]
DNN 
computation
(Dense)
[Phase#1]
Embedding 
lookup
(Sparse)
Fig. 4: Neural network based collaborative filtering (NCF) [14].
Reading out embedding vectors (e.g., the yellow colored four vectors)
from the user/item embedding lookup tables is conceptually similar
to a gather operation with very low temporal and spatial locality.
Unfortunately, for emerging DL applications employing sparse
embedding layers with irregular memory access patterns
(Section III-B), we find that demand paging with large page
sizes incurs significant performance loss compared to small
pages (an average 83% vs. 99% performance loss for small
vs. large pages). Consequently, large pages alone are no
silver bullet in designing a virtual memory system for NPUs,
motivating the importance of robust address translation for
small pages. As large pages perform well for conventional
CNNs/RNNs, we assume the baseline 4KB pages for our
default evaluation. We revisit the implication of large pages
on address translations, its pitfalls for emerging, sparse DNN
layers in Section VI-A.
III. MOTIVATION: WHY MMU FOR NPUS?
A. Emerging, Memory-limited DL Workloads
A common property that conventional DL applications
share (for both training/inference) is that its working set
always fits within the tens of GBs of NPU local memory
budget – an artifact of the physically-addressed NPU memory.
However, recent studies from several hyperscalars [19], [20]
project that emerging DL workloads are heavily memory
“capacity” limited, exhibiting several hundreds of GBs of
memory footprint. DL applications such as recommendation
systems [14], word/character language models [50], and speech
recognition [51], for instance, employ embedding layers which
require several tens to hundreds of GBs of memory to store
just the model weights themselves, even for inference. Figure 4
illustrates the usage of embedding layers in recommendation
systems that incorporate neural networks [14], which are the
current state-of-the-art algorithms being deployed for news
feed, search, and ads. Facebook, for instance, stores deep
learning features (e.g., the pages a particular user liked) as
vectors called embeddings which are utilized to recommend
relevant posts or contents to users [20]. Each user has a
unique embedding vector so the total number of vectors
scale proportional to the number of users. Embedding layers
therefore house billions of weight parameters, which leads to
its tens to hundreds of GBs of memory usage. As shown in
Figure 4, recommender systems consist of two phases: 1) an
4
0
0
0
GPU0
1
1
1
GPU1
2
2
2
GPU2
(N-1)
(N-1)
(N-1)
GPU(N-1)…
1
2
1
2
1
2
1
2
(N-1) (N-1) (N-1) (N-1)
[Phase#1]
Embedding lookup
(Model-parallelism)
Communication
(all-to-all)
[Phase#2]
DNN computation
(Data-parallelism)MLP MLP MLP MLP
1 2 (N-1)0
0 0 0 0
Fig. 5: Facebook’s accelerator-centric parallelization scheme [13],
[19] as employed in its DLRM model. Each GPU is allocated with
1/N of the embedding tables, so the embedding lookup phase is
model-parallelized (e.g., GPU0 stores table0, GPU1 stores table1,
. . .). As the MLP portions are parallelized using data-parallelism,
each GPU must have all its share of embeddings ready before MLPs
are executed. Consequently, an all-to-all communication is conducted
to gather the embeddings from all the neighboring GPUs, shown
graphically where each color denotes a different element of the
minibatch and each number denotes the GPU and the embeddings
allocated to it. This figure is reproduced from an article by Facebook
announcing the open-sourcing of their DLRM model [13].
embedding “lookup” phase that gathers multiple embedding
vectors from potentially multiple lookup table (e.g., two tables
in Figure 4) to batch them into a single tensor, and 2) using
the batched tensor to execute several multi-layer perceptron
(MLP) layers. Because the model size of these embedding
lookup tables are far beyond the memory capacity limits of
GPUs/NPUs, the solutions vendors predominantly take are:
1) Host-centric approach: all the embedding lookup tables
are stored in the capacity-optimized CPU memory and
CPU is solely used for the entire inference process [13],
[52]
2) Accelerator-centric approach: model-parallelism [53] is
used to partition the embedding tables across multiple
GPU/NPU’s bandwidth-optimized memory [13], [19],
addressing the memory capacity constraints of embed-
dings.
B. NPU MMU for Remote Memory Access
Figure 5 shows a DNN-based recommendation system
parallelized in an “accelerator-centric” fashion (Section III-A).
That is, the compute-dominated MLPs are parallelized using
data-parallelism to improve performance of MLPs, whereas
the memory-capacity limited embedding tables are model-
parallelized to overcome the constraints of (only tens of GBs
of) accelerator local memory capacity. Assuming different
accelerator is allocated with a different lookup table, an all-to-
all communication is required in order to shuffle the results
of an embedding lookup of an entire minibatch on each
accelerator into parts of a minibatch of embedding lookups
on all accelerators. If the accelerators we assume here are
GPUs, they have several options that enable the all-to-all
communication process: 1) all GPUs can be passed with a
(shared) pointer to each embedding table, potentially stored
in a remote GPU’s local memory, which allows any GPU
with the pointer to directly load data in a CC-NUMA fashion
(over NVLINK [8], [54]), or 2) use P2P cudaMemcpy to
initiate direct GPU↔GPU DMA copy operations without
having to utilize host-side pinned memory as an intermediate
step. Unfortunately, neither of these options are available for
an MMU-less NPU because it does not have the ability to
address memory that is outside its local, physical memory
address space. In other words, the NPU is not able to reference
data that is not already available within its physical memory.
As such, the CPU runtime must manually copy the embeddings
from the source NPU memory to an intermediate CPU-side
pinned memory, and then do another copy of these embeddings
into the destination NPU memory. As we quantitatively
detail in Section V, such multi-step data copies and data
duplication adds significant latency, leading to an average
71% performance overhead.
Given this landscape, we argue that NPUs are in urgent need
for architectural support for robust address translations. In the
remainder of this section, we first discuss the fundamental
architectural differences between GPUs and NPUs and the
limitations of blindly employing prior GPU-centric MMUs
as-is. We then motivate the need of an NPU-optimized
MMU design based on a data-driven approach. We re-visit
the usefulness of our NPU MMU design in handling DL
applications using sparse embedding layers in Section V, which
improves the performance of an MMU-less NPU by 3.4×.
C. Data-driven Analysis of NPU MMUs
Translation bursts in SPM-centric NPUs. As discussed
in Section II-A, the data movements between main memory
and SPM are conducted in coarse-grained tile chunks, which
can be several MBs. For instance, our baseline NPU employs
10 MB of SPM each for IA and W, so the tile size of IA and
W can be as large as (10/2)=5 MB. Putting this number into
perspective, assuming NPUs have an MMU that enables VA-
to-PA translations, a single tile request by the DMA seeking
to fully populate the 5 MB on-chip SPM will need to access
a minimum of (5 MB/4 KB) = 1.2K distinct pages under the
baseline 4 KB page. The actual number of pages accessed can
be much larger than this minimum number because the DMA
is not necessarily fetching data in page-granularity in a dense
fashion (i.e., worst case, the DMA fetches only a single word
from a single page in a sparse manner). Figure 6 illustrates
the average and maximum number of distinct pages accessed
by a single tile requested by the DMA.
Note that the IA/W tiles are multi-dimensional tensors
mapped to a traditional, linear (1D) DRAM memory. Conse-
quently, a single tile tensor can be decomposed into multiple,
linearized memory transactions by the DMA unit. Each of
these memory transactions require address translation to
determine which page it belongs to, so the actual number
of translations invoked can be much larger than the number of
pages accessed (Figure 6). To make matters worse, these
address translation requests are generated in large bursts
within a short timeframe (henceforth referred to as translation
bursts), which cause significant translation bandwidth pressure
5
0500
1000
1500
2000
b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08
CNN-1 CNN-2 CNN-3 RNN-1 RNN-2 RNN-3
Pa
ge
-d
iv
er
ge
nc
e
(p
er
 t
ile
)
max avg
Fig. 6: Maximum/average number of distinct pages accessed for
each tile fetched by the DMA unit under 4KB pages.
0
200
400
600
800
1000
1200
0 1000000 2000000 3000000
N
um
be
r 
of
 t
ra
ns
la
ti
on
s 
re
qu
es
te
d 
w
it
hi
n 
10
00
 c
yc
le
s
Time (Cycles)
(a)
0
200
400
600
800
1000
1200
0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000
#
 o
f t
ra
ns
la
ti
on
s 
re
qu
es
te
d 
w
it
hi
n 
1K
 c
yc
le
s
Time (Cycles)
(b)
Fig. 7: Number of address translations requested by the DMA
assuming 4 KB pages within a consecutive 1000 clock cycle window
((a) CNN-1 and (b) RNN-1). The DMA unit sends a single translation
each cycle, so 1000 on the y-axis represent phases where the DMA
generates a burst of translations.
on the MMU (Figure 7). While these numbers might at
first glance seem surprising, we observe that this is a natural
outcome of NPU architectures optimized for data-/task-level
parallelism using an SPM based on-chip memory hierarchy.
State-of-the-art NPUs typically contain tens of thousands of
ALUs on-chip, so the SPM must be large enough to seamlessly
feed these processing engines with useful work. As there is
an implicit barrier enforced at the boundaries of a particular
tile’s compute and memory phase (i.e., any given tile(n)’s
computation can be initiated only when the entire tile is fully
fetched into the SPM, see Figure 3), the DMA unit tries
to concurrently launch the data read requests to DRAM to
maximize memory-level parallelism and fetch IA/W tiles as
soon as possible, inevitably leading to translation bursts.
Pitfalls of GPU-centric MMUs for NPUs. As convolutions
or matrix-multiplication operations are well-known to exhibit
high data reuse thanks to its regular dataflow [40], one might
think that conventional TLBs should effectively capture the
translation reuse with high TLB hit rates. However, NPUs have
fundamental architectural differences than GPUs, rendering
prior GPU-centric MMU solutions ineffective in handling
the aforementioned translation bursts. Recall that state-of-the-
art NPU architectures are based on a SPM-centric memory
hierarchy. The (PE↔SPM) data traffic do not require address
translations because SPM is addressed using VAs rather than
0.0
0.2
0.4
0.6
0.8
1.0
CNN-1 CNN-2 CNN-3 RNN-1 RNN-2 RNN-3
N
or
m
al
iz
ed
Pe
rf
or
m
an
ce
b01 b04 b08
Fig. 8: Normalized performance when enabling address translation
in 4 KB page granularity. The baseline IOMMU design which
contains 2048 TLB entries and 8 hardware page-table walkers were
used. Results are normalized to an oracular MMU that assumes all
translations hit in the TLB with no additional TLB access latency.
PAs. Consequently, unlike a GPU where a per-core, post-
coalescing TLB can effectively reduce a substantial amount
of GPU translation requests [1], [3], a per-PE TLB cannot
help in fitering out the translation burst bandwidth pressure of
NPUs. However, the intra-tile translation locality does exist for
(SPM↔DRAM) traffics when the DMA unit invokes multiple
data fetch requests from memory to the SPM that fall under the
same page. We observe however that such translation locality
is not adequately captured with a conventional TLB hierarchy
because the bursts of translations often query the TLB even
before the PTW delivers the VA-to-PA translations! Such
phenomenon is a unique characteristic of the SPM-centric
NPUs: for GPUs, memory read/write operations are initiated
through load and store instructions, which only amounts to
10−20% of the instruction mixes [55] and is therefore likely
to be distanced apart in time when sent over to the MMU for
address translations. The SPM-centric NPU however invokes
bursts of these translation request traffic to the MMU within
a short time-window leading to its high translation throughput
requirements.
Translation throughput vs. translation locality. In gen-
eral, we observe that NPUs that utilize a naive, strawman
IOMMU design enhanced with some key GPU-centric op-
timizations as-is (i.e., per-PE TLBs, parallel PTWs, a local
multi-level TLB hierarchy) is not able to properly handle
the NPU translation bursts as it is optimized for translation
locality rather than translation throughput, experiencing severe
performance slowdown as shown in Figure 8. In effect, the
sheer volume of translations requested to the IOMMU leads to
a large number of page-table walks, even after being filtered
by the TLBs. These massive number of address translations
eventually becomes bottlenecked by the limited parallelism
provided with a handful of (eight) shared IOMMU PTWs.
Overall, our data-driven analysis shows that conventional
MMUs which are primarily designed to capture translation
locality (i.e., TLBs), rather than translation throughput (i.e.,
number of PTWs), are inadequate in handling the translation
bursts in NPUs. To validate whether the TLB can be a primary
target for improvement in NPU MMUs, we sweep the numbers
of TLB entries on top of our baseline IOMMU with eight
hardware PTWs. Even with an unrealistically large TLB with
128K entries (64× increase over baseline 2K TLB entries,
Table I), the NPU fails to completely filter out the bursts of
translation requests, achieving less than 0.02% performance
improvement than baseline IOMMU. Overall, we conclude
6
that NPU local TLBs, while beneficial, is not sufficiently
performant enough to filter out most of the translations. This
is because the bursts of address translations cause significant
number of page-table walks instantiated and be bottlenecked
by the translation throughput provided with IOMMUs. This
is in stark contrast with GPUs where prior work [1], [3] has
shown TLBs to be effective in capturing an average 70−80%
of translations. One might think that by having the DMA
unit send data requests in a less bursty fashion (e.g., only
allow up to a limited number of data and address translations
that the IOMMU can sustain), the effectiveness of TLBs
can be restored and performance loss reduced. Unfortunately,
such design decision will inevitably reduce memory-level
parallelism and memory bandwidth utilization, significantly
slowing down memory-limited applications like RNNs.
Proposed approach: throughput-centric MMU. Based
on our data-driven analysis, we conclude that translation
throughput should be the primary design target for NPU
MMUs. This is because of the SPM-centric NPU’s unique
architectural characteristic, where the tile-based bulk DMA
transfers invoke translation bursts which are not adequately
captured using the locality-optimized, GPU-centric MMUs. In
the following section, we propose a “throughput”-centric NPU
MMU design that effectively balances translation throughput
while also adequately capturing translation locality.
IV. NEUMMU: DESIGNING AN MMU FOR NPUS
A. PRMB: Translation Bandwidth Filter
As discussed in Section III-C, a key challenge with NPU
address translation is that the SPM-centric memory hierarchy
invokes a burst of several thousands of address translations
when moving data in/out of main memory from/to SPM.
Our first proposal is based on the key observation that a
significant fraction of translation bursts hit in the same page
that is already being translated by the PTW. To capture such
translation locality within translation bursts, we propose a
PTW design enhanced with a pending request merging buffer
(PRMB) that absorbs the page translation requests falling
under an already inflight, pending translation initiated by
that same PTW. Figure 9 illustrates the microarchitecture
of the proposed PTW with our PRMB assuming that it can
merge up to 4 identical page translations per each PTW. Any
memory transaction that misses in the TLB is first routed to
the pending translation scoreboard (PTS) to check whether
any one of the N parallel PTWs is currently under the process
of translating the corresponding page. The PTS is a fully-
associative cache with N cache entries (equivalent to the
number of PTWs) and is tagged with the virtual page number
(VPN). A hit in the PTS implies that a VA→PA translation
for this particular VPN is currently inflight. If there are vacant
PRMB mergeable slots within the PTW, the PTS-hit request is
merged inside the PRMB and waits until the translation comes
back. A PTS miss however implies that neither the TLB nor
any one of the PTWs contains the translation for this VPN.
The PTS therefore assigns one of the vacant PTWs (if any)
as the designated translation unit to walk the page-tables, and
Pending Request Merging Buffer
(PRMB)
VA0 Other state
VA1 Other state
VA2 Other state
- -
(Section 4.1)
(Section 4.3)(Section 4.1)
(Section 4.2)
PRMB
Page-Table Walker
L4 L3 L2
(Section 4.3)(Section 4.1)
(Section 4.2)
PRMB
Page-Table Walker
TPreg
L4 L3 L2
(Section 4.3)(Section 4.1)
(Section 4.2)
PRMB
Page-Table Walker
TPreg
L4 L3 L2
(Section 4.3)(Section 4.1)
(Section 4.2)
PRMB
Page-Table Walker
TPreg
L4 L3 L2
(Section 4.3)(Section 4.1)
(Section 4.2)
PRMB
…
Pending Translation Scoreboard
(PTS)
Tag Data
Tag Data
Tag Data
Tag Data
Number of PTWs = N
Fully-associative cache
(N entries)
[Step#1] TLB missed request queries the PTS for hit/miss
[Step#2] 
- PTS hit? Find free PRMB slots and merge
- PTS miss? Allocate new PTW and PRMB entry
Fig. 9: Proposed PTW with a PRMB containing 4 mergeable slots.
The example assumes that the PTW is currently walking the page-
tables to translate a virtual page number of VPNn. Three translation
requests to the same virtual page are merged inside the PRMB.
registers the VPN information inside one of the PTS entries so
that future translations to this particular VPN can be merged.
When all the PTWs as well as all possible PRMB mergeable
slots are full, any further translation requests are blocked until
the translation bandwidth is available.
Because the translations that are merged inside PRMB do
not send a separate page-walk request and instead waits for the
already inflight translation request to come back, our PRMB
microarchitecture saves not only memory bandwidth but more
importantly, the PTW “translation bandwidth”, effectively
functioning as a translation bandwidth filtering mechanism.
Once the translation is available, the PTW controller queries
the PRMB and returns the merged requests back to the DMA
unit on a cycle-by-cycle basis. Figure 10 shows the effect
of PRMB on overall performance as a function of how many
merge-able entries are provisioned inside each PRMB. As
depicted, for our studied DNNs, having 8-32 mergeable
slots per each PTW can significantly capture the translation
burst locality thereby minimizing the redundant translation
requests from wasting memory and translation bandwidth. This
allows our PRMB-enhanced NPU to achieve an average 11%
(max 98%) performance of an oracular MMU, a significant
improvement over the baseline IOMMU. Nonetheless, there
still exists a significant performance gap of 89% motivating
us to our second proposition.
B. (Translation) Throughput-centric MMU
The PRMB-enhanced PTW helps capture translation burst
locality while minimizing waste in memory and translation
throughput. Nonetheless, the low average TLB hit rate and the
sheer volume of required address translations render significant
pressure on the meager 8 IOMMU page-table walkers. While
Powers et al. [1] similarly observed that enhancing parallelism
to the PTW helps improve GPU’s translation throughput,
adding more parallel PTWs was only able to achieve, on
average, 30% of the oracular MMU design point under the
7
0.0
0.2
0.4
0.6
0.8
1.0
b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08
CNN-1 CNN-2 CNN-3 RNN-1 RNN-2 RNN-3
N
or
m
al
iz
ed
Pe
rf
or
m
an
ce
PRMB(1) PRMB(2) PRMB(4) PRMB(8) PRMB(16) PRMB(32)
Fig. 10: Performance sensitivity to the number of mergeable slots
within the PRMB. All configurations assume the baseline setting with
8 PTWs and 2048 TLB entries as assumed in Figure 8.
0.0
0.2
0.4
0.6
0.8
1.0
b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08
CNN-1 CNN-2 CNN-3 RNN-1 RNN-2 RNN-3
N
or
m
al
iz
ed
Pe
rf
or
m
an
ce
PTW(8) PTW(16) PTW(32) PTW(64) PTW(128) PTW(256) PTW(512) PTW(1024)
Fig. 11: Performance sensitivity to the number of hardware page-
table walkers with 4 KB pages. All configurations assumes a 2048
entry TLB with 32 mergeable slots per each PRMB unit within each
PTW. Results are normalized against an oracular MMU.
GPU context. This is because leveraging translation locality,
using the per-core/post-coalescing TLB and multi-level TLB
hierarchy, was shown to be more important than enhancing
raw translation throughput for GPUs [1], [3].
Our work, on the other hand, makes the unique observation
that the bursty nature of NPU translation requests, coupled with
the relatively low TLB hit rates, “mandates” a throughput-
centric MMU as a primary design objective. As such, the
key insight our data-driven analysis delivers is that the SPM-
centric NPUs should be designed for improving translation
throughput first, and translation locality second. As such, our
second proposition is that the NPU MMU should be further
enhanced for translation throughput by adding a larger number
of PTWs. Figure 11 shows the NPU performance sensitivity on
address translation throughput, where increasing the number
of PTWs from 8 to 128 closes the performance gap from an
average 11% to 99% for baseline 4 KB pages: 128 PTWs
turned out to be a good design point for the set of benchmarks
we have evaluated, but larger/smaller PTWs might be required
for alternative NPU configurations. We discuss the sensitivity
of NeuMMU for alternative design points in Section VI-C. As
noted in Section III-C, the IOMMU is designed to be shared
by multiple accelerators. To make sure the NPU alone does not
saturate the address translation throughput, we argue that the
number of PTWs be sufficiently provisioned such that it does
not become a performance hotspot. Studying efficient MMU
resource allocation strategies across multiple accelerators for
QoS is beyond our scope and we leave it as future work.
It is worth pointing out that blindly increasing the number
of PTWs alone, without employing our translation bandwidth
filtering PRMB microarchitecture, can cause significant over-
heads in energy-efficiency. Figure 12(a) shows the performance
of baseline IOMMU enhanced with a larger number of PTWs
without our PRMB design adopted. With 1024 PTWs with no
PRMB, the performance does in fact match the performance
0.0
0.2
0.4
0.6
0.8
1.0
b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08
CNN-1 CNN-2 CNN-3 RNN-1 RNN-2 RNN-3
N
or
m
al
iz
ed
Pe
rf
or
m
an
ce
PTW(8) PTW(16) PTW(32) PTW(64) PTW(128) PTW(256) PTW(512) PTW(1024)
(a)
0
0.2
0.4
0.6
0.8
1
1.2
0
0.2
0.4
0.6
0.8
1
1.2
[512,8] [256,16] [128,32] [64,64] [32,128]* [16,256] [8,512] [4,1024] [2,2048] [1,4096]
N
or
m
al
iz
ed
En
er
gy
N
or
m
al
iz
ed
 
Pe
rf
or
m
an
ce
Energy Performance
(b)
Fig. 12: (a) Performance sensitivity to the number of hardware page-
table walkers “without” the PRMB microarchitecture employed. (b)
Energy consumed for address translations when NeuMMU employs
M PRMB entries and N PTWs, denoted as [M,N] in the x-axis. We
used the 1) energy table for a 45 nm CMOS process [56] to model
the DRAM access energy consumption in page-table walks and 2)
CACTI [57] for modeling PRMB access energy.
of NeuMMU with 32 PRMB and 128 PTWs. Such design
point however consumes significantly more energy as shown
in Figure 12(b). Without the translation bandwidth filtering
effects of PRMB, a significant fraction of translations that
walk the page-tables are redundant and causes up to 7.1×
more energy consumption than the nominal 32 PRMB and
128 PTWs of NeuMMU. Our novel PRMB microarchitecture
and the throughput-centric parallel PTW design effectively
balances performance and energy-efficiency, reaching 99% of
the performance of oracle while consuming much less energy
than single-handedly relying on large PTWs without PRMB.
C. Translation Path “Registers” (not Caches)
An important challenge with the NeuMMU design so far
is that a significant fraction of translations still require a
page-table walk. While the abundant address translation
throughput provided with NeuMMU design so far effectively
hides the latency of translations, the number of page-table
walks themselves are still relatively high because of the
low TLB hit rate. As our study assumes an x86-64 style,
hierarchical 4-level page-tables, a single page-table walk
operation would incur up to four memory transactions with
significant power overheads. For power-limited environment,
the overhead of adding address translations can be prohibitive
which leads to our last proposal: a lightweight translation
path “register” (TPreg) that allows PTWs to skip some
page-table walking steps. Our TPreg microarchitecture is
inspired by the well-known MMU caches [23] (aka translation
path caches), widely adopted in CPUs/GPUs, but TPreg
leverages the unique characteristics of DNNs to minimize its
implementation overheads (i.e., less than 16 bytes per PTW)
while reducing the number page-table walk invoked memory
transactions by more than 2.5×.
8
0.0
0.2
0.4
0.6
0.8
1.0
b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08
CNN-1 CNN-2 CNN-3 RNN-1 RNN-2 RNN-3
TP
Re
g 
H
it 
Ra
te
L4 idx L3 idx L2 idx
Fig. 13: TPC tag matching rate at L4/L3/L2 indices when only a
single TPC entry is used (i.e., TPreg).
Benefits of caching translation paths. Under x86-64 based
translation system, the paged virtual memory is implemented
using a radix tree for their page-tables. The translation path
caches accelerate the page-table walking process by allowing
the processor (in our case, the NPU PTW) to skip over a single
or more levels of the radix tree. The virtual address space is
decomposed into a page-number and a page-offset, where the
page number is further split into a sequence of indices, four
in x86-64. The first index (L4) is used to select an entry from
the root of the radix tree, which could potentially contain a
pointer to a node in the next lower level (i.e., L3) of the tree. If
a valid entry is found, the next index value is used to jump to
the next tree level, which can again potentially find a pointer
to the node in the next lower level (L2) of the tree. Such
procedure is repeated until the selected entry is invalid or the
tree search finalizes at a data page using the PA. As x86-64
currently uses 48 bits out of the memory addressable 64 bits,
a baseline 4 KB page size utilizes the lower 12 bits and the
remaining 36 bits are divided into four 9 bit indices. Because
page-table walks to two consecutive VA pages will most likely
use the same L4/L3/L2 entries, significant translation locality
exists across spatially close VA regions.
Design space of translation path caches. x86-64 processor
vendors already employ private, low-latency translation path
caches that store upper-levels of the page-table entries [58],
[59] and the tradeoffs of alternative translation path cache
designs are well-understood through prior literature [23]. A full
design space exploration of all available options for our NPU
MMU design is beyond the scope of this work. Nevertheless,
we briefly discuss two representative design points that are
inspired by translation path caches employed by CPUs from
Intel/AMD, which drives our proposed TPreg design. The
most intuitive implementation of a translation path cache is to
store page-table entries tagged by the corresponding entry’s
physical address in memory. Entries from different levels of
the (L4/L3/L2/L1) page-tables are mixed and shared inside
a unified cache, all indexed and tagged by their physical
address. Such unified page-table cache (UPTC) is known
to be adopted in AMD’s processor designs. Intel, on the
other hand, employs a translation cache design that is tagged
using the virtual address. The translation-path cache (TPC)
microarchitecture [23], for instance, is tagged by the L4/L3/L2
indices of the virtual address. Key intuition behind the TPC
design is that the three separate UPTC cache entries allocated
to keep track of the three page-table lookups can be merged
into a single TPC lookup when: 1) all physical page numbers
0x0aae60
0x0c3500
0x0dbba0
0x0f4240
0x10c8e0
0x124f80
Tile (t) Tile (t+1) Tile (t+2) Tile (t+3)
Vi
rt
ua
l a
dd
re
ss
Fig. 14: Trace of AlexNet’s virtual address regions accessed while
consecutive tiles are requested by the DMA unit.
are concatenated and merged into a single data entry, and
2) is tagged using a concatenation of the virtual L4/L3/L2
indices. Under such design, a single TPC entry corresponds
to an entire path, including all of the intermediate entries for
a given page-table walk operation.
Our study reveals that TPC is much more effective
than UPTC in capturing the NPU address translation lo-
cality. On average, the L4/L3/L2 tag hit rate of TPC was
99.5%/99.5%/63.1% across the studied workloads whereas
UPTC achieved an average 92.4% hit rate. This allows TPC
to reduce 59% less page table walks when compared to UPTC.
Translation path “registers” (not caches). Based on our
design space exploration above, we conclude that a TPC-based
translation caching to be a more robust architecture than UPTC.
As shown in Figure 13, employing a single translation path
“register” (TPreg) per each PTW (which caches the L4/L3/L2
entries as done in TPC) can capture most of the performance
benefits of translation caching while removing significant
fraction of the redundant page-table walks. As such, TPreg
can be a lightweight, cost-effective solution to reduce the
number of memory transactions for NPU page-table walks.
Below we detail the key insights behind the effectiveness of
our TPreg microarchitecture.
1) The number of distinct VA regions accessed is confined
within a handful of large segments in the VA space (i.e.,
IA and W), so translations to VA pages that fall under
the same (IA/W) segment are highly likely to share a
common L4/L3/L2 translation entry.
2) Another important observation we make is that the
DMA unit initiates tile fetch requests for IA and W
one at a time, meaning the data fetch request is not
interleaved across IA and W. This implies that the
majority of address translation requests invoked in the
memory-to-SPM data fetch process will naturally share
the L4/L3/L2 entries. Figure 14 illustrates the virtual
addresses accessed in time, confirming the high temporal
locality of address translations which translation caching
can effectively take advantage of.
3) While the upper L4/L3/L2 entries exhibit high temporal
locality, the locality in the lower entries can be low
because the VA accessed in time exhibit a streaming
access pattern as exhibited in Figure 14. This allows
a TPC-style translation cache with only a handful of
entries to be able to capture most of the performance
benefits of translation path caches (Figure 13), moti-
9
vating our lightweight translation path “register”, not a
full-blown cache microarchitecture.
Energy-efficiency improvements. While the effect of
TPreg on performance is small, its impact on energy-
efficiency is substantial. Using the energy table for a 45
nm CMOS process [56], we derive the energy overheads
of walking the page-tables for the two design points: our
NeuMMU with 128 PTWs/32 PRMB entries with and without
the single entry TPreg. Our lightweight TPreg substantially
reduces the energy-overheads by an average 2.7× thanks to
the high translation hit rates and the resulting smaller number
of memory transactions to walk the page-tables.
D. Putting Everything Together
Overall, we show that the baseline IOMMU fails to capture
the translation locality and throughput requirements of NPU
MMUs, causing an average 95% performance overhead for
baseline pages. Based on a data-driven application charac-
terization study, we motivated the need for a throughput-
centric MMU and proposed three unique solutions tailored for
the algorithmic nature of DNNs and the SPM-centric NPU
architecture. Putting all three solutions together, our NeuMMU
design incurs an average 0.06% performance overhead for
baseline/small pages when compared against an oracular MMU
that assumes all translations hit in the TLB with no additional
TLB access latency. Furthermore, NeuMMU consumes 16.3×
less energy than the baseline IOMMU, thanks to the PRMB and
TPreg, which reduce the number page-table walk invoked
memory transactions by 18.8×.
E. Implementation Overhead
We measure NeuMMU’s design overhead using CACTI
and synthesized implementations over an FPGA board. The
additional SRAM storage required for the per-PTW PRMB
and TPreg is as follows. Each PRMB entry is conservatively
estimated to be 8 bytes, so a total of (8×32×128) = 32KB
SRAM storage is needed across the 128 PTWs, 32 PRMB
entries per each PTW. Regarding the TPreg, each consumes
16 bytes so the 128 PTWs consume 2KB. The PTS is a fully-
associative cache with 128 cache entries, each entry sized
at 6 bytes. All these amount to an area of 0.10 mm2 under
32 nm with 13.65 mW of leakage power consumption when
estimated with CACTI 6.5 [57]. We also synthesize both the
baseline IOMMU and NeuMMU on a Xilinx Virtex UltraScale+
VCU1525 dev board and compare its resource usage. The
amount of additional resources NeuMMU consumes are less
than 0.01% of the available resources, incurring negligible
overheads.
V. CASE STUDY: NUMA NPUS FOR SPARSE EMBEDDING
LAYERS
As discussed in Section III-A, state-of-the-art recommen-
dation systems using embeddings exhibit highly sparse,
irregular memory accesses over a large embedding table
(Figure 4). To overcome the memory capacity bottleneck,
0
0.2
0.4
0.6
0.8
1
1.2
Ba
se
lin
e
NU
MA
(s
lo
w
)
NU
MA
(fa
st
)
Ba
se
lin
e
NU
MA
(s
lo
w
)
NU
MA
(fa
st
)
Ba
se
lin
e
NU
MA
(s
lo
w
)
NU
MA
(fa
st
)
Ba
se
lin
e
NU
MA
(s
lo
w
)
NU
MA
(fa
st
)
Ba
se
lin
e
NU
MA
(s
lo
w
)
NU
MA
(fa
st
)
Ba
se
lin
e
NU
MA
(s
lo
w
)
NU
MA
(fa
st
)
b01 b08 b64 b01 b08 b64
NCF DLRM
La
te
nc
y 
(n
or
m
al
iz
ed
)
GEMM Reduction Else Embedding lookup
Fig. 15: Latency breakdown when running workloads using em-
beddings. Baseline MMU-less NPU (baseline) assumes remote
embeddings are copied to CPU memory first and then copied again
to the destination NPU over PCIe. We study two NUMA systems
enabled by NeuMMU: 1) NUMA over legacy PCIe system interconnect
(NUMA(slow)) and 2) NUMA over high-bandwidth GPU/NPU
interconnects like NVIDIA’s NVLINK [54] (NUMA(fast)). We
follow prior work on NUMA CPU-GPU or multi-GPU studies [2], [8],
[60] when modeling the added latency and communication bandwidth
constraints of CPU↔NPU or NPU↔NPU interconnects (Table I).
Facebook’s DLRM [13] for instance employ an accelerator-
centric parallelization strategy where the embedding table is
model-parallelized across multiple GPUs (in our case the
NPUs). This however comes at the cost of an all-to-all
communication among the GPUs (NPUs) to gather all the
embeddings from remote memory regions (Figure 5). Because
current MMU-less NPUs cannot address memory outside its
local, physical memory, an intermediate solution that facilitates
remote embedding gathers will be to have the CPU runtime
manually copy the (remote) embedding vectors to a buffer
allocated in CPU memory, which is then copied over to the
(local) NPU memory. Since the embedding vectors are sparse
with only several hundreds of bytes in size, current GPUs can
instead utilize its MMU to directly access remote GPU memory
for embedding gathers in a CC-NUMA fashion or migrate the
missing pages into its local memory, obviating the need for a
CPU-involved, intermediate data transfer and copy operation.
With our proposed NeuMMU architecture in hand, the NPU
can page-fault on missing pages mapped to a remote NPU’s
physical memory, and either 1) directly fetch the embeddings
using fine-grained NUMA accesses, or 2) migrate the missing
page into its local memory (Section VI-A). Figure 15 shows
the performance advantage of utilizing NUMA accesses for
handling sparse embedding layers. As depicted, the baseline
MMU-less NPU suffers from significant increase in latency
because of the (redundant) manual data copies over CPU
memory. NeuMMU is able to reduce the latency spent in
gathering the embeddings as this process is undertaken directly
over the (PCIe or NVLINK) system interconnect in a fine-
grained NUMA fashion, achieving an average 31% and 71%
latency reduction than baseline NPU without an MMU. These
results highlight the merits of featuring address translations
in NPUs, which we will be of utmost importance as DL
workloads evolve into having high capacity demands with
irregular memory access behaviors.
10
VI. DISCUSSION
A. NeuMMU with Large Pages
The computation and memory access characteristics of
conventional dense CNNs/RNNs are highly regular with its
tensor data (i.e., IA and W) sized at several hundreds of MBs.
We observe that 2MB large pages can substantially decrease the
performance overhead of baseline IOMMU address translations
for the dense CNNs/RNNs (average 4%, worst case 10%). Our
NeuMMU architecture successfully removes such performance
overheads as in the baseline 4KB pages. Given the high
efficiency of IOMMU based address translation under large
pages, one might presume that NPU MMUs should simply
employ large pages exclusively without baseline small pages.
However, for DL workloads that exhibit sparse data access
patterns with large memory footprint (such as our studied
sparse embedding layers), large pages can incur a much
significant performance overhead compared to small pages.
Large pages increase the data transfer size of each demand
paged request (recall that a single embedding is only hundreds
of bytes with low temporal/spatial locality, Figure 4) which,
not only causes significant (redundant) communication traffic
on the system interconnect and hurt performance, but also
wastes NPU physical memory by causing memory bloats
via internal fragmentation [61]. Rather than using NUMA
to gather sparse embeddings (as assumed in Section V),
Figure 16 summarizes the performance of small and large
pages when utilizing NeuMMU to page-fault and migrate the
missing (sparse) data using demand-paging into the NPU
physical memory. For small pages, NeuMMU performs well
to recover the lost performance and improves performance
from 17% up to an average 96% of oracle. Unfortunately,
the performance overhead of large pages cannot be recovered
with NeuMMU because of the (redundant) prefetching effects
of large pages over the sparse access patterns of NCF and
DLRM. These results highlight the importance of providing
robust address translation service for both small and large
pages for current and future DL workloads. Prior work by
Ausavarungnirun et al. [62] that synergistically combines small
and large pages concurrently can be a promising solution
to address these issues. Nonetheless, our paper focuses on
efficient address translation support so evaluating efficient
page-fault handling and demand paging solutions that closes
such performance for large pages is beyond our scope and we
leave it as future work.
B. NeuMMU with Alternative NPU Architectures
There have been numerous NPU designs proposed in
prior literature so it is challenging to define a “generic”
NPU architecture that represents all design points in such
fast-evolving space. As such, our baseline NPU assumed
Google’s systolic-array microarchitecture as it is to date
the most successfully deployed NPU design. To study the
applicability of NeuMMU on alternative NPU designs, we also
developed a cycle-level performance model that follows several
representative prior work based on the spatial architecture
0.0
0.2
0.4
0.6
0.8
1.0
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
Ba
se
lin
e
Ne
uM
M
U
b01 b04 b08 b01 b04 b08 b01 b04 b08 b01 b04 b08
4KB pages 2MB pages 4KB pages 2MB pages
NCF DLRM
N
or
m
al
iz
ed
 
Pe
rf
or
m
an
ce
Fig. 16: Performance of executing the DL workloads using sparse
embedding layers [13], [14] over baseline IOMMU and NeuMMU,
normalized to oracular MMU. The experiment assumes similar
evaluation settings as in Figure 15 except that the missing embeddings
are brought into local physical memory using demand paging.
design paradigm [37], [39], [40], [44], [45]. Our spatial-array
based NPU design is modeled similar to DaDianNao [44]
or Eyeriss [40], which employs a two-dimensional grid of
PEs, each of which contains a vector ALU that handles
dot-product operations. These spatial NPU architectures also
employ an SPM-centric on-chip memory hierarchy, which our
NeuMMU design is founded upon, and our evaluation showed
that our NeuMMU architecture is able to similarly close the
performance gap of baseline IOMMUs, only incurring an
average 2% performance overhead. We omit the results due
to space limitations.
C. Sensitivity
The robustness of NeuMMU has been studied over several
different architecture configurations as well as different appli-
cation batch sizes. For NeuMMU design space exploration, we
sweep the number of PRMB mergeable slots (1 to 32), parallel
PTWs (64 to 256), TPreg entries, and total TLB entries (128
to 2K). Across all the sensitivity studies, the performance
achieved was never less than 73% with an average 97% of
the oracular MMU. We also studied our workloads with large
batch size of 32, 64, and 128. As mentioned in Section II-C,
large batches lead to intractable amount of simulation time,
so we limit our evaluation to the common layer configuration
of each DNN. Similar to small batches, the baseline IOMMU
achieves an average 5.9% of oracle. NeuMMU successfully
closes this performance gap for large batches, reaching 99.9%
of oracular MMU. These results demonstrate the robustness
of our throughput-centric NeuMMU for SPM-centric NPUs.
VII. RELATED WORK
Our work builds on top of prior studies on hardware
and software mechanisms to accelerate MMUs. Here we
first summarize closely related work on CPU/GPU MMUs,
followed by a summary on prior literature designing ML
accelerator architectures.
Address Translation for CPUs. As application memory
footprint increases, commercial CPUs have started including
multi-level TLB hierarchies [63], [64] with per-core MMU
caches to accelerate the page-table walking process. Barr et
al. [23] explored the design space of MMU caches, showing
that the most effective ones are unified translation caches.
Prefetching translations [65], [66], [67], shared TLBs [68],
11
and MMU caches [69] has also been studied in the literature
to alleviate translation overheads in various CPU context.
Address Translation for Accelerators. There have been
some pioneering work by Power et al [1] and Pichai et al. [3]
that explored the benefits of GPUs in utilizing IOMMU for
VA-to-PA address translations. Both of these studies proposed
a per-core, post-coalescer TLB, dedicated logic to walk the
page-tables, multi-level shared TLB hierarchy, and a page
translation cache, similar to our proposal. Hao et al. [43]
studied the utilization of IOMMUs for accelerator-centric
systems, similarly proposing shared TLBs, parallel PTWs,
and MMU caches. Our work differentiates itself from all
these prior studies as we specifically target NPUs with a
carefully designed, throughput-centric MMU that is tailored
for DNNs. As discussed in Section IV, we quantitatively
demonstrated that prior translation locality-centric MMUs are
not able to sufficiently handle the translation bursts of DNNs.
Our study provides the key insight that NPU MMUs should
be designed for enhancing translation throughput, rather than
translation locality, leading to our novel PRMB and TPreg
microarchitecture on top of a massively parallel page-table
walker design.
Accelerator architectures for ML. Aside from these
closely related prior work on MMUs, there has been a
large body of prior work exploring the design of space
of ML accelerator architectures [12], [40], [44], [70], [71],
[72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82],
[83], [84], [85], [86], [87], [88], [89] with recent interest
on sparsity-optimized solutions for further energy-efficiency
improvements [37], [38], [39], [90], [91], [92], [93], [94], [95],
[96], [97], [98]. Our work on NPU MMUs is orthogonal to
these prior art as our primary focus is on adding new features
to these ML accelerator designs.
VIII. CONCLUSION
As the computation demands for DL workloads increase, we
expect NPUs to evolve into first-class citizens in heterogeneous
computing platforms. We make a case for providing address
translation capabilities for NPUs, an important first step in
evolving these devices as primary compute engines. Through
a data-driven application characterization study, we root-
cause the challenges in prior GPU-centric MMU solutions
and propose three novel architecture designs tailored for
the application behavior of DNNs. Compared to an oracular
MMU design, our proposal achieves only an average 0.06%
performance overhead while allowing CPUs and NPUs to
share a unified global address space.
REFERENCES
[1] J. Power, M. Hill, and D. Wood, “Supporting x86-64 Address Translation
for 100s of GPU Lanes,” in Proceedings of the International Symposium
on High-Performance Computer Architecture (HPCA), 2014.
[2] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler,
“Toward High-Performance Paged-Memory for GPUs,” in Proceedings
of the International Symposium on High-Performance Computer Archi-
tecture (HPCA), March 2016.
[3] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural Support for
Address Translation on GPUs: Designing Memory Management Units
for CPU/GPUs with Unified Address Spaces,” in Proceedings of the
International Conference on Architectural Support for Programming
Languages and Operation Systems (ASPLOS), 2014.
[4] NVIDIA, “Unified Memory in CUDA 6,” 2013.
[5] HSA Foundation, “Heterogeneous System Architecture,” 2018.
[6] C. Li, R. Ausavarungnirun, C. J. Rossbach, Y. Zhang, O. Mutlu, Y. Guo,
and J. Yang, “A Framework for Memory Oversubscription Management
in Graphics Processing Units,” in Proceedings of the International
Conference on Architectural Support for Programming Languages and
Operation Systems (ASPLOS), 2019.
[7] N. Sakharnykh, “Unified Memory on Pascal and Volta,” 2017.
[8] N. Agarwal, D. Nellans, M. Stephenson, M. O’Connor, and S. W.
Keckler, “Page Placement Strategies for GPUs within Heterogeneous
Memory Systems,” in Proceedings of the International Conference
on Architectural Support for Programming Languages and Operation
Systems (ASPLOS), 2015.
[9] NVIDIA, “The NVIDIA DGX-2 Deep Learning System,” 2017.
[10] NVIDIA, “NVSwitch: Leveraging NVLink to Maximum Effect,” 2018.
[11] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog,
C. J. Rossbach, and O. Mutlu, “MASK: Redesigning the GPU Memory
Hierarchy to Support Multi-Application Concurrency,” in Proceedings of
the International Conference on Architectural Support for Programming
Languages and Operation Systems (ASPLOS), 2018.
[12] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler,
“vDNN: Virtualized Deep Neural Networks for Scalable, Memory-
Efficient Neural Network Design,” in Proceedings of the International
Symposium on Microarchitecture (MICRO), 2016.
[13] M. Naumov, D. Mudigere, H. Shi, J. Huang, N. Sundaraman, J. Park,
X. Wang, U. Gupta, C. Wu, A. Azzolini, D. Dzhulgakov, A. Mallevich,
I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko,
S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyan-
skiy, “Deep Learning Recommendation Model for Personalization and
Recommendation Systems,” in arxiv.org, 2019.
[14] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural
Collaborative Filtering,” in Proceedings of the International Conference
on World Wide Web (WWW), 2017.
[15] MLPerf, “MLPerf: A Broad ML Benchmark Suite for Measuring
Performance of ML Software Frameworks, ML Hardware Accelerators,
and ML Cloud Platforms.” https://github.com/mlperf/inference/tree/
master/cloud, 2019.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in arxiv.org,
2017.
[17] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding,” in
arxiv.org, 2018.
[18] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” in
arxiv.org, 2014.
[19] J. Hestness, N. Ardalani, and G. Diamos, “Beyond Human-Level
Accuracy: Computational Challenges in Deep Learning,” in Proceedings
of the Symposium on Principles and Practice of Parallel Programming
(PPOPP), 2019.
[20] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia,
J. Law, P. Malani, A. Malevich, S. Nadathur, J. Pino, M. Schatz,
A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen,
U. Diril, D. Dzhulgakov, K. H. an B. Jia, Y. Jia, L. Qiao, V. Rao,
N. Rotem, S. Yoo, and M. Smelyanskiy, “Deep Learning Inference in
Facebook Data Centers: Characterization, Performance Optimizations
and Hardware Implications,” in arxiv.org, 2018.
[21] Facebook, “Accelerating Facebook’s infrastructure with
Application-Specific Hardware.” https://code.fb.com/data-center-
engineering/accelerating-infrastructure/, 2019.
[22] Habana, “Gaudi Training Platform White Paper,” 2019.
[23] T. W. Barr, A. L. Cox, and S. Rixner, “Translation Caching: Skip, Don’t
Walk (the Page Table),” in Proceedings of the International Symposium
on Computer Architecture (ISCA), 2010.
[24] NVIDIA, “NVIDIA CUDA Programming Guide,” 2016.
[25] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao,
C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaem-
maghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg,
J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan,
12
H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon,
J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean,
A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami,
R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross,
M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham,
J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma,
E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon,
“In-datacenter Performance Analysis of a Tensor Processing Unit,” in
Proceedings of the International Symposium on Computer Architecture
(ISCA), 2017.
[26] NVIDIA, “NVIDIA Tesla V100,” 2018.
[27] Google, “Cloud TPUs: ML accelerators for TensorFlow,” 2017.
[28] AMD, “AMD’s I/O Virtualization Technology (IOMMU) Specification,”
2018.
[29] J. Ross, N. Jouppi, A. Phelps, R. Young, T. Norrie, G. Thorson, and
D. Luu, “Neural Network Processor.” Patent, 05 2015. US 9747546B2.
[30] J. Ross and A. Phelps, “Computing Convolutions Using a Neural
Network Processor.” Patent, 05 2015. US 9697463B2.
[31] J. Ross, “Prefetching Weights for Use in a Neural Network Processor.”
Patent, 05 2015. US 9805304B2.
[32] J. Ross and G. Thorson, “Rotating Data for Neural Network Computa-
tions.” Patent, 05 2015. US 9747548B2.
[33] Google, “Google Cloud TPU Beta Release.” https://cloud.google.com/
tpu/docs/release-notes, 2018.
[34] Y. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-
Efficient Dataflow for Convolutional Neural Networks,” in Proceedings
of the International Symposium on Computer Architecture (ISCA), 2016.
[35] Intel-Nervana, “Intel Nervana Hardware: Neural Network Processor
(Lake Crest),” 2018.
[36] JEDEC, “High Bandwidth Memory (HBM2) DRAM,” 2018.
[37] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional
Neural Network Computing,” in Proceedings of the International
Symposium on Computer Architecture (ISCA), 2016.
[38] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos,
“Stripes: Bit-serial Deep Neural Network Computing,” in Proceedings
of the International Symposium on Microarchitecture (MICRO), 2016.
[39] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An
Accelerator for Compressed-sparse Convolutional Neural Networks,” in
Proceedings of the International Symposium on Computer Architecture
(ISCA), 2017.
[40] Y. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient
Reconfigurable Accelerator for Deep Convolutional Neural Networks,”
in Proceedings of the International Solid State Circuits Conference
(ISSCC), 2016.
[41] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A Cycle
Accurate Memory System Simulator,” 2011.
[42] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi,
A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “USIMM: the Utah
SImulated Memory Module,” 2012.
[43] Y. Hao, Z. Fang, G. Reinman, and J. Cong, “Supporting Address
Translation for Accelerator-Centric Architectures,” in Proceedings of the
International Symposium on High-Performance Computer Architecture
(HPCA), 2017.
[44] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning
Supercomputer,” in Proceedings of the International Symposium on
Microarchitecture (MICRO), 2014.
[45] E. Park, D. Kim, and S. Yoo, “Energy-Efficient Neural Network
Accelerator Based on Outlier-Aware Low-Precision Computation,” in
Proceedings of the International Symposium on Computer Architecture
(ISCA), 2018.
[46] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” in Proceedings of the
International Conference on Neural Information Processing Systems
(NIPS), 2012.
[47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,”
in Proceedings of the Conference on Computer Vision and Pattern
Recognition (CVPR), 2015.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for
Image Recognition,” in Proceedings of the Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[49] Baidu, “DeepBench: Benchmarking Deep Learning Operations on Dif-
ferent Hardware.” https://github.com/baidu-research/DeepBench, 2017.
[50] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu,
“Exploring the Limits of Language Modeling,” in arxiv.org, 2016.
[51] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, Atten and
Spell: A Neural Network for Large Vocabulary Conversational Speech
Recognition,” in Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2016.
[52] U. Gupta, X. Wang, M. Naumov, C. Wu, B. Reagen, D. Brooks,
B. Cottel, K. Hazelwood, B. Jia, H. S. Lee, A. Malevich, D. Mudigere,
M. Smelyanskiy, L. Xiong, and X. Zhang, “The Architectural Impli-
cations of Facebook’s DNN-based Personalized Recommendation,” in
arxiv.org, 2019.
[53] A. Krizhevsky, “One Weird Trick For Parallelizing Convolutional Neural
Networks.” https://arxiv.org/abs/1404.5997, 2014.
[54] NVIDIA, “NVLINK High-Speed Interconnect,” 2016.
[55] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M.
Aamodt, and V. J. Reddi, “GPUWattch : Enabling Energy Optimizations
in GPGPUs,” in Proceedings of the International Symposium on
Computer Architecture (ISCA), 2013.
[56] M. Horowitz, “Energy Table for 45nm Process,” 2013.
[57] HP Labs, “CACTI: An Integrated Cache and Memory Access Time,
Cycle Time, Area, Leakage, and Dynamic Power Model.” http://www.
hpl.hp.com/research/cacti/, 2016.
[58] Intel, “Intel 64 and IA-32 Architectures Software Developer’s Manual
Volume 3A: System Programming Guide Part 1.,” 2018.
[59] AMD, “Developer Guides, Manuals and ISA Documents – AMD,” 2018.
[60] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel,
A. Ramirez, and D. Nellans, “Beyond the Socket: NUMA-Aware GPUs,”
in Proceedings of the International Symposium on Microarchitecture
(MICRO), 2017.
[61] Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel, “Coordinated
and Efficient Huge Page Management with Ingens,” in OSDI, 2016.
[62] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J.
Rossbach, and O. Mutlu, “Mosaic: a GPU memory manager with
application-transparent support for multiple page sizes,” in Proceedings
of the International Symposium on Microarchitecture (MICRO), 2017.
[63] M. Shah, R. Golla, G. Grohoski, P. Jordan, J. Barreh, J. Brooks,
M. Greenberg, G. Levinsky, M. Luttrell, C. Olson, Z. Samoail,
M. Smittle, and T. Ziaja, “Sparc T4: A Dynamically Threaded Server-
on-a-chip,” in IEEE Micro, 2012.
[64] Intel, “4th Generation Intel Core Processor, Codenamed Haswell,” 2013.
[65] B. L. Jacob and T. N. Mudge, “A Look at Several Memory Management
Units, TLB-Refill Mechanisms, and Page Table Organizations,” in
Proceedings of the International Conference on Architectural Support
for Programming Languages and Operation Systems (ASPLOS), 1998.
[66] A. Saulsbury, F. Dahlgren, and P. Stenstrom, “Recency-based TLB-
preloading,” in Proceedings of the International Symposium on Computer
Architecture (ISCA), 2000.
[67] G. B. Kandiraju and A. Sivasubramaniam, “Going the Distance for
TLB Prefetching: An Application-driven Study,” in Proceedings of the
International Symposium on Computer Architecture (ISCA), 2002.
[68] A. Bhattacharjee, D. Lustig, and M. Martonosi, “Shared Last-level
TLBs for Chip Multiprocessors,” in Proceedings of the International
Symposium on High-Performance Computer Architecture (HPCA), 2011.
[69] A. Bhattacharjee, “Large-reach Memory Management Unit Caches,”
in Proceedings of the International Symposium on Microarchitecture
(MICRO), 2013.
[70] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous
Machine-learning,” in Proceedings of the International Conference
on Architectural Support for Programming Languages and Operation
Systems (ASPLOS), 2014.
[71] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the
Sensor,” in Proceedings of the International Symposium on Computer
Architecture (ISCA), 2015.
[72] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou,
and Y. Chen, “PuDianNao: A Polyvalent Machine Learning Accelerator,”
in Proceedings of the International Conference on Architectural Support
for Programming Languages and Operation Systems (ASPLOS), 2015.
[73] Z. Du, D. Rubin, Y. Chen, L. He, T. Chen, L. Zhang, C. Wu, and
O. Temam, “Neuromorphic Accelerators: A Comparison Between
13
Neuroscience and Machine-Learning Approaches,” in Proceedings of
the International Symposium on Microarchitecture (MICRO), 2015.
[74] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. Lee, J. Miguel,
H. Lobato, G. Wei, and D. Brooks, “Minerva: Enabling Low-Power,
High-Accuracy Deep Neural Network Accelerators,” in Proceedings of
the International Symposium on Computer Architecture (ISCA), 2016.
[75] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and
Y. Xie, “A Novel Processing-in-memory Architecture for Neural Network
Computation in ReRAM-based Main Memory,” in Proceedings of the
International Symposium on Computer Architecture (ISCA), 2016.
[76] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen,
“Cambricon: An Instruction Set Architecture for Neural Networks,” in
Proceedings of the International Symposium on Computer Architecture
(ISCA), 2016.
[77] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.
Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convo-
lutional Neural Network Accelerator with In-Situ Analog Arithmetic in
Crossbars,” in Proceedings of the International Symposium on Computer
Architecture (ISCA), 2016.
[78] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay,
“Neurocube: A Programmable Digital Neuromorphic Architecture with
High-Density 3D Memory,” in Proceedings of the International Sympo-
sium on Computer Architecture (ISCA), 2016.
[79] R. LiKamWa, Y. Hou, M. Polansky, Y. Gao, and L. Zhong, “RedEye:
Analog ConvNet Image Sensor Architecture for Continuous Mobile
Vision,” in Proceedings of the International Symposium on Computer
Architecture (ISCA), 2016.
[80] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdan-bakhsh, J. Kim,
and H. Esmaeilzadeh, “TABLA: A Unified Template-based Framework
for Accelerating Statistical Machine Learning,” in Proceedings of the
International Symposium on High-Performance Computer Architecture
(HPCA), 2016.
[81] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. Kim, C. Shao, A. Misra,
and H. Esmaeilzadeh, “From High-level Deep Neural Models to FPGAs,”
in Proceedings of the International Symposium on Microarchitecture
(MICRO), 2016.
[82] D. Moss, E. Nurvitadhi, J. Sim, A. Mishra, D. Marr, S. Subhaschandra,
and P. Leong, “High Performance Binary Neural Networks on the
Xeon+FPGA Platform,” in Proceedings of the International Conference
on Field Programmable Logic and Applications (FPL), 2017.
[83] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS:
Scalable and Efficient Neural Network Acceleration with 3D Memory,”
in Proceedings of the International Conference on Architectural Support
for Programming Languages and Operation Systems (ASPLOS), 2017.
[84] D. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim,
A. Mishra, D. Marr, S. Subhaschandra, and P. Leong, “A Customizable
Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA
Platform: A Deep Learning Case Study,” in Proceedings of the ACM
International Symposium on Field-Programmable Gate Arrays (FPGA),
2018.
[85] Y. Kwon and M. Rhu, “A Case for Memory-Centric HPC System
Architecture for Training Deep Neural Networks,” in IEEE Computer
Architecture Letters, 2018.
[86] Y. Kwon and M. Rhu, “Beyond the Memory Wall: A Case for
Memory-Centric HPC System for Deep Learning,” in Proceedings of
the International Symposium on Microarchitecture (MICRO), 2018.
[87] Y. Kwon, Y. Lee, and M. Rhu, “TensorDIMM: A Practical Near-Memory
Processing Architecture for Embeddings and Tensor Operations in
Deep Learning,” in Proceedings of the International Symposium on
Microarchitecture (MICRO), 2019.
[88] Y. Choi and M. Rhu, “PREMA: A Predictive Multi-task Scheduling Al-
gorithm For Preemptible Neural Processing Units,” in Proceedings of the
International Symposium on High-Performance Computer Architecture
(HPCA), February 2020.
[89] Y. Kwon and M. Rhu, “A Disaggregated Memory System for Deep
Learning,” in IEEE Micro, 2019.
[90] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally,
“EIE: Efficient Inference Engine on Compressed Deep Neural Network,”
in Proceedings of the International Symposium on Computer Architecture
(ISCA), 2016.
[91] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An Accelerator for Sparse Neural Networks,”
in Proceedings of the International Symposium on Microarchitecture
(MICRO), October 2016.
[92] J. Albericio, A. Delmas, P. Judd, S. Sharify, G. O’Leary, R. Genov,
and A. Moshovos, “Bit-pragmatic Deep Neural Network Computing,”
in Proceedings of the International Symposium on Microarchitecture
(MICRO), 2017.
[93] G. Venkatesh, E. Nurvitadhi, and D. Marr, “Accelerating Deep Convo-
lutional Networks using Low-precision and Sparsity,” in Proceedings
of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2017.
[94] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong,
Y. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh,
“Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural
Networks?,” in Proceedings of the ACM International Symposium on
Field-Programmable Gate Arrays (FPGA), 2017.
[95] P. Whatmough, S. Lee, H. Lee, S. Rama, D. Brooks, and G. Wei,
“A 28nm SoC with a 1.2 GHz 568nJ/Prediction Sparse Deep-Neural-
Network Engine with >0.1 Timing Error Rate Tolerance for IoT
Applications,” in Proceedings of the International Solid State Circuits
Conference (ISSCC), February 2017.
[96] P. Whatmough, S. Lee, N. Mulholland, P. Hansen, S. Kodali, D. Brooks,
and G. Wei, “DNN ENGINE: A 16nm Sub-uJ Deep Neural Network
Inference Accelerator for the Embedded Masses,” in Hot Chips: A
Symposium on High Performance Chips, August 2017.
[97] A. Delmas, P. Judd, D. Stuart, Z. Poulos, M. Mahmoud, S. Sharify,
M. Nikolic, and A. Moshovos, “Bit-Tactical: Exploiting Ineffectual
Computations in Convolutional Neural Networks: Which, Why, and
How.” https://arxiv.org/abs/1803.03688, 2018.
[98] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W.
Keckler, “Compressing DMA Engine: Leveraging Activation Sparsity
for Training Deep Neural Networks,” in Proceedings of the International
Symposium on High-Performance Computer Architecture (HPCA), 2018.
14
