MgX: Near-Zero Overhead Memory Protection with an Application to Secure
  DNN Acceleration by Hua, Weizhe et al.
Weizhe Hua, Muhammad Umar, Zhiru Zhang, and G. Edward Suh
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY
{wh399, mu94, zhiruz, gs272}@cornell.edu
ABSTRACT
In this paper, we propose MgX, a near-zero overhead mem-
ory protection scheme for hardware accelerators. MgX mini-
mizes the performance overhead of off-chip memory encryp-
tion and integrity verification by exploiting the application-
specific aspect of accelerators. Accelerators tend to explicitly
manage data movement between on-chip and off-chip mem-
ory, typically at an object granularity that is much larger
than cache lines. Exploiting these accelerator-specific char-
acteristics, MgX generates version numbers used in memory
encryption and integrity verification only using on-chip state
without storing them in memory, and also customizes the
granularity of the memory protection to match the granularity
used by the accelerator. To demonstrate the applicability of
MgX, we present an in-depth study of MgX for deep neural
networks (DNNs) and also describe implementations for H.264
video decoding and genome alignment. Experimental results
show that applying MgX has less than 1% performance over-
head for both DNN inference and training on state-of-the-art
DNN architectures.
1. INTRODUCTION
As technology scaling slows down, computing systems
are increasingly relying on hardware accelerators to improve
performance and energy efficiency. For example, modern ML
models such as deep neural networks (DNNs) are often quite
compute-intensive and increasingly run on hardware accel-
erators [8, 29] for both performance and energy efficiency.
Similarly, hardware accelerators are widely used for other
compute-intensive workloads such as video decoding, sig-
nal processing, cryptographic operations, genome assembly,
etc. This paper proposes a novel off-chip memory protection
scheme for hardware accelerators, named MgX (Memory
guard for Xelerators), using secure DNN acceleration as a
primary example application.
In many applications, the hardware accelerators may pro-
cess private or sensitive data, which need strong security
protection. For example, ML algorithms often require collect-
ing, storing, and processing a large amount of personal and
potentially private data from users to train a model. Moreover,
due to their high computational demand, both training and
inference are often performed on a remote server rather than a
client device such as a smartphone, implying that the private
data and ML models need to be stored in a remote server.
Unfortunately, in traditional computing systems, private user
data may be easily exposed or misused by the remote server
if it is either compromised or malicious.
Figure 1: Secure ML acceleration. A secure accelerator keeps all sensitive information including inputs, outputs, training data, and ML model parameters (weights) encrypted.

A promising approach to provide strong confidentiality and
integrity guarantees even under untrusted software and potential
physical tampering is to rely on trusted hardware to create
a hardware-protected execution environment. This approach
has primarily been studied in the context of general-purpose
processors in the past. This paper considers extending this
approach to accelerators. Figure 1 illustrates the approach in
the context of a DNN accelerator. In order to protect sensi-
tive data, the secure DNN accelerator keeps all confidential
information including inputs, outputs, training data, and net-
work parameters (weights) in an encrypted form outside of a
trusted hardware boundary such as a custom ASIC, an FPGA
accelerator, or an accelerator IP in an SoC. Each secure ac-
celerator contains a unique private key that can only be used
by the accelerator hardware itself. Users can authenticate the
accelerator remotely using the corresponding public key and
a certificate from the accelerator vendor and also send their
private data and model parameters encrypted, which can only
be decrypted and processed by the trusted accelerator. The
secure accelerator also ensures that the ML computation can-
not be tampered with by protecting the integrity of off-chip
data. In this way, the secure DNN accelerator can ensure
that private user data and weights cannot be accessed by an
adversary even if they control the entire software stack on the
system that contains the accelerator or can even physically
access the off-chip DRAM.
The cryptographic protection of off-chip memory, namely
memory encryption and integrity verification, represents an
essential technology to enable the hardware-protected secure
execution environment. The off-chip memory protection also
represents the main source of performance overhead in the
traditional secure processor designs [14, 38, 50, 53]. For a
general-purpose processor, the memory protection schemes
need to be able to handle any sequence of memory accesses
to arbitrary memory locations, and typically protect memory
accesses at a cache-block granularity. Each cache block is
encrypted before being written back to memory, and decrypted
on a read. To hide decryption latency, counter-mode
encryption is often used, where a counter value (CTR) is
encrypted with a block cipher to generate an encryption pad
that is XORed with data for encryption. In secure processors,
the counter value is typically a concatenation of the memory
address and a version number (VN) that increments on each
write. The version number for each encrypted block is stored
in memory. To protect integrity of off-chip memory, either
a message authentication code (MAC) or a cryptographic
hash needs to be attached to each cache block in memory.
Moreover, in order to ensure freshness and prevent replay
attacks, the integrity verification requires a tree of MACs.
Unfortunately, the additional VN and MAC accesses can
lead to non-trivial bandwidth and performance overhead for
memory-intensive workloads.

arXiv:2004.09679v1 [cs.CR] 20 Apr 2020
In this paper, we show that memory encryption and in-
tegrity verification can be performed with almost no perfor-
mance overhead for an application-specific accelerator by
customizing protection to the accelerator-specific memory
access pattern. We make key observations that the application-
specific accelerators typically move data between on-chip and
off-chip memory at a larger granularity than a cache block,
and that the off-chip accesses are explicitly performed by the
accelerator following a relatively simple control flow. The
coarse-granularity data movements imply that the version
numbers for memory encryption and the MACs for integrity
verification can be maintained at a coarse granularity to re-
duce the overhead. Moreover, the relatively simple mem-
ory access patterns and the smaller number of version num-
bers suggest that version numbers can often be either stored
on-chip or generated from the on-chip state without storing
them in off-chip memory.
We study the memory access behaviors of DNNs such as
convolutional neural networks (CNNs) and recurrent neu-
ral networks (RNNs) for both inference and training, and
show how the version numbers can be determined even when
dynamic pruning is used. By generating version numbers on-
chip and performing protection at an application-specific
granularity, MgX can eliminate most of the overhead for off-chip
memory protection; no version number is stored in the off-
chip memory, no integrity tree is needed, and each MAC/hash
protects a large amount of data instead of one cache block.
We also study the applicability of MgX for H.264 video de-
coding and genome assembly acceleration using open-source
RTL implementations, and find that version numbers can
also be calculated from on-chip state.
We evaluate the overhead of MgX in the context of secure
DNN accelerators using ChaiDNN [58], an open-source DNN
accelerator from Xilinx, as the baseline. The experimental
results show that MgX can provide memory encryption and
integrity verification with almost no overhead in both performance
and off-chip memory traffic. On the other hand, applying
the existing general-purpose protection schemes leads to
20-30% overhead and even higher overhead with lower memory
bandwidth. MgX also reduces the on-chip area overhead
of the traditional memory protection schemes as it does not
require any caches for version numbers (VNs) and MACs.
This paper makes the following major contributions:
• We propose MgX, a near-zero overhead memory protec-
tion scheme for accelerators. MgX minimizes the per-
formance overhead of memory protection by assigning
counter values for data and performing coarse-grained
memory protection.
• We demonstrate the applicability of MgX by showing
a concrete implementation of MgX for DNN, and detailed
analyses of an H.264 video decoder and a genome
assembly accelerator.
• We evaluate the secure DNN accelerator with MgX and
show that the overhead is less than 1% for both DNN
inference and training on the state-of-the-art models.

Figure 2: Memory encryption and integrity verification.
(a) The traditional memory encryption and integrity verification scheme. The plaintext (U) is encrypted with the CTR, which consists of the address (PA) and a VN. The MACs and VNs associated with the encrypted data (V) are stored in DRAM. A Merkle tree is built for the VNs to guarantee freshness.
(b) MgX. The VN is generated on-chip by the getVN function, thereby eliminating the off-chip storage for VNs and the Merkle tree. The MAC is calculated over each object to reduce the overhead of integrity verification.
2. MGX: LOW-OVERHEAD MEMORY PROTECTION FOR ACCELERATORS
This section provides the background on the state-of-the-
art in memory protection, and presents the proposed memory
protection scheme for accelerators and its security analysis.
2.1 Memory Protection Basics
Memory protection schemes typically use a symmetric-key
block cipher such as the Advanced Encryption Standard (AES) [40]
to encrypt off-chip memory for confidentiality and MACs (or
hashes) for integrity.
2.1.1 Memory Encryption
For memory encryption, as depicted in Figure 2(a), existing
techniques [18, 23, 50] typically use the counter mode so that
the AES operation can be overlapped with memory accesses.
The counter-mode encryption requires a non-repeating value
to be used for each encryption under the same AES key. In
this paper, we call this value the counter. In a secure processor,
the counter value often consists of the physical address
(PA) of a data block (e.g., a cache block) that will be en-
crypted and a (per-block) version number that is incremented
on each memory write (for the block). When a data block is
written, the memory protection unit increments the version
number and then encrypts the data. When a data block is read,
the memory protection unit retrieves the version number used
to encrypt the data block and then decrypts the block. Let
$k_{ENC}$, $U$, and $V$ be the AES encryption key, plaintext, and ciphertext,
respectively. The AES encryption can be formulated as
follows, where $\|$ represents bit-wise concatenation.
$$V = U \oplus \mathrm{AES}_{k_{ENC}}(PA \,\|\, VN) \tag{1}$$
Because a general-purpose processor can have an arbitrary
memory access pattern that depends on the program that is
executing, the version number for each data block, which
represents the number of writes to that block, can be any
value at a given time. As a result, a general-purpose secure
processor typically needs to store the version numbers in
memory along with encrypted data in order to determine the
correct version number for a later read. Moreover, to avoid
re-using the same counter value, the AES key needs to change
once the version number reaches its maximum, which implies
that the size of the version number needs to be large enough
to avoid frequent re-encryption. For example, the memory
encryption engine in Intel SGX [18] uses a 56-bit version
number for each 64-byte data block, which introduces 11%
storage and bandwidth overhead. In general, the version
numbers cannot fit on-chip and are stored in DRAM.
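As a minimal sketch of Equation (1), the following Python code derives a pad from PA || VN and XORs it with the data, and then recomputes the version-number overhead quoted above. It is illustrative only: HMAC-SHA-256 from the standard library stands in for the AES block cipher, and the function names `pad` and `xor_with_pad`, the 8-byte field widths, and the demo key are assumptions made here rather than part of the original design.

```python
import hashlib
import hmac

BLOCK_BYTES = 64  # one cache block, as in the text


def pad(key: bytes, pa: int, vn: int, length: int = BLOCK_BYTES) -> bytes:
    """Derive an encryption pad from the counter PA || VN.

    Stand-in: HMAC-SHA-256 replaces AES so only the standard library is needed.
    """
    counter = pa.to_bytes(8, "big") + vn.to_bytes(8, "big")  # PA || VN
    out = b""
    chunk = 0
    while len(out) < length:
        out += hmac.new(key, counter + chunk.to_bytes(4, "big"), hashlib.sha256).digest()
        chunk += 1
    return out[:length]


def xor_with_pad(key: bytes, pa: int, vn: int, data: bytes) -> bytes:
    """V = U xor pad(PA || VN). The same call also decrypts."""
    p = pad(key, pa, vn, len(data))
    return bytes(a ^ b for a, b in zip(data, p))


key = b"k_ENC-demo-key"       # hypothetical key, for illustration only
plaintext = bytes(range(BLOCK_BYTES))
vn = 1                        # incremented on each write to the block
ciphertext = xor_with_pad(key, 0x1000, vn, plaintext)
recovered = xor_with_pad(key, 0x1000, vn, ciphertext)

# Storage overhead of a 56-bit version number per 64-byte block:
# 56 / 512 = 0.109375, about 11%.
vn_overhead = 56 / (BLOCK_BYTES * 8)
```

Because the pad depends only on the counter, it can be computed while the memory access is in flight, which is why the text notes that counter mode hides decryption latency.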
2.1.2 Integrity Verification
To prevent off-chip data from being altered by an attacker,
integrity verification cryptographically checks if the value
from the off-chip memory is the most recent value written to
the address by the processor. For this purpose, a MAC of the
data value, the memory address, and the version number can
be computed and stored for each data block on a write, and
checked on a read from DRAM. However, only checking the
MAC cannot guarantee the freshness of the data; a replay
attack can replace the data and the corresponding VN and
MAC in DRAM with stale values without being detected. In
order to defeat the replay attack, a Merkle tree (i.e., hash
tree) [16] needs to be used to verify the MACs hierarchically
in a way that the root of the tree is stored on-chip. As shown
in Figure 2(a), a state-of-the-art method [43] uses a Merkle
tree to protect the integrity of the version numbers in memory,
and includes a VN in a MAC to ensure the freshness of
data. Previous works propose to use HMAC-SHA-1 [43],
Carter-Wegman MAC [18], and AES-GCM [11] as the hash
function. Let us denote the key for the hash function, plaintext,
and ciphertext as $k_{IV}$, $U$, and $V$, respectively. The MAC of a data
block can be calculated as:
$$MAC = H_{k_{IV}}(V, PA \,\|\, VN) \tag{2}$$
The overhead of integrity verification is nontrivial as it
requires traversing the tree stored in the off-chip memory. To
mitigate this overhead, integrity verification engines typically
use a cache to store recently verified MACs.
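The role of the version number in Equation (2) can be illustrated with a small sketch. It assumes HMAC-SHA-256 as the keyed hash H (the text lists HMAC-SHA-1, Carter-Wegman MAC, and AES-GCM as options used in prior work); the helper names `mac` and `verify`, the field widths, and the demo key are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, v: bytes, pa: int, vn: int) -> bytes:
    """MAC = H_k(V, PA || VN), with HMAC-SHA-256 as an assumed choice of H."""
    msg = v + pa.to_bytes(8, "big") + vn.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()


def verify(key: bytes, v: bytes, pa: int, vn: int, tag: bytes) -> bool:
    """Recompute the MAC with the given VN and compare in constant time."""
    return hmac.compare_digest(mac(key, v, pa, vn), tag)


k_iv = b"k_IV-demo-key"  # hypothetical key, for illustration only
pa = 0x2000

# Two successive writes to the same address with version numbers 1 and 2.
old_v = b"old ciphertext"
old_tag = mac(k_iv, old_v, pa, 1)
new_v = b"new ciphertext"
new_tag = mac(k_iv, new_v, pa, 2)

trusted_vn = 2  # the version number the verifier knows to be current
fresh_ok = verify(k_iv, new_v, pa, trusted_vn, new_tag)
replay_ok = verify(k_iv, old_v, pa, trusted_vn, old_tag)

# If the attacker could also roll the VN back (because it lives in DRAM
# without a Merkle tree), the stale pair would pass verification.
rolled_back_ok = verify(k_iv, old_v, pa, 1, old_tag)
```

The last check is the reason the traditional scheme protects the stored version numbers with a Merkle tree: the MAC only guarantees freshness when the version number used for verification is itself trustworthy.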
2.2 Intuition
The main overhead of traditional memory encryption and
integrity verification comes from storing and accessing the
VNs and MACs in the off-chip memory. Hardware acceler-
ators, especially the memory-intensive ones such as video
encoding/decoding, neural network, and DNA sequencing
accelerators, often require accessing a large amount of data
in memory. Naïvely applying the traditional general-purpose
memory protection scheme to those accelerators can lead to
non-trivial performance overhead.
Figure 3: The secure accelerator architecture with MgX.
For a specialized accelerator, the memory access pattern is
also customized for a particular application. Each accelerator
has a list of application-specific data structures such as arrays
that it keeps in memory. For performance, instead of relying
on caches, accelerators often explicitly move data between
on-chip memory and DRAM at an object granularity. In most
cases, the size of an object is much larger than a cache line,
and the number of objects is relatively small compared to the
cache-line-sized blocks in memory. The overhead of memory
protection can be reduced significantly if a version number is
allocated per object instead of per cache block.
In addition to coarser memory access granularity, accel-
erators also tend to have simpler memory access patterns
compared to typical programs on CPUs. Control-intensive
applications are often not a great fit for hardware accelera-
tion, and the on-chip control unit of an accelerator needs to
manage data movements between on-chip and off-chip mem-
ory. In that sense, an accelerator’s memory access pattern
can often be encoded in a small amount of memory and the
on-chip state of the accelerator contains most of the information
needed to determine off-chip access patterns. This
implies that an accelerator itself can often determine version
numbers without off-chip memory.
We propose to leverage these observations to optimize
off-chip memory protection by increasing the granularity
of protection to match the data movement granularity and
generating version numbers from the on-chip state instead
of storing them in memory. In other words, an accelerator
or its designer needs to choose the protection granularity
and provide version numbers to a memory protection unit.
We call this memory protection scheme MgX. If version
numbers can be efficiently determined using on-chip state at
run-time, they no longer need to be stored in DRAM, which
also makes the Merkle tree unnecessary. The performance
overhead of the memory encryption and integrity verifica-
tion in MgX is largely removed as they require no off-chip
memory accesses for the VNs and MACs for the VNs. The
only extra memory accesses come from reading and writ-
ing the MACs for verifying the integrity of data blocks. We
can also lower the MAC overhead by applying MACs at an
object granularity, where a MAC is calculated for each mem-
ory object that an accelerator reads/writes at a time. In this
way, the memory encryption and integrity verification can be
performed with almost zero overhead.
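To make the granularity argument concrete, the following back-of-the-envelope sketch compares metadata size per byte of data under the two approaches. The 56-bit version number and 64-byte block come from the SGX example above; the 64-bit MAC and the 64 KiB object size are assumptions chosen only for illustration, and the calculation ignores Merkle tree nodes and metadata caching, so it is not a performance estimate.

```python
BLOCK_BYTES = 64           # cache-block granularity (from the text)
VN_BITS = 56               # per-block version number (SGX example in the text)
MAC_BITS = 64              # assumed MAC size, not given in the text
OBJECT_BYTES = 64 * 1024   # assumed object size for an accelerator


def per_block_metadata_ratio(block_bytes: int, vn_bits: int, mac_bits: int) -> float:
    """Metadata bits stored per data bit when every block has its own VN and MAC."""
    return (vn_bits + mac_bits) / (8 * block_bytes)


def per_object_metadata_ratio(object_bytes: int, mac_bits: int) -> float:
    """Metadata bits per data bit with one MAC per object and no stored VN."""
    return mac_bits / (8 * object_bytes)


traditional = per_block_metadata_ratio(BLOCK_BYTES, VN_BITS, MAC_BITS)
mgx = per_object_metadata_ratio(OBJECT_BYTES, MAC_BITS)
```

Under these assumed sizes the per-block scheme stores 120 metadata bits for every 512 data bits, while the per-object scheme stores 64 bits for every 64 KiB, which is the intuition behind the near-zero overhead claim.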
2.3 MgX Scheme
MgX provides application-specific memory protection by
matching the access granularity of an accelerator and generat-
ing VNs using an on-chip state. In MgX, the accelerator itself
is modified to choose the protection granularity and generate
a VN when it issues a memory request. Instead of storing the
VNs in the off-chip memory, the version number generator,
as depicted in Figure 3, holds the MgX state in an on-chip
memory and produces the VNs of objects based on the MgX
state, the on-chip accelerator state, and the object identifier
of each object. The size of the on-chip state depends on the
memory access pattern of the accelerator and the number of
objects existing in the application. The VN generator con-
sists of two main functions — version generation (getVN)
and state update functions (updateS). For memory read and
write operations, getVN calculates the version number of
an object based on the object identifier ($ID_{obj}$), the on-chip
accelerator state ($S_{Xcel}$), and the on-chip MgX state ($S_{MgX}$).
$$VN_{ID_{obj}} = \mathrm{getVN}(S_{MgX}, S_{Xcel}, ID_{obj}) \tag{3}$$
The state update function is called when the on-chip state
needs to be updated. The on-chip state is updated based on
the current MgX and accelerator states.
$$\mathrm{updateS}(S_{MgX}, S_{Xcel}) \tag{4}$$
As shown in Figure 2(b), once the VN for reading/writing
an object is generated, the Enc and IV engines can encrypt,
decrypt, and verify that object using the same equations in (1)
and (2). The Enc and IV engines in MgX use standard AES
counter-mode and keyed hash. As the VNs are generated
on-chip and do not need to be verified, the MgX scheme does
not need the Merkle tree in the off-chip memory. For security
and correctness, the version number generation must satisfy
the following requirements.
• security: The generated version number must be differ-
ent for each write to a particular memory address.
• correctness: The generated version number for a read
must match the value used for the most recent write to
the same address, a requirement for correct decryption.
Note that sharing a version number among multiple memory
locations does not sacrifice security, as the input to the
block cipher (the counter value) in counter mode already
includes a memory address in addition to a version number.
Also, note that generating version numbers in MgX does not
require static memory access patterns. Reads do not affect
the version number no matter how irregular they are. Writes
can also happen in an arbitrary order using one version number
as long as they occur once per address. Skipping
writes and only using a portion of an object can also be done
with one version number per object as long as the skipped
locations do not need to be read later. Finally, the version
numbers can be stored on-chip as long as they fit.
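The two requirements can be expressed as a small checker that records every write and flags violations. This is a hypothetical testing aid written for illustration (the class name `RequirementChecker` and its methods are not part of MgX); it simply encodes the security and correctness conditions stated above.

```python
class RequirementChecker:
    """Tracks (address, VN) pairs to check the two version-number requirements."""

    def __init__(self):
        self.used = set()   # every (address, vn) pair ever used for a write
        self.latest = {}    # address -> vn of the most recent write

    def on_write(self, address: int, vn: int) -> None:
        # Security: a counter (address, vn) must never be used for two writes.
        if (address, vn) in self.used:
            raise ValueError("security violated: counter reused for a write")
        self.used.add((address, vn))
        self.latest[address] = vn

    def on_read(self, address: int, vn: int) -> None:
        # Correctness: a read must use the VN of the most recent write.
        if self.latest.get(address) != vn:
            raise ValueError("correctness violated: VN does not match last write")


checker = RequirementChecker()
obj = [0x100, 0x140, 0x180, 0x1C0]    # addresses of one object (hypothetical)

# Writes in arbitrary order with one shared VN; obj[1] is skipped.
for address in (obj[2], obj[0], obj[3]):
    checker.on_write(address, vn=7)

# Reads may repeat and occur in any order without changing the VN.
checker.on_read(obj[0], vn=7)
checker.on_read(obj[3], vn=7)
checker.on_read(obj[0], vn=7)

# A second write to the same address with the same VN is rejected.
try:
    checker.on_write(obj[0], vn=7)
    reuse_detected = False
except ValueError:
    reuse_detected = True
```

In this sketch a read of the skipped address `obj[1]` would also raise, matching the condition that skipped locations must not be read later.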
2.4 Security Analysis
Encryption – MgX uses the same AES counter-mode en-
cryption that is used by the traditional memory encryption
scheme for processors. The only difference between MgX
and the traditional scheme is that the version number in MgX
is determined on-chip instead of being stored in off-chip
memory. As long as the version number generator guarantees
that the version number is unique for each write to a given
memory location (security requirement), the counter value,
which is a concatenation of the memory address and the ver-
sion number, is different for each encryption. Therefore, the
security of the memory encryption in MgX can be reduced to
the AES counter-mode encryption, which is one of the secure
approved modes of operation [10, 34].
Integrity Verification – MgX uses a stateful MAC to pro-
tect the integrity of data in memory. The MAC includes the
address and the version number in addition to data as shown
in Equation (2). This MAC construction is identical to the
one that is used in the traditional integrity verification scheme
(shown in Figure 2(a)), and protects against replay, relocation,
and substitution attacks [39], as long as the version numbers
are unique for each write to a location. In the traditional
scheme, the version numbers need to be protected separately
using a Merkle tree because they are stored in off-chip memory.
In MgX, the security requirement ensures that the counter
value is unique for each memory write. Also, because version
numbers are generated on-chip, they cannot be directly tampered
with by an attacker. Thus, the integrity protection in MgX
can be reduced to that of the chosen keyed hash function.
3. MGX FOR DNN ACCELERATION
This section introduces the background on DNNs and dis-
cusses how MgX can be applied to enable efficient memory
encryption and integrity verification for secure DNN compu-
tation even in an untrusted environment.
3.1 DNN Basics
DNNs mainly consist of five types of layers: convolutional
(conv), dense, normalization, activation, and pooling layers.
A DNN typically performs the normalization and activation
operations after each conv/dense layer followed by an op-
tional pooling operation. These four operations are often
merged and performed together in a DNN accelerator for effi-
ciency. Thus, in the context of off-chip memory protection,
we only consider the conv and dense layers in DNNs.
Inference – The DNN inference is usually executed in
a layer-by-layer fashion, where each layer takes either an
external input (e.g., the first layer) or input features generated
by the previous layer(s) to produce output features for the
subsequent layer(s). For each conv/dense layer, the DNN
accelerator fetches the input features (x) and weights (w)
from the off-chip DRAM, generates the output features (y)
by computing y = w∗x, and stores the output features back
to DRAM. The DNN inference finishes after executing the
last layer and generates a prediction for the given input.
Training – One iteration of the DNN training consists of
a forward propagation and a backpropagation. The forward
pass is essentially the same as the inference except that the
DNN training requires computing the loss with respect to the
ground truth label. After the loss is calculated, it is propagated
in a backward manner through the entire network. For each
layer, the DNN accelerator fetches the gradients from the
subsequent layer (gy), input features (x), and weights (w)
from the off-chip DRAM, computes the gradients toward the
input features (gx = gy ∗ w) and weights (gw = gy ∗ x), updates
the weights using the calculated gradients toward the weights
(w += −α · gw, where α is the learning rate), and stores
the gradients toward the input features back to the DRAM.
The gradients toward the inputs (gx) are used as the output
gradients (gy) for the previous layer. The backpropagation
continues until reaching the first layer of the network.

Figure 4: The high-level architecture of a secure DNN accelerator. The on-chip MgX state of the VN generator consists of the input number ($CTR_I$) and the model number ($CTR_W$).
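The backward step for a single layer can be sketched in plain Python for a dense layer y = W x, where the chain rule gives gx = Wᵀ gy and gw = gy xᵀ. The function name `dense_backward` and the small numeric example are assumptions made for illustration; a real accelerator would tile these operations and move the operands between DRAM and on-chip buffers.

```python
def dense_backward(w, x, gy, lr):
    """One backward step for a dense layer y = W x.

    w  : weight matrix as a list of rows
    x  : input features (read from DRAM during backpropagation)
    gy : gradients from the subsequent layer
    lr : learning rate (alpha in the text)
    Returns (gx, gw, updated weights).
    """
    rows, cols = len(w), len(w[0])
    # Gradient toward the input features: gx = W^T gy
    gx = [sum(w[i][j] * gy[i] for i in range(rows)) for j in range(cols)]
    # Gradient toward the weights: gw = gy x^T (outer product)
    gw = [[gy[i] * x[j] for j in range(cols)] for i in range(rows)]
    # Weight update: w += -lr * gw
    new_w = [[w[i][j] - lr * gw[i][j] for j in range(cols)] for i in range(rows)]
    return gx, gw, new_w


w = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
gy = [0.5, 1.0]
gx, gw, new_w = dense_backward(w, x, gy, lr=0.1)
```

From the memory-protection point of view, the relevant facts are which objects are read (gy, x, w) and which are written (gx and the updated w), since every write needs a fresh version number.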
Static and Dynamic Pruning – Many pruning techniques
have been proposed to reduce the computational cost of
DNNs while maintaining their accuracy. Most previous tech-
niques prune a network by removing the features or weights
statically [20,22,33,37]. As the static pruning approaches are
agnostic to input data at run time, the memory access pattern
remains static for any given input. A more recent line of
research is investigating dynamic pruning [3,5,25,26,41,42],
which skips redundant computations, features, and weights
dynamically at run time. As dynamic pruning exploits input-
specific characteristics, the memory access pattern may vary
for a different input. However, the variations are still limited:
dynamic pruning may skip some of the accesses that exist
statically in the network model; but it does not introduce
accesses that do not exist in the model.
3.2 Threat Model
The goal of a secure DNN accelerator is to protect the
confidentiality and the integrity of a DNN execution in an
environment where only the DNN accelerator itself can be
trusted. For confidentiality, the secure DNN accelerator aims
to protect inputs, outputs, training data, weights, and all
intermediate results. The network architecture of a DNN
model, however, is considered to be public. Following the
typical threat model for secure processors, we assume that
the internal operations and state of an accelerator cannot
be directly observed or modified by an adversary through
physical attacks, whereas anything outside of the accelerator,
including off-chip memory and a host processor, is assumed
untrusted.
We do not consider other physical side-channel attacks
such as the power and EM side channels. However, we
believe that accurately recovering input/output values or net-
work parameters such as weights through physical side chan-
nels will be far more difficult compared to the attacks on
small cryptographic keys. We also do not consider adver-
sarial machine learning attacks that exploit weaknesses in a
model itself. However, a secure DNN accelerator can encrypt
its outputs so that only a particular user can decrypt them.
3.3 High-Level Architecture
Figure 4 illustrates the high-level architecture of a secure
DNN accelerator. As off-chip DRAM is untrusted, the se-
cure accelerator requires a memory protection unit to encrypt
confidential data stored in DRAM and detect unauthorized
changes in values stored in external memory.
Because the interface to the secure accelerator cannot be
trusted, the accelerator needs to provide support for a remote
user to establish trust and securely communicate with the
accelerator. For this purpose, the secure accelerator includes
a unique private key, embedded by a manufacturer. We assume
that a user obtains the corresponding public key using a
public key infrastructure as in Intel SGX or Trusted Platform
Modules (TPMs). The following commands are provided by
the secure accelerator:
• Initialization. The accelerator clears its internal state,
sets a pair of new symmetric keys for encryption and
integrity verification, enables protection mechanisms,
and establishes a secure (encrypted and authenticated)
communication channel with a user using a standard
protocol such as SSL.
• Remote Attestation. The accelerator supports remote
attestation so that the user can verify the identity and
the state including firmware of the accelerator. The
attestation also allows a user to verify the hash of the DNN
definition and the hash of the weights.
• Load Model. A user sends a DNN definition (e.g., pro-
totxt file in Caffe) and weights through the encrypted
channel. The accelerator loads the model by decrypt-
ing it with the communication key, and placing it in
protected memory that is encrypted with the memory
encryption key.
• DNN Inference and Training. The user sends infer-
ence/training data through an encrypted communica-
tion channel. The accelerator runs inference/training
using the DNN model on it, and returns the prediction
results/learned weights encrypted.
3.4 MgX Scheme for DNN Accelerator
In a DNN accelerator, weights and feature maps can be
treated as objects. Knowing the object identifier of the
weights and features and the MgX state, including the DNN
model number ($CTR_W$) and the input number ($CTR_I$), is sufficient
for constructing the counter values efficiently. $CTR_W$
is incremented when a new model is loaded. $CTR_I$ is incremented
for each new input. This section shows how MgX
can be applied to both the inference and training of DNNs
with and without pruning.
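The relationship between the accelerator commands of Section 3.3 and the two counters can be sketched as a small state model. This is an illustrative assumption rather than the authors' implementation: the class and method names are hypothetical, and the sketch assumes that clearing the internal state during initialization also resets both counters, which is safe only because new keys are generated at the same time.

```python
import secrets
from dataclasses import dataclass


@dataclass
class MgXState:
    ctr_w: int = 0  # DNN model number
    ctr_i: int = 0  # input number


class AcceleratorModel:
    """Toy model of how commands update the on-chip MgX state."""

    def __init__(self):
        self.initialize()

    def initialize(self):
        # Fresh keys mean that counter values used under the old keys
        # may be reused without repeating a (key, counter) pair.
        self.k_enc = secrets.token_bytes(16)
        self.k_iv = secrets.token_bytes(16)
        self.state = MgXState()
        self.model_loaded = False

    def load_model(self):
        self.state.ctr_w += 1   # a new model (weights) is loaded
        self.model_loaded = True

    def new_input(self):
        if not self.model_loaded:
            raise RuntimeError("no model loaded")
        self.state.ctr_i += 1   # one increment per inference or training input
        return self.state.ctr_i


acc = AcceleratorModel()
acc.load_model()
for _ in range(3):
    acc.new_input()
counters_after_run = (acc.state.ctr_w, acc.state.ctr_i)

acc.initialize()
counters_after_reset = (acc.state.ctr_w, acc.state.ctr_i)
```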
3.4.1 CNN Inference
The computation of a CNN can be represented as a data-
flow graph, where each layer in the network is a vertex
and the input/output features and the weights of a layer
are edges. Each edge in the graph represents a multidimen-
sional tensor of features and weights. More specifically, we
show two representative subgraphs that are widely used in
modern DNNs — the plain feedforward networks such as
AlexNet [30] and VGG [47] and networks with a bypass path
such as ResNet [21] and DenseNet [27]. As discussed in
Section 3.1, the input and output features and weights are
usually stored in DRAM as the on-chip storage of a DNN accelerator
is not large enough. To compute a layer during CNN
inference, the accelerator reads the input features from the
off-chip memory, performs the computation in a layer, and
writes the output features to the off-chip memory, regardless
of its micro-architecture details and data reuse strategies. The
output features are written by a preceding layer and are read
as the input features by the following layer. In Figure 5, we
show the subgraphs and the corresponding timing diagrams
of a valid scheduling¹ of the two popular network structures.

Figure 5: The subgraph of two popular networks and the associated timing diagrams. (a) Plain feedforward network. (b) Network with a bypass path. A vertex (v) represents a layer and an edge (f) represents the input or output features of a layer. Each column of the timing diagram represents a time slot scheduled for executing a certain vertex (layer) and each row shows the memory operations on a feature edge. W, R, a dash, and an empty slot stand for a memory write, a memory read, no memory operation while the edge exists in DRAM, and an edge that does not exist in DRAM, respectively. The subscript of a feature edge is the vertex ID.
The weights of a DNN model are stored in the off-chip
memory and read-only during the entire execution. Therefore,
we can use a constant as the version number for all weights
until they are updated. As different DNN models can be
loaded on the secure accelerator, we add $CTR_W$ in the on-chip
MgX state to keep track of the version number of the
model. Then, $CTR_W$ serves as the version number of weights.
The output features in memory are updated after each layer
of a DNN model, and their version number needs to increase
for each write. We assign each vertex in the data-flow graph
one fixed positive integer identifier ($vID \in \mathbb{Z}^+$). In the
case where some vertices write to DRAM more than once
(e.g., the partial results of the vertex are written back to the
off-chip memory), we assign k vIDs to the vertex, where k
is the number of times the vertex writes to DRAM. Then,
the outgoing feature edge of a vertex (i.e., output features
of a layer) can use the vID as its version number during a
single execution. For example, in Figure 5, the subscript of
each edge indicates its vID. If the feature maps are written
once at the end of each layer, vID corresponds to the layer
number. Therefore, the CNN inference needs L unique values
for the vIDs, where L is the number of vertices in the graph
(i.e., the number of layers in the CNN). The version number
cannot be reused across multiple executions. For this purpose,
we also add the total number of inputs (CTRI) received by
the secure accelerator to the MgX state, and include that in
version numbers.
Based on the above observations, the MgX algorithm for
the CNN inference is shown in Algorithm 1. For the CNN
inference, the weights and the feature maps are considered as
MgX objects. The MgX state consists of CTRI and CTRW .
For simplicity, the algorithm shows two separate getVN func-
tions for weights (getVNW ()) and feature maps (getVNF ())
instead of passing IDob j as an argument, and also shows two
¹ Other valid schedules of the data-flow graph may exist.
Algorithm 1: The getVN and updateS functions for CNN inference and training.
int getVNW(SMgX) { return SMgX.CTRW; }
int getVNF(SMgX, vID) { return (SMgX.CTRI || vID); }
void updateSI(SMgX) { SMgX.CTRI++; }
void updateSW(SMgX) { SMgX.CTRW++; }
Figure 6: The subgraph of a feedforward network and the associated timing diagrams for training.
updateS functions for CTRI and CTRW . updateSI is called
when there is a new input. updateSW is called when a new
model (weights) is loaded. getVNW simply returns CTRW .
getVNF returns the concatenation of CTRI and vID (|| repre-
sents bitwise concatenation). vID is passed as a part of the
accelerator state. The pseudo-code for the CNN inference,
which uses the MgX functions, is in Algorithm 2.
Note that the version numbers for a CNN do not need to
be stored in the off-chip memory; vID is obtained from the
on-chip CNN state. CTRI and CTRW can be kept in on-chip
registers. As only a small number of registers are needed,
they can easily be made large enough to avoid overflows.
CNNs with fewer than 256 layers require only 8 bits for
vIDs. With 64-bit version numbers, this leaves 56 bits for CTRI, so the secure accelerator can
run 2^56 different inputs before changing its AES key.
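The bit budget described above can be illustrated with a small sketch. It assumes the field widths stated in the text (a 64-bit version number with an 8-bit vID), places CTRI in the upper bits, and uses hypothetical Python names (`get_vn_f`, `max_inputs`) that are not part of the paper's hardware design.

```python
# Field widths taken from the text; the layout (CTRI in the upper bits) is an assumption.
VN_BITS = 64
VID_BITS = 8
CTRI_BITS = VN_BITS - VID_BITS  # 56 bits remain for the input counter


def get_vn_f(ctr_i: int, v_id: int) -> int:
    """Return CTRI || vID packed into a single integer."""
    if not 0 <= v_id < (1 << VID_BITS):
        raise ValueError("vID does not fit in the vID field")
    if not 0 <= ctr_i < (1 << CTRI_BITS):
        raise ValueError("CTRI overflow: the encryption key must be changed")
    return (ctr_i << VID_BITS) | v_id


# Number of distinct inputs that can be processed under one key.
max_inputs = 1 << CTRI_BITS
```

Because the two fields never overlap, two version numbers are equal only if both CTRI and vID are equal, which is the uniqueness property the scheme relies on.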
Algorithm 2: The DNN inference pseudo-code with MgX. object.ptr and object.size return the pointer and the size of an object, respectively. STMgX(object, ptr, size, VN) encrypts the object with VN and stores the object in the memory region (ptr, ptr+size). LDMgX(ptr, size, VN) reads the object from the memory region (ptr, ptr+size) and decrypts it with VN.
Input: input features x_l and weights w_l of layer l
Output: output features x_{l+1} of layer l
updateSI(SMgX);
for l = 1; l ≤ L; l++ do
    w_l = LDMgX(w_l.ptr, w_l.size, getVNW(SMgX));
    x_l = LDMgX(x_l.ptr, x_l.size, getVNF(SMgX, l));
    x_{l+1} = ReLU(w_l ∗ x_l);
    STMgX(x_{l+1}, x_{l+1}.ptr, x_{l+1}.size, getVNF(SMgX, l+1));
end
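As a rough software analogue of the inference loop, the following sketch tracks only version numbers and checks the two properties the scheme needs: no (address, VN) pair is ever written twice, and every read uses the VN of the matching write. The `ToyMgxMemory` class, the two ping-pong buffer addresses, and the `x + 1` stand-in for the layer computation are illustrative assumptions, not the accelerator's actual behavior.

```python
class ToyMgxMemory:
    """Toy off-chip memory that tracks the VN used at each address."""

    def __init__(self):
        self.cells = {}    # address -> (version number, payload)
        self.seen = set()  # every (address, VN) pair ever written

    def st(self, addr, payload, vn):
        # Counter-mode encryption needs a fresh (address, VN) pair per write.
        if (addr, vn) in self.seen:
            raise RuntimeError("version number reused for this address")
        self.seen.add((addr, vn))
        self.cells[addr] = (vn, payload)

    def ld(self, addr, vn):
        stored_vn, payload = self.cells[addr]
        if stored_vn != vn:
            raise RuntimeError("read VN differs from the last write VN")
        return payload


def run_inference(mem, ctr_i, num_layers, x_input):
    """Follow the loop structure of the inference pseudo-code."""
    ctr_i += 1  # updateSI: a new input has arrived

    def vn_f(v_id):  # getVNF: CTRI || vID with an assumed 8-bit vID
        return (ctr_i << 8) | v_id

    mem.st(1 % 2, x_input, vn_f(1))  # input features x_1
    for layer in range(1, num_layers + 1):
        x = mem.ld(layer % 2, vn_f(layer))
        y = x + 1  # placeholder for ReLU(w * x)
        mem.st((layer + 1) % 2, y, vn_f(layer + 1))
    out = mem.ld((num_layers + 1) % 2, vn_f(num_layers + 1))
    return ctr_i, out


mem = ToyMgxMemory()
ctr_i, first = run_inference(mem, 0, 4, 10)
ctr_i, second = run_inference(mem, ctr_i, 4, 0)
```

The two feature buffers are overwritten many times, yet no exception is raised, because the vID changes within an input and CTRI changes across inputs.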
3.4.2 CNN Training
One iteration of training consists of a forward propaga-
tion and a backpropagation. The forward propagation is the
same as the inference except that all features are required for
computing the gradients with respect to the weights during
the backpropagation. Therefore, we focus on the version
number assignments for the gradients and weights during
the backpropagation. In Figure 6, we illustrate the data-flow
Figure 7: The subgraph of an unrolled RNN.
graph and the associated timing diagram of the backpropagation.
During backpropagation, each vertex first computes the
gradients flowing to the previous vertex using the gradients
flowing to the current vertex and the associated weights (e.g.,
g1 = g2 ∗ wc). Then, the associated weights are updated using
the gradients flowing to the current vertex and the saved features
(e.g., w∗c = wc − α · g2 ∗ f2). There is a corresponding gradient
edge for each feature edge during the backpropagation. Simi-
lar to the features in the forward propagation, the gradients
are usually written once and read multiple times.
Based on the memory access pattern of CNN training, we
propose the following version generation algorithm. Each vertex in the
data-flow graph owns one fixed integer vertex identifier (vID).
As all features are read-only during the backpropagation, we
still assign the vID of a vertex as the version number for its outgoing
feature edges. Similarly, the corresponding gradient
edge (g) of each feature edge (f) uses the same version
number. Each pair of feature and gradient objects can use the
same version number because the two objects are stored in different memory
locations in DRAM.
Similar to the inference, we combine vID and CTRI as the
version number for the features and gradients, where CTRI
represents the total number of executed training iterations
and is incremented when the secure accelerator starts a new
iteration. The weights still use CTRW as the version num-
ber as all weights are updated/written the same number of
times. CTRW is used to track the number of updates to the
weights. The MgX state consists of CTRI and CTRW . vID is
passed to the version generation function as the accelerator
state. The updateS functions are called when a new input is
received or the weights are updated. The getVN and updateS
functions remain the same as CNN inference in Algorithm 1.
3.4.3 RNN Inference and Training
The data-flow graph of RNNs contains a feedback loop.
However, as depicted in Figure 7, an RNN with a feedback
loop can be unrolled (or unfolded) into a feedforward net-
work, where the feedback loop is unfolded to a sequence of a
finite number of vertices (layers). After unrolling, the RNN
inference is similar to the CNN inference in Figure 5 except
that the number of inputs and outputs in RNNs can be greater
than one. As the input and output features remain the same
as CNNs, we can apply the same MgX algorithm for RNN
inference.
As there may be many outputs in an RNN, the backpropagation
of an RNN can be viewed as repeating the backpropagation
of a CNN many times. Specifically, the loss of an
RNN can be written as the sum of the losses for each output,
$\mathcal{L}(W, \hat{y}, y) = \sum_i -y_i \log \hat{y}_i$, where $y_i$ and $\hat{y}_i$ are the output
and the ground truth of vertex $i$. The loss of each output
should be back-propagated to the vertices before this output.
Thus, the version generation algorithm can be extended to
handle the RNN training.
3.4.4 Static and Dynamic Pruning
Table 1: The number of feature map accesses per image for ResNet-50 on ImageNet with different pruning algorithms.
Pruning technique          | CSR/CSC | RLC  | Channel pruning
min. # of accesses (×10^3) | 1936    | 1322 | 1796
max. # of accesses (×10^3) | 2239    | 1426 | 1827
Static pruning approaches still result in a static network
model. The same version generation approach can be ap-
plied to the pruned model to determine the version numbers.
Therefore, MgX is applicable to the statically pruned DNN
models. At a glance, it may appear that the MgX scheme
does not work for dynamic pruning, which skips memory
accesses for some features and weights at run time. How-
ever, skipping version numbers does not affect the security
of memory encryption or integrity verification as long as the
version numbers are not reused. The decryption and integrity
verification will also be functionally correct as long as a write
and the corresponding reads use the same version number.
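The two conditions above (no reuse, and matching VNs for a write and its reads) can be demonstrated with a small counter-mode sketch. Since AES is not in the Python standard library, the keystream here is derived with SHA-256 purely as a stand-in; the key, address, and counter values are made up for illustration.

```python
import hashlib


def keystream(key: bytes, addr: int, vn: int, length: int) -> bytes:
    """Stand-in for AES counter mode: hash (key, address, VN, block index)."""
    out = b""
    block = 0
    while len(out) < length:
        seed = (key + addr.to_bytes(8, "big") + vn.to_bytes(8, "big")
                + block.to_bytes(4, "big"))
        out += hashlib.sha256(seed).digest()
        block += 1
    return out[:length]


def xor_with_pad(key: bytes, addr: int, vn: int, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR-ing with the pad for (addr, vn)."""
    pad = keystream(key, addr, vn, len(data))
    return bytes(a ^ b for a, b in zip(data, pad))


key = b"demo-key"                # hypothetical key
features = b"layer-3 output"
vn_written = (7 << 8) | 3        # CTRI = 7, vID = 3; vIDs 1 and 2 were skipped
ciphertext = xor_with_pad(key, 0x1000, vn_written, features)

# A read with the same VN recovers the data even though earlier vIDs were skipped.
recovered = xor_with_pad(key, 0x1000, vn_written, ciphertext)
# A read with a different VN yields garbage.
wrong = xor_with_pad(key, 0x1000, (7 << 8) | 2, ciphertext)
```

Skipping a vID only leaves a gap in the sequence of pads; it never causes two writes to share a pad.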
To verify the functionality of applying MgX on networks
with both static and dynamic pruning, we implemented a
variety of pruning techniques in PyTorch and emulated the
MgX encryption in software using the proposed version gen-
eration approach. The PyTorch implementation follows the
same data movement strategies in DNN accelerators where
the partial results of features are kept on-chip to maximize
the locality and the final results of features are stored in the
off-chip memory. For pixel-level dynamic pruning, we imple-
mented different zero compression techniques such as Com-
pressed Sparse Row (CSR) [4], Compressed Sparse Column
(CSC) [20, 51], and Run-Length Compression (RLC) [7, 41].
We also tested a threshold-based channel-level dynamic prun-
ing scheme similar to [15], where the entire feature channel
is pruned if over 90% of the features are zero. Algorithm 3
shows the pseudo-code of a DNN layer with dynamic pixel-level
pruning, where the feature objects are stored in CSR
format. The size of the feature object is input-dependent and
only determined at run time. The feature object consists of
three data structures — value (Vx), row pointer (Rx), and col-
umn index (Cx). Vx, Rx, and Cx all share the version number
of the corresponding feature object. The pseudo-code shows
that the version numbers of the sparse features in CSR format
only depend on the layer number and the MgX state even
though memory accesses are dynamically determined.
3.5 Hardware Implementation
Compared to the traditional memory protection scheme,
which requires an on-chip version number cache to reduce the
number of off-chip version number accesses, MgX requires
much less on-chip hardware resources. In addition to the
encryption and integrity verification engine, MgX only needs
two on-chip registers to store CTRI and CTRW . The value
of the vID is the layer ID, which can be extracted from the
on-chip control unit state. CTRI is incremented when receiv-
ing new inference or training data. CTRW is incremented
when loading a new model during inference or updating the
weights during training. The integrity verification engine can
be programmed to calculate the MAC of the features and
weights at the granularity of k bytes, where k is the maximum
Algorithm 3: The pseudo-code of a DNN layer l with dynamic pixel-level pruning using the compressed sparse row format. m, i, and l are the DNN model number, the input number, and the layer number, respectively. To simplify the code, we only show the inner loops for computing one output channel. SpVV is the sparse vector-vector product operation. LDMgX and STMgX are defined in Algorithm 2.
Input: input feature x_l with c_l channels, weight kernel w_l with c_l channels, row pointers pr of the c_l feature channels, column indices ic of the c_l feature channels, and SMgX = m || i
Output: one channel of the output feature x_{l+1}
for j = 0; j < c_l; j++ do
    w = LDMgX(w_l[j].ptr, w_l[j].size, getVNW(SMgX));
    Vx = LDMgX(x_l[j].ptr, x_l[j].size, getVNF(SMgX, l));
    Rx = LDMgX(pr[j].ptr, pr[j].size, getVNF(SMgX, l));
    Cx = LDMgX(ic[j].ptr, ic[j].size, getVNF(SMgX, l));
    y += SpVV(Vx, Rx, Cx, w);
end
x_{l+1} = ReLU(y);
for k = 0; k < x_{l+1}.size; k++ do
    if x_{l+1}[k] > 0 then
        STMgX(x_{l+1}[k], x_{l+1}[k].ptr, 1, getVNF(SMgX, l+1));
    end
end
Figure 8: Timing diagram of a conv layer in CHaiDNN, show-
ing double buffering for features and weights.
common divisor of the number of bytes written to and fetched
from the off-chip memory at a time. Because the MACs are
checked infrequently at a coarse granularity, we found that
MgX is efficient enough even without an on-chip VN and
MAC cache to exploit locality.
4. EXPERIMENTAL RESULTS
4.1 Methodology
DNN Accelerator – We use a publicly available acceler-
ator from Xilinx to perform an experimental evaluation of
MgX and secure DNN acceleration. CHaiDNN [58] is an
open-source HLS-based DNN inference accelerator library
for Xilinx Zynq UltraScale+ MPSoC devices. This framework
can execute a complete DNN using a network definition
and model in Caffe [28] format, and supports most popular
layer types. The CHaiDNN accelerator relies on parallelism
to deliver high throughput. It consists of an array of processing
elements (PEs), which utilize double-pumped DSP
blocks (i.e., clocked at twice the normal operating frequency)
to perform multiply-accumulate operations, and is specifically
built for 6-bit/8-bit quantized operands.
Figure 9: The block diagram of the cycle-level simulator for the secure DNN accelerator. A Caffe model drives the HLS-generated RTL in the RTL simulator, which emits events to the memory protection simulator; that simulator exchanges requests and latencies with DRAMSim2, and a performance evaluation module reports the total latency and bandwidth overhead.
To hide the memory access latency for a single batch,
CHaiDNN exploits double buffering across several hierar-
chical levels. Figure 8 illustrates the double buffering in
CHaiDNN. At a high level, the memory accesses for both
input and output features are decoupled from computations.
The accelerator overlaps loading weights for the next block
with the computation for the current block. In that sense,
the memory latency for loading weights is hidden by the
computation latency or vice versa.
For our experiments, we synthesize the CHaiDNN acceler-
ator into RTL using Vivado HLS 2019.1, and use RTL simula-
tions of CHaiDNN combined with cycle-level simulations for
DRAM to evaluate performance. For the HLS synthesis, we
target a Xilinx UltraScale+ MPSoC device (XCZU9EG aboard
a Xilinx ZCU102). In all our experiments, we use the accel-
erator configuration using 1024 DSPs, generating 16 output
feature maps and 32 output pixels in parallel, operating on
the 8-bit integer data type, and targeting the frequency of
100/200 MHz for the IP/DSP blocks, respectively. For our
chosen device, CHaiDNN allocates 512×2 KB for the input
buffers, 288×2 KB for the output buffers, and 128×2 KB for
the weight buffers.
To evaluate the performance of the existing memory protec-
tion scheme [18, 38, 43, 50] and MgX, we built a cycle-level
simulator, as depicted in Figure 9. The simulator models the
three main components — accelerator, memory protection,
and off-chip memory. The CHaiDNN accelerator, which includes
the scheduler, on-chip buffer, and compute engine, is
simulated in a cycle-accurate RTL simulator. The RTL simulator
generates computation and memory events with a start
time and an event ID, where all parallel events have the same
event ID. In addition, the latency of the computation and the
address and the type (read or write) for each off-chip memory
access are included for the computation and memory events,
respectively. Then, the generated events are received by a
memory protection simulator in Python. For each memory
event, the memory protection simulator sends the original
memory requests as well as the additional meta-data accesses
required for memory encryption and integrity verification to
DRAMSim2 [44]. DRAMSim2, which simulates the
memory controller and DRAM, takes the memory requests
from the memory protection simulator and returns the memory
latency of each access. Once an event is finished, the
start time and the latency of the event are forwarded to a
performance evaluation module to calculate the total DNN
execution time and the bandwidth overhead of the memory
protection schemes.
For MgX, the VNs and MACs of the data are 64-bit. For the
baseline protection scheme, we implement a multi-level 8-ary
Merkle tree proposed by the recent work [18] to represent the
state-of-the-art. Note that this baseline uses version numbers
in each level of the Merkle tree instead of only using MACs
as in the traditional scheme shown in Figure 2(a). The root
of the Merkle tree is stored on-chip whereas the rest of the
tree is stored in DRAM. Following the previous work [18],
the VNs and MACs of the VNs in the baseline protection
scheme are 56-bit each, which allows storing eight VNs and
their associated MAC within one cache line. The baseline
protection works at a 64-byte granularity, which is common in
secure processors [14,50,53]; one version number is assigned
for a 64-byte data block and a MAC is calculated for a 64-byte
block. The baseline protection scheme also includes a 4-KB
on-chip cache for VNs and MACs. The VN/MAC cache uses
the LRU replacement policy along with write-back and write-
allocate policies. To study the ideal case for the baseline
protection, we simulate a fully-associative cache for VNs and
MACs. For the off-chip memory access, we simulate one,
two, and four 64-bit DDR3 channels at 666MHz, where each
channel contains one rank of eight banks. The total capacity
of the simulated DRAM is 8GB. The DRAM parameters are
verified against Verilog timing models from Micron.
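The packing claim for the baseline (eight 56-bit VNs plus their 56-bit MAC in one cache line) can be checked with simple arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative.

```python
VN_BITS = 56              # baseline version number width
MAC_BITS = 56             # MAC over the eight VNs
VNS_PER_LINE = 8          # 8-ary tree: eight VNs per node
CACHE_LINE_BITS = 64 * 8  # 64-byte cache line

used_bits = VNS_PER_LINE * VN_BITS + MAC_BITS
spare_bits = CACHE_LINE_BITS - used_bits

# Number of 64-byte lines held by the 4-KB VN/MAC cache.
cache_lines = 4 * 1024 // 64
```

With these widths, one tree node occupies 504 of the 512 bits in a line, which is why 56-bit fields are a natural fit for an 8-ary tree.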
Benchmarks – We evaluate the performance and band-
width overhead of the existing protection and MgX on a
variety of neural network architectures — LeNet, AlexNet,
GoogleNet, and ResNet-50. For the DNN inference, we di-
rectly leverage the CHaiDNN accelerator framework as it can
run the DNN inference on different networks.
CHaiDNN does not support DNN training yet, and we
could not find any open-source DNN training accelerator. To
get a rough estimate of the performance of DNN training,
we approximate the backpropagation based on the forward-
propagation (inference) in CHaiDNN. During the inference,
we obtain the output features of layer i ( fi) using the input
features ( fi−1) and the weights (wi), where fi = wi ∗ fi−1.
Similarly, we compute the gradients with respect to the input
features (gi−1) using the gradients with respect to the output
features (gi) and weights (wi), where gi−1 = wi ∗gi. As the gi
and gi−1 have the same shape as fi and fi−1, we can approx-
imate the backpropagation with the inference. The weight
update in the backpropagation is not emulated as no similar
operations can be found during the inference.
4.2 Memory Traffic Increase
As the DNN execution is memory intensive, the through-
put of a DNN accelerator can be limited by the memory
bandwidth (memory bound). Therefore, we first compare
the memory traffic increase of the secure DNN accelerator
with the existing memory protection scheme and MgX. The
memory traffic increase is defined as the ratio between the
total number of memory accesses with and without memory
protection. We refer to the accelerator without protection as
no protection (NP) and with the existing memory protection
scheme as baseline protection (BP), respectively.
Figure 10 compares the memory traffic increase of the
baseline protection and MgX for the DNN inference and
training on different networks. The baseline protection intro-
duces 29.0% and 33.9% more memory accesses on average
for inference and training, respectively. The memory traffic
increase of the DNN training is larger than the inference as
the training process accesses more data and has more fre-
quent cache evictions in the VN/MAC cache. MgX incurs at
most 1.2% additional memory accesses, demonstrating the
Figure 10: The memory traffic increase of the DNN inference (a) and training (b) on different network models (LeNet, AlexNet, GoogleNet, ResNet, and their geometric mean) for NP, MgX, and BP. We use two 64-bit DDR channels for all accelerators. For the baseline protection scheme, we use an 8-ary Merkle tree which covers 128 MB of encrypted memory and a 4-KB fully-associative VN/MAC cache.
Figure 11: The impact on the memory traffic increase of the baseline protection scheme with different architectural parameters: (a) cache size (1 KB, 2 KB, and 4 KB) and (b) memory size (128 MB, 1 GB, and 8 GB). We fix the number of DDR channels to be two and choose different sizes for caches and the encrypted memory region (i.e., the number of levels in the Merkle tree).
advantage of MgX over the baseline protection scheme. The
average memory traffic increases of MgX are 0.8% and 0.2%
for inference and training, respectively. MgX has almost no
increase in off-chip memory accesses because it does not
require any version numbers to be stored in off-chip memory
and also uses each MAC to protect 1 KB of data, as features and
weights are accessed at a granularity of several KBs.
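A back-of-the-envelope estimate shows why coarse-grained MACs keep the traffic increase small. The sketch below assumes the 64-bit MAC width and the 1-KB protection granularity stated for MgX, and counts only MAC bytes relative to data bytes; it ignores alignment and burst effects, so it is an approximation rather than the simulator's result.

```python
MAC_BYTES = 8            # 64-bit MAC, as stated for MgX
MGX_BLOCK_BYTES = 1024   # one MAC protects 1 KB of data

# Extra bytes transferred per data byte when every 1-KB block carries one MAC.
mgx_mac_overhead = MAC_BYTES / MGX_BLOCK_BYTES
overhead_percent = round(100 * mgx_mac_overhead, 2)
```

The resulting figure of roughly 0.78% is in line with the sub-1% averages reported above.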
As discussed in Section 3.5, MgX has the advantage that it
requires minimal hardware changes (e.g., no on-chip cache).
However, as the baseline protection scheme needs an on-chip
cache and a Merkle tree, we further explore different architectural
parameters in the baseline protection scheme. Specifically,
we choose three different cache sizes (1KB, 2KB, and
4KB) and three different depths of the Merkle tree (four, five,
and six). Four-level, five-level, and six-level Merkle trees can
cover 128MB, 1GB, and 8GB of memory, respectively.
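The relationship between the swept parameters can be sketched as follows. The code takes the stated four-level coverage (128 MB) as given and only illustrates that each additional level of an 8-ary tree multiplies the coverage by eight; the cache sizes are the ones shown in Figure 11, and the 64-byte entry size is assumed from the baseline's cache-line granularity.

```python
ARITY = 8          # 8-ary Merkle tree
LINE_BYTES = 64    # assumed size of one VN/MAC cache entry
MB = 1 << 20
GB = 1 << 30

# Coverage of a four-level tree as stated in the text; deeper trees scale by the arity.
coverage = {4: 128 * MB}
for depth in (5, 6):
    coverage[depth] = coverage[depth - 1] * ARITY

# Number of entries for each cache size (in KB) shown in Figure 11.
cache_entries = {kb: kb * 1024 // LINE_BYTES for kb in (1, 2, 4)}
```

The entry counts match the 16 and 32 entries quoted for the 1-KB and 2-KB caches.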
As depicted in Figure 11(a), increasing the cache size
from 1KB (16 entries) to 2KB (32 entries) helps reduce
the additional memory traffic. If the VN/MAC cache is too
small, the top-level version numbers in the Merkle tree will
be evicted before the low-level version numbers since they
are accessed less frequently. However, if we further increase
the size of the cache, the additional benefit decreases as
the spatial locality of most version numbers has already
been exploited. Because the DNN accelerator has a largely
streaming memory access pattern, there is not much benefit
in increasing the VN/MAC cache unless it is big enough to
leverage temporal locality across layers. In Figure 11(b), we
show the memory traffic increase of the baseline protection
scheme as the size of the protected memory varies, which
is equivalent to varying the number of levels in the Merkle
trees. For a deeper Merkle tree, more version numbers are
accessed for the memory encryption and integrity verification.
However, as the version numbers are held in the on-chip
Table 2: The execution time of the baseline DNN accelerator for inference and training on four different networks. The execution time is reported in milliseconds.
Mode      | # of DDR channels | LeNet | AlexNet | GoogleNet | ResNet
Inference | 1                 | 0.208 | 33.723  | 95.184    | 227.151
Inference | 2                 | 0.159 | 16.542  | 51.621    | 131.883
Inference | 4                 | 0.147 | 14.161  | 46.103    | 118.128
Training  | 2                 | 0.652 | 66.792  | 214.734   | 535.853
Figure 12: The total execution time of the DNN inference (a) and training (b) on different network models (LeNet, AlexNet, GoogleNet, ResNet, and their geometric mean) for NP, MgX, and BP, shown as normalized execution time. We use two 64-bit DDR channels for all accelerators. For the baseline protection scheme, we use a four-level 8-ary Merkle tree and a 4KB cache.
cache, we find that varying the protected memory size only
has a minor impact on the memory traffic.
4.3 Performance Overhead
Table 2 shows the execution time of the baseline accelerator
without memory protection for DNN training and
inference on different networks. In the following study, for
both inference and training, we normalize the execution time
of all other accelerators to the execution time of the baseline
accelerator with two DDR channels.
Figure 12 shows the normalized execution time of the
baseline protection scheme and MgX. The baseline protection
scheme is 1.15× and 1.24× slower than the accelerator
without memory protection for inference and training, respectively. MgX achieves near-zero performance
overhead compared to the baseline accelerator. The
performance overhead of MgX is less than 1% for both train-
ing and inference on all benchmarks. It is worth mentioning
that the performance overhead is smaller than the memory
traffic increase. The reasons are twofold: 1) some DNN lay-
ers are compute-bound and the increased memory traffic does
not increase latency of such layers; 2) the memory writes can
execute in background and may not affect the performance.
Figure 13(a) and Figure 13(b) show the impact of varying
architecture parameters for the baseline protection. The per-
formance difference of having different cache and memory
sizes is less than 1.2% and 0.2%, respectively. Last, we com-
pare the overhead of memory protection when the number
of DDR channels varies. Figure 14 shows the normalized
execution time with a varying number of DDR channels. The
performance of MgX is almost the same as that of the baseline
accelerator for all configurations. For the baseline protection
scheme, however, the performance overhead is significantly
higher with only one DDR channel as the accelerator be-
comes more memory-bound. The execution time overhead
is reduced with more DDR channels as more DNN layers
become compute-bound with higher memory bandwidth.
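The normalization used in this comparison can be reproduced for the unprotected accelerator directly from the inference rows of Table 2. The sketch below copies those execution times and divides by the two-channel baseline; the dictionary layout and function name are illustrative choices.

```python
# Inference execution time in milliseconds without protection, copied from Table 2.
inference_ms = {
    1: {"LeNet": 0.208, "AlexNet": 33.723, "GoogleNet": 95.184, "ResNet": 227.151},
    2: {"LeNet": 0.159, "AlexNet": 16.542, "GoogleNet": 51.621, "ResNet": 131.883},
    4: {"LeNet": 0.147, "AlexNet": 14.161, "GoogleNet": 46.103, "ResNet": 118.128},
}


def normalized(channels: int) -> dict:
    """Execution time relative to the two-channel, unprotected accelerator."""
    base = inference_ms[2]
    return {net: round(t / base[net], 2)
            for net, t in inference_ms[channels].items()}


one_channel = normalized(1)
four_channels = normalized(4)
```

Halving the bandwidth roughly doubles the AlexNet time, while doubling it helps by only about 10 to 14%, which suggests that most layers are already compute-bound at two channels.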
4.4 Area Overhead
Both the encryption and integrity verification engines can
Figure 13: The impact on the total execution time of the baseline protection scheme with different architectural parameters: (a) cache size (1 KB, 2 KB, and 4 KB) and (b) memory size (128 MB, 1 GB, and 8 GB), shown as normalized execution time. We fix the number of DDR channels to be two and choose different sizes for caches and the encrypted memory region.
Figure 14: The execution time of the DNN inference with one, two, and four DDR channels (normalized execution time).
Configuration | LeNet | AlexNet | GoogleNet | ResNet
NP-DDR1       | 1.31  | 2.04    | 1.84      | 1.72
NP-DDR2       | 1.00  | 1.00    | 1.00      | 1.00
NP-DDR4       | 0.92  | 0.86    | 0.89      | 0.90
MgX-DDR1      | 1.33  | 2.07    | 1.88      | 1.75
MgX-DDR2      | 1.01  | 1.01    | 1.01      | 1.01
MgX-DDR4      | 0.93  | 0.86    | 0.90      | 0.90
BP-DDR1       | 1.63  | 2.93    | 2.43      | 2.13
BP-DDR2       | 1.05  | 1.20    | 1.17      | 1.16
BP-DDR4       | 0.94  | 0.89    | 0.92      | 0.93
be incorporated into a DNN accelerator without incurring
high overhead in terms of area. For instance, a GMAC-
AES256 IP core from the Xilinx Vitis Security Library [57]
is able to provide upwards of 8.0 Gbps throughput using
26,430 LUTs, 18,900 FFs, and 2 BRAMs on a Xilinx Ultra-
Scale+ device. The CHaiDNN accelerator described above
uses 161,983 LUTs, 151,616 FFs, and 676 BRAMs. This
translates to 16.3%, 12.5%, and 0.3% overhead in LUTs, FFs,
and BRAMs, respectively. Because MgX does not require
on-chip caches to store VNs or MACs, the area overhead
of MgX comes primarily from the necessary encryption and
integrity verification engines.
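The quoted overhead percentages follow directly from the resource counts above, as this short calculation shows (the dictionary names are illustrative).

```python
# Resource counts quoted in the text.
engine = {"LUT": 26_430, "FF": 18_900, "BRAM": 2}        # GMAC-AES256 IP core
chaidnn = {"LUT": 161_983, "FF": 151_616, "BRAM": 676}   # CHaiDNN accelerator

# Engine resources as a percentage of the accelerator's resources.
overhead_pct = {res: round(100 * engine[res] / chaidnn[res], 1) for res in engine}
```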
5. APPLICABILITY OF MGX
We believe that MgX is also applicable to other accelera-
tors. For example, linear regression and SVM have regular
memory access patterns, similar to a one-layer neural net-
work. In this section, we show a couple of concrete examples
to demonstrate the general applicability of MgX.
5.1 H.264 Video Decoder
We studied H.264/AVC video decoding [1] as a potential
candidate for MgX memory protection. Figure 15 shows
a typical H.264 decoder architecture, which transforms an
input bitstream into video frames. The input bitstream is
typically encrypted with the standard counter mode [2]. The
decoding process outputs different kinds of frames. Whereas
I (intra-coded) frames are independent, the P (inter-predicted)
frames are calculated using previous frames as a reference. B
(bi-directional) frames use later frames as a reference, leading
to out-of-order decoding. Therefore, multiple decoded frames
are kept in off-chip memory buffers and if needed, are re-read
by the inter-prediction stage.
To study how MgX can be applied to an H.264 decoder, we
analyzed an open-source implementation [36]. This decoder
stores the decoded and reference frames in external mem-
ory, and supports the Main H.264 profile, which can have B
frames. Figure 16 shows the reference dependencies in an
Figure 15: Block diagram of a typical H.264 decoder and off-chip memory accesses. The decoder comprises a bitstream parser, residue decoding, intra-prediction, inter-prediction (with an on-chip circular buffer), a sum stage, boundary strength decoding, and a deblocking filter. DRAM holds the bitstream, the reference frames (previously decoded frames), and the decoded frame.
Figure 16: H.264 decoding, an example for the Main profile (arrowheads in the frame dependence diagram mean "predicts").
Frame type:     I B P B I B P
Frame number:   0 1 2 3 4 5 6
Decoding order: 0 2 1 4 3 6 5
example sequence of repeated IBPB frames.
The decoder writes an output frame to an available buffer
in external memory, but writes only once to an address in each
frame. When a frame is used as a reference, it is read-only.
Thus we can simply use the frame number (F) concatenated
with the input bitstream number (CTRI) as the version num-
ber when writing an output frame. CTRI is a part of the
MgX state, and is incremented when a new video bitstream
is loaded for decoding.
getVN(SMgX, F) = { return (SMgX.CTRI || F); }
updateS(SMgX) = { SMgX.CTRI++; }
The inter-prediction block can generate the version number
for reading previously decoded frames based on the current
frame number (F). For the IBPB sequence in Figure 16,
a P frame reads only from the last I frame – hence need-
ing to call getVN(SMgX , F−2). Note that the frame number
represents the display order of the frames, not the order of
decoding. For decoding a B frame, frames from both direc-
tions are read; the version numbers can be obtained by calling
getVN(SMgX , F−1) and getVN(SMgX , F+1).
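The version number generation for the IBPB example can be sketched as follows. The code assumes a 32-bit frame-number field (the text does not specify a width) and hard-codes the frame types and decoding order from Figure 16; `reference_vns` is a hypothetical helper that covers only this particular pattern.

```python
F_BITS = 32  # assumed width of the frame-number field


def get_vn(ctr_i: int, frame: int) -> int:
    """CTRI || F, with the frame number in the lower bits."""
    return (ctr_i << F_BITS) | frame


def reference_vns(ctr_i: int, frame: int, frame_type: str) -> list:
    """VNs needed to read the reference frames in the IBPB pattern."""
    if frame_type == "I":
        return []
    if frame_type == "P":
        return [get_vn(ctr_i, frame - 2)]
    if frame_type == "B":
        return [get_vn(ctr_i, frame - 1), get_vn(ctr_i, frame + 1)]
    raise ValueError("unknown frame type")


frame_types = "IBPBIBP"                # indexed by frame number (display order)
decode_order = [0, 2, 1, 4, 3, 6, 5]   # from Figure 16
ctr_i = 1                              # first bitstream loaded
written = set()
all_refs_available = True
for f in decode_order:
    for vn in reference_vns(ctr_i, f, frame_types[f]):
        # Every reference must have been written earlier with the same VN.
        all_refs_available = all_refs_available and (vn in written)
    vn = get_vn(ctr_i, f)
    assert vn not in written           # each frame gets a fresh VN
    written.add(vn)
```

Because the frame number follows display order, out-of-order decoding does not disturb the mapping between a frame and its version number.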
We added the MgX encryption to the H.264 decoder and
performed an RTL simulation and verified functional correct-
ness. The memory access pattern is illustrated in Figure 17
where there are three frame buffers in memory, one for the
currently decoded frame and two for reference frames. The
blue dots indicate writes and the pink dots indicate reads.
Because the frame number (F) increments after writing each
frame, our scheme ensures that a version number is different
for each write to a memory location. While not clear from
the figure due to a limited resolution, we verified that each
location in the output buffer is written only once per frame.
The figure also shows that MgX can handle a dynamic and
irregular read pattern.
5.2 Genome Alignment Accelerators
In this study, we consider off-chip memory protection for
Darwin [54], which is an accelerator for genome assembly.
While Darwin also relies on a CPU to perform certain initial-
ization operations and control the hardware acceleration, we
assume that the CPU and its communications with Darwin
are protected separately with a secure computing technology
Figure 17: Memory access pattern of an H.264 decoder (address versus time, reads and writes). Frames are written in the order Frame 0 (I, VN = 0), Frame 2 (P, VN = 2), Frame 1 (B, VN = 1), Frame 4 (I, VN = 4), and Frame 3 (B, VN = 3); the writes are non-overlapping.
Figure 18: Memory accesses in a GACT hardware call (address versus time, reads and writes to the query, reference, and traceback pointer regions).
Figure 19: The Darwin genome co-processor. The software/CPU loads the reference, constructs the seed tables, and issues alignment instructions. D-SOFT reads seeds and seed hits from the seed pointer table (4GB) and the position table (16GB) and produces candidate positions. The GACT array reads sequences from the reference (4GB) and the query sequences (6GB) and writes traceback pointers (2GB). Both units use on-chip buffers, and all tables reside in DRAM.
(e.g., Intel SGX) and focus on protecting memory accesses
for the accelerators in this discussion.
Figure 19 shows the components and data accesses in
Darwin. Darwin consists of two hardware-accelerated parts,
D-SOFT and GACT, which use five types of data in off-
chip memory: query sequences, reference sequences, a seed-
pointer table, a position table, and traceback pointers. For a
reference-assisted assembly, the reference sequence, the seed-
pointer table, and the position table are loaded (written) into
memory once by a CPU, then only read by the accelerators.
Therefore, the version number for these three objects can
be obtained simply from a counter in the MgX state, which
increments on each new genome assembly (CTRgenome).
After initialization, the CPU loads a batch of query se-
quences into memory and runs D-SOFT and GACT on the
accelerator for each query in the batch. Again, the query and
reference sequences are only read by the accelerator. Then,
as an output, GACT writes traceback pointers sequentially
into memory for each query. For the query sequences and
traceback pointers, we can keep another counter that incre-
ments for each new query batch (CTRquery) in the MgX state,
and use (CTRgenome || CTRquery) as the version number.
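A minimal sketch of the two-counter scheme for Darwin is shown below. The class name, the 32-bit width of the query counter, and the choice not to reset CTRquery on a new genome are assumptions made for illustration; the text only specifies which counters exist and how they are concatenated.

```python
CTR_QUERY_BITS = 32  # assumed width of the query-batch counter


class DarwinMgxState:
    """On-chip MgX state holding the two counters described in the text."""

    def __init__(self):
        self.ctr_genome = 0
        self.ctr_query = 0

    def new_genome(self):
        self.ctr_genome += 1   # a new genome assembly starts

    def new_query_batch(self):
        self.ctr_query += 1    # a new batch of query sequences is loaded

    def vn_static(self) -> int:
        """VN for the reference, seed-pointer table, and position table."""
        return self.ctr_genome

    def vn_batch(self) -> int:
        """VN for query sequences and traceback pointers: CTRgenome || CTRquery."""
        return (self.ctr_genome << CTR_QUERY_BITS) | self.ctr_query


state = DarwinMgxState()
state.new_genome()
batch_vns = []
for _ in range(3):
    state.new_query_batch()
    batch_vns.append(state.vn_batch())
static_vn = state.vn_static()
```

The read-only tables keep a single version number for the whole assembly, while each query batch gets a fresh one.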
The GACT part of Darwin is available as open source. We
validated the functional correctness and the security (no reuse
of version numbers) for GACT memory accesses through
RTL simulation – see Figure 18.
6. RELATED WORK
Memory Encryption and Integrity Verification – Recent
designs for memory encryption [49, 59] propose using
counter mode and smaller version numbers to optimize
memory encryption. For integrity verification, several recent
efforts [12, 19, 43, 52] propose counter-based integrity-tree
designs to reduce the performance overhead. Morphable counters
[45] further reduce the overhead by compressing the
counters and enabling a 128-ary hash tree. Another line
of research attempts to optimize the integrity tree traversal.
Prior works [16, 31] propose storing the version numbers in
the last-level cache to exploit locality. Alternative designs
[32, 46, 56] propose to reduce the latency of integrity
verification by predicting version numbers or by using unverified
version numbers speculatively. MgX builds on these
encryption and integrity verification schemes, but
shows that the off-chip version numbers can be completely
eliminated for accelerating DNNs.
Side-channel Attacks and Protection – MgX protects
the confidentiality and the integrity of off-chip data, but does
not prevent side-channel attacks. The DNN accelerator with-
out dynamic pruning has a fixed memory access pattern and
execution time, and is secure against memory and timing side
channels. However, countermeasures for other side channels
will be necessary for secure acceleration if protection against
physical side channels is desired.
A variety of side-channel attacks have been shown to work
against DNN accelerators. Memory and timing side-channels
have been used to infer the underlying network structure of
an accelerator with encrypted weights [24, 60]. To counter
such attacks, ORAM [13, 17, 48] offers a strong security
guarantee by obfuscating memory accesses, at the cost of higher
overhead. Physical side-channel attacks on DNNs have also been
demonstrated recently. A power side-channel attack has been
used to retrieve the input image from a DNN accelerator [55].
Electromagnetic side-channel emanations have been used to
recover the entire network topology including weights, albeit
on a microcontroller-based inference engine [6].
Fully Homomorphic Encryption – MgX provides hard-
ware memory protection for secure accelerators in an un-
trusted environment. Alternatively, fully homomorphic en-
cryption (FHE) can provide much stronger protection by per-
forming all computations in an encrypted format. Users send
encrypted data and receive encrypted results with an expec-
tation that adversaries cannot obtain decrypted data. While
fully homomorphic encryption algorithms provide strong
cryptographic guarantees without trusting any remote hard-
ware or software, they come with significant performance
overhead [9, 35]. We believe that a secure accelerator
remains a valuable design point, offering hardware-based
security with much lower performance overhead.
7. CONCLUSION
In this paper, we propose a novel off-chip memory protec-
tion scheme for hardware accelerators, named MgX, with a
particular focus on enabling secure DNN acceleration. We
discuss the detailed implementation of MgX on DNN accel-
erators for both inference and training. Our experimental
results show that applying MgX adds less than 1% performance
overhead on multiple DNN models.
REFERENCES
[1] “Information technology – Coding of audio-visual objects – Part
10: Advanced Video Coding,” International Organization for
Standardization, Geneva, CH, Standard ISO/IEC 14496-10:2014,
2014. [Online]. Available: https://www.iso.org/standard/66069.html
[2] “Information technology – MPEG systems technologies – Part 7:
Common encryption in ISO base media file format files,” International
Organization for Standardization, Geneva, CH, Standard ISO/IEC
23001-7:2016, 2016. [Online]. Available:
https://www.iso.org/standard/68042.html
[3] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and
H. Esmaeilzadeh, “SnaPEA: Predictive Early Activation for Reducing
Computation in Deep Convolutional Neural Networks,” in Int’l Symp.
on Computer Architecture (ISCA), Jun 2018.
[4] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network
computing,” in 2016 ACM/IEEE 43rd Annual International
Symposium on Computer Architecture (ISCA), 2016, pp. 1–13.
[5] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free Deep Neural
Network Computing,” in ISCA, 2016.
[6] L. Batina, S. Bhasin, D. Jap, and S. Picek, “CSI NN: Reverse
engineering of neural network architectures through electromagnetic
side channel,” in 28th USENIX Security Symposium (USENIX Security
19). Santa Clara, CA: USENIX Association, Aug 2019, pp. 515–532.
[Online]. Available: https:
//www.usenix.org/conference/usenixsecurity19/presentation/batina
[7] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An
energy-efficient reconfigurable accelerator for deep convolutional
neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1,
pp. 127–138, 2017.
[8] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield,
T. Massengill, M. Liu et al., “Serving DNNs in Real Time at
Datacenter Scale with Project Brainwave,” IEEE Micro, vol. 38, no. 2,
pp. 8–20, 2018.
[9] N. Dowlin, R. Gilad-Bachrach, K. Laine, K. Lauter, M. Naehrig, and
J. Wernsing, “Cryptonets: Applying neural networks to encrypted data
with high throughput and accuracy,” Int’l Conf. on Machine Learning
(ICML), pp. 201–210, 2016. [Online]. Available:
http://dl.acm.org/citation.cfm?id=3045390.3045413
[10] M. J. Dworkin, “SP 800-38C. Recommendation for block cipher modes
of operation: The CCM mode for authentication and confidentiality,”
Gaithersburg, MD, USA, Tech. Rep., 2004.
[11] M. J. Dworkin, “SP 800-38D. Recommendation for block cipher modes
of operation: Galois/Counter Mode (GCM) and GMAC,” Gaithersburg,
MD, USA, Tech. Rep., 2007.
[12] R. Elbaz, D. Champagne, R. B. Lee, L. Torres, G. Sassatelli, and
P. Guillemin, “Tec-tree: A low-cost, parallelizable tree for efficient
defense against memory replay attacks,” in Cryptographic Hardware
and Embedded Systems - CHES 2007, P. Paillier and I. Verbauwhede,
Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp.
289–302.
[13] C. W. Fletcher, L. Ren, A. Kwon, M. v. Dijk, E. Stefanov,
D. Serpanos, and S. Devadas, “A low-latency, low-area hardware
oblivious ram controller,” in 2015 IEEE 23rd Annual International
Symposium on Field-Programmable Custom Computing Machines,
May 2015, pp. 215–222.
[14] C. W. Fletcher, M. v. Dijk, and S. Devadas, “A secure processor
architecture for encrypted computation on untrusted programs,” in
Proceedings of the Seventh ACM Workshop on Scalable Trusted
Computing, ser. STC ’12. New York, NY, USA: ACM, 2012, pp. 3–8.
[Online]. Available: http://doi.acm.org/10.1145/2382536.2382540
[15] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C.-Z. Xu,
“Dynamic channel pruning: Feature boosting and suppression,” in
International Conference on Learning Representations, 2019. [Online].
Available: https://openreview.net/forum?id=BJxh2j0qYm
[16] B. Gassend, G. E. Suh, D. Clarke, M. van Dijk, and S. Devadas,
“Caches and hash trees for efficient memory integrity verification,” in
The Ninth International Symposium on High-Performance Computer
Architecture, 2003. HPCA-9 2003. Proceedings., Feb 2003, pp.
295–306.
[17] O. Goldreich and R. Ostrovsky, “Software protection and simulation
on oblivious rams,” J. ACM, vol. 43, no. 3, pp. 431–473, May 1996.
[Online]. Available: http://doi.acm.org/10.1145/233551.233553
[18] S. Gueron, “Memory encryption for general-purpose processors,”
IEEE Security Privacy, vol. 14, no. 6, pp. 54–62, Nov 2016.
[19] W. E. Hall and C. S. Jutla, “Parallelizable authentication trees,” in
Proceedings of the 12th International Conference on Selected Areas in
Cryptography, ser. SAC’05. Berlin, Heidelberg: Springer-Verlag,
2006, pp. 95–109. [Online]. Available:
http://dx.doi.org/10.1007/11693383_7
[20] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural
Network,” in Int’l Symp. on Computer Architecture (ISCA), Jun 2016.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for
Image Recognition,” arXiv e-print, vol. arXiv:1512.03385, Dec 2015.
[22] Y. He, X. Zhang, and J. Sun, “Channel Pruning for Accelerating Very
Deep Neural Networks,” in Int’l Conf. on Computer Vision (ICCV),
Oct 2017.
[23] M. Henson and S. Taylor, “Memory encryption: A survey of existing
techniques,” ACM Comput. Surv., vol. 46, no. 4, pp. 53:1–53:26, Mar
2014. [Online]. Available: http://doi.acm.org/10.1145/2566673
[24] W. Hua, Z. Zhang, and G. E. Suh, “Reverse engineering convolutional
neural networks through side-channel information leaks,” in
Proceedings of the 55th Annual Design Automation Conference, ser.
DAC ’18. New York, NY, USA: ACM, 2018, pp. 4:1–4:6. [Online].
Available: http://doi.acm.org/10.1145/3195970.3196105
[25] W. Hua, Y. Zhou, C. De Sa, Z. Zhang, and G. E. Suh, “Boosting the
Performance of CNN Accelerators with Dynamic Fine-Grained
Channel Gating,” in Proceedings of the 52nd Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO ’52.
New York, NY, USA: ACM, 2019, pp. 139–150. [Online]. Available:
http://doi.acm.org/10.1145/3352460.3358283
[26] W. Hua, Y. Zhou, C. M. De Sa, Z. Zhang, and G. E. Suh, “Channel
Gating Neural Networks,” in Advances in Neural Information
Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc.,
2019, pp. 1884–1894. [Online]. Available:
http://papers.nips.cc/paper/8464-channel-gating-neural-networks.pdf
[27] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely
connected convolutional networks,” Conf. on Computer Vision and
Pattern Recognition (CVPR), vol. 1, no. 2, p. 3, 2017.
[28] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM
International Conference on Multimedia, ser. MM ’14. New York,
NY, USA: ACM, 2014, pp. 675–678. [Online]. Available:
http://doi.acm.org/10.1145/2647868.2654889
[29] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-Datacenter
Performance Analysis of a Tensor Processing Unit,” Int’l Symp. on
Computer Architecture (ISCA), 2017.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet
Classification with Deep Convolutional Neural Networks,” in
Proceedings of the 25th International Conference on Neural
Information Processing Systems - Volume 1, ser. NIPS’12. USA:
Curran Associates Inc., 2012, pp. 1097–1105. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2999134.2999257
[31] J. Lee, T. Kim, and J. Huh, “Reducing the memory bandwidth
overheads of hardware security support for multi-core processors,”
IEEE Transactions on Computers, vol. 65, no. 11, pp. 3384–3397, Nov
2016.
[32] T. S. Lehman, A. D. Hilton, and B. C. Lee, “Poisonivy: Safe
speculation for secure memory,” in 2016 49th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), Oct 2016,
pp. 1–13.
[33] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning
Filters for Efficient ConvNets,” in Int’l Conf. on Learning
Representations (ICLR), May 2017.
[34] H. Lipmaa, D. Wagner, and P. Rogaway, “Comments to NIST
concerning AES modes of operation: CTR-mode encryption,” 2000.
[35] J. Liu, M. Juuti, Y. Lu, and N. Asokan, “Oblivious Neural Network
Predictions via MiniONN Transformations,” ACM Conf. on Computer
and Communications Security (CCS), pp. 619–631, 2017. [Online].
Available: http://doi.acm.org/10.1145/3133956.3134056
[36] X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen,
“High level synthesis of complex applications: An H.264 video
decoder,” in Proceedings of the 2016 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, ser. FPGA ’16.
New York, NY, USA: Association for Computing Machinery, 2016, pp.
224–233. [Online]. Available:
https://doi.org/10.1145/2847263.2847274
[37] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning
efficient convolutional networks through network slimming,” CoRR,
vol. abs/1708.06519, 2017. [Online]. Available:
http://arxiv.org/abs/1708.06519
[38] F. McKeen, I. Alexandrovich, I. Anati, D. Caspi, S. Johnson,
R. Leslie-Hurd, and C. Rozas, “Intel Software Guard Extensions (Intel
SGX) Support for Dynamic Memory Management Inside an Enclave,”
Hardware and Architectural Support for Security and Privacy (HASP),
2016.
[39] R. C. Merkle, “Protocols for public key cryptosystems,” in 1980 IEEE
Symposium on Security and Privacy, 1980, pp. 122–122.
[40] J. Nechvatal, E. Barker, L. Bassham, W. Burr, and M. Dworkin,
“Report on the development of the advanced encryption standard (aes),”
October 2000. [Online]. Available: http://www.nist.gov/aes
[41] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An
Accelerator for Compressed-sparse Convolutional Neural Networks,”
in ISCA, 2017.
[42] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M.
Hernández-Lobato, G. Y. Wei, and D. Brooks, “Minerva: Enabling
Low-Power, Highly-Accurate Deep Neural Network Accelerators,” in
ISCA, 2016.
[43] B. Rogers, S. Chhabra, M. Prvulovic, and Y. Solihin, “Using Address
Independent Seed Encryption and Bonsai Merkle Trees to Make
Secure Processors OS- and Performance-Friendly,” in Proceedings of
the 40th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO 40. Washington, DC, USA: IEEE
Computer Society, 2007, pp. 183–196. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2007.44
[44] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A Cycle
Accurate Memory System Simulator,” IEEE Computer Architecture
Letters, 2011.
[45] G. Saileshwar, P. Nair, P. Ramrakhyani, W. Elsasser, J. Joao, and
M. Qureshi, “Morphable counters: Enabling compact integrity trees
for low-overhead secure memories,” in 2018 51st Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), Oct 2018,
pp. 416–427.
[46] W. Shi and H.-H. S. Lee, “Authentication control point and its
implications for secure processor design,” in Proceedings of the 39th Annual
IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO 39. Washington, DC, USA: IEEE Computer Society, 2006,
pp. 103–112. [Online]. Available:
https://doi.org/10.1109/MICRO.2006.11
[47] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” arXiv e-print, vol.
arXiv:1409.1556, Apr 2015.
[48] E. Stefanov, M. V. Dijk, E. Shi, T.-H. H. Chan, C. Fletcher, L. Ren,
X. Yu, and S. Devadas, “Path oram: An extremely simple oblivious
ram protocol,” J. ACM, vol. 65, no. 4, pp. 18:1–18:26, Apr 2018.
[Online]. Available: http://doi.acm.org/10.1145/3177872
[49] G. E. Suh, D. Clarke, B. Gassend, M. v. Dijk, and S. Devadas,
“Efficient memory integrity verification and encryption for secure
processors,” in Proceedings of the 36th Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO 36.
Washington, DC, USA: IEEE Computer Society, 2003, pp. 339–.
[Online]. Available: http://dl.acm.org/citation.cfm?id=956417.956575
[50] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas,
“AEGIS: Architecture for Tamper-evident and Tamper-resistant
Processing,” in Proceedings of the 17th Annual International
Conference on Supercomputing, ser. ICS ’03. New York, NY, USA:
ACM, 2003, pp. 160–171. [Online]. Available:
http://doi.acm.org/10.1145/782814.782838
[51] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient processing of deep
neural networks: A tutorial and survey,” CoRR, vol. abs/1703.09039,
2017. [Online]. Available: http://arxiv.org/abs/1703.09039
[52] M. Taassori, A. Shafiee, and R. Balasubramonian, “Vault: Reducing
paging overheads in sgx with efficient integrity verification structures,”
in Proceedings of the Twenty-Third International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’18. New York, NY, USA: ACM, 2018, pp.
665–678. [Online]. Available:
http://doi.acm.org/10.1145/3173162.3177155
[53] D. L. C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and
M. Horowitz, “Architectural support for copy and tamper resistant
software,” in Proceedings of the Ninth International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS IX. New York, NY, USA: ACM, 2000, pp.
168–177. [Online]. Available:
http://doi.acm.org/10.1145/378993.379237
[54] Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: A genomics
co-processor provides up to 15,000x acceleration on long read
assembly,” in Proceedings of the Twenty-Third International
Conference on Architectural Support for Programming Languages and
Operating Systems, ser. ASPLOS ’18. New York, NY, USA:
Association for Computing Machinery, 2018, pp. 199–213. [Online].
Available: https://doi.org/10.1145/3173162.3173193
[55] L. Wei, B. Luo, Y. Li, Y. Liu, and Q. Xu, “I know what you see: Power
side-channel attack on convolutional neural network accelerators,” in
Proceedings of the 34th Annual Computer Security Applications
Conference, ser. ACSAC ’18. New York, NY, USA: ACM, 2018, pp.
393–406. [Online]. Available:
http://doi.acm.org/10.1145/3274694.3274696
[56] Weidong Shi, H. S. Lee, M. Ghosh, Chenghuai Lu, and A. Boldyreva,
“High efficiency counter mode security architecture via prediction and
precomputation,” in 32nd International Symposium on Computer
Architecture (ISCA’05), June 2005, pp. 14–24.
[57] Xilinx, “Xilinx Vitis Security Library,”
https://xilinx.github.io/Vitis_Libraries/security/.
[58] Xilinx, “CHaiDNN-v2,” https://github.com/Xilinx/CHaiDNN, Jun
2018.
[59] C. Yan, D. Englender, M. Prvulovic, B. Rogers, and Y. Solihin,
“Improving cost, performance, and security of memory encryption and
authentication,” SIGARCH Comput. Archit. News, vol. 34, no. 2, pp.
179–190, May 2006. [Online]. Available:
http://doi.acm.org/10.1145/1150019.1136502
[60] M. Yan, C. W. Fletcher, and J. Torrellas, “Cache telepathy: Leveraging
shared resource attacks to learn DNN architectures,” CoRR, vol.
abs/1808.04761, 2018. [Online]. Available:
http://arxiv.org/abs/1808.04761
