Automated design of error-resilient and hardware-efficient
deep neural networks
Christoph Schorn1,2 · Thomas Elsken3,4 · Sebastian Vogel1,2 ·
Armin Runge1 · Andre Guntoro1 · Gerd Ascheid2
Abstract Applying deep neural networks (DNNs) in
mobile and safety-critical systems, such as autonomous
vehicles, demands a reliable and efficient execution on
hardware. Optimized dedicated hardware accelerators
are being developed to achieve this. However, the de-
sign of efficient and reliable hardware has become in-
creasingly difficult, due to the increased complexity of
modern integrated circuit technology and its sensitivity to hardware faults, such as random bit-flips.
It is thus desirable to exploit optimization potential for
error resilience and efficiency also at the algorithmic
side, e.g. by optimizing the architecture of the DNN.
Since there are numerous design choices for the ar-
chitecture of DNNs, with partially opposing effects on
the preferred characteristics (such as small error rates
at low latency), multi-objective optimization strategies
are necessary. In this paper, we develop an evolution-
ary optimization technique for the automated design of
hardware-optimized DNN architectures. For this pur-
pose, we derive a set of easily computable objective
functions, which enable the fast evaluation of DNN ar-
chitectures with respect to their hardware efficiency and
error resilience solely based on the network topology.
We observe a strong correlation between predicted error
resilience and actual measurements obtained from fault
injection simulations. Furthermore, we analyze two different quantization schemes for efficient DNN computation and find significant differences regarding their effect on error resilience.

Christoph Schorn
Christoph.Schorn@de.bosch.com
1 Department of Dependable Connected Systems, Bosch Corporate Research, Renningen, Germany
2 Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, Aachen, Germany
3 Bosch Center for Artificial Intelligence, Renningen, Germany
4 Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany
Keywords Neural Network Hardware · Error Resilience · Hardware Faults · Neural Architecture Search · Multi-Objective Optimization · AutoML
1 Introduction
The application of deep neural networks (DNNs)
in safety-critical perception systems, for example
autonomous vehicles (AVs), poses several challenges for the design of the underlying hardware platforms. On
the one hand, efficient and fast accelerators are needed,
since DNNs for computer vision exhibit massive
computational requirements [55]. On the other hand,
resilience against random hardware faults has to be
ensured. In many driving scenarios, entering a fail-safe
state is not sufficient, but fail-operational behavior
and fault tolerance are required [48]. However, fault
tolerance techniques at the hardware level often entail
large redundancy overheads in silicon area, latency,
and power consumption. These overheads stand in
contrast to the low-power and low-latency requirements
of embedded real-time DNN accelerators. Reliability
concerns in nanoscale integrated circuits, for instance
soft errors in memory and logic, represent an addi-
tional challenge for the realization of fault tolerance
mechanisms at the hardware level [2, 33, 36, 68, 83].
Moreover, techniques such as near-threshold computing
[26] and approximate computing [65] are desirable to
meet power constraints, but can further increase error
rates.
To overcome these challenges, one option is to ex-
ploit error resilience at the algorithm level and allow for
a certain degree of inaccuracy at the hardware level.
arXiv:1909.13844v1 [cs.LG] 30 Sep 2019
This is referred to as cross-layer resilience [13]. Due
to the implicit information redundancy of neural net-
works, they offer some robustness against random in-
ternal perturbations, which can be exploited in a cross-
layer resilience approach. Nevertheless, error resilience
is strongly influenced by the architectural design of the
DNN [82] as well as its internal data representations
[53]. These design choices, in turn, also influence hard-
ware efficiency and classification performance of the
network. Taking these multiple, partially opposing ob-
jectives into account in a manual DNN design procedure
is non-trivial and cumbersome.
As a solution, we develop and evaluate an efficient,
automated, multi-objective neural architecture search
(NAS) technique in this paper, which holistically takes
classification performance as well as hardware-specific
objective functions into account. In detail, our contri-
butions are the following:
1. We derive a set of objective functions for the pre-
diction of error resilience, energy consumption, la-
tency and required bandwidth of DNNs on hard-
ware, solely based on the topology of their neural
architecture, allowing a fast evaluation of these ob-
jectives by avoiding the need for expensive simula-
tions or training of the neural network.
2. We integrate these functions in an efficient, evolu-
tionary, multi-objective NAS algorithm, that uses
(approximate) network morphisms for a fast Pareto
optimization of DNNs.
3. We evaluate our methods and obtained Pareto
trade-offs on two popular image classification
benchmarks, namely CIFAR-10 and German Traffic
Sign Recognition Benchmark (GTSRB). In partic-
ular, we test the predictive performance of our fast
error resilience prediction metric by taking silent
data corruption (SDC) measurements, employing a
memory bit-flip fault injection framework.
4. We compare two recently introduced quantization
techniques for hardware-efficient DNN inference
with respect to resulting classification performance
and error resilience characteristics of the neural
networks.
To the best of our knowledge, this is the first paper
combining error resilience and hardware efficiency op-
timization in the context of neural architecture search.
The remainder of this paper is structured as follows.
In Section 2, we give an overview of related work. In
Section 3, we introduce our methodology. This includes
the derivation of hardware-specific objective functions,
neural network quantization techniques and the multi-
objective optimization algorithm used in this paper. In
Section 4, we evaluate the outcome of our methods on
two image classification benchmarks. We analyze the
trade-offs between Pareto-optimal solutions, perform
fault injections to compare predicted and measured re-
silience, and evaluate the characteristics of two different
DNN quantization methods. We close our paper with a
summary and conclusions in Section 5.
2 Background and related work
We now give an overview of related error resilience anal-
ysis (Section 2.1), resilience optimization techniques for
neural networks (Section 2.2) as well as preliminaries on
multi-objective optimization (Section 2.3) and neural
architecture search (NAS) (Section 2.4).
2.1 Neural network resilience analysis
Understanding a neural network’s resilience against er-
roneous perturbations in its internal computations has
been a topic of interest for decades already. Here, we
give an overview of the most recent studies that target
error resilience analysis of modern DNNs. An in-depth
review of previous literature has been recently given by
Torres-Huitzil and Girau [91].
2.1.1 Experimental analysis
The majority of studies on error resilience in neural net-
works has been experimental. They range from physi-
cal fault induction experiments in real hardware de-
vices [78, 96], over fault injections in (virtual) hardware
models [3, 53, 76, 78], to error simulations at the algo-
rithmic behavior level [62, 72, 80]. Behavioral analysis
can be connected to realistic hardware faults in a sec-
ond step, by mapping the effect of these faults to error
models in the algorithm domain [70]. For the model-
based analysis, stuck-at-zero, stuck-at-one and random
bit-flips of memory cells are commonly used. Stuck-at
types are used to model permanent faults (e.g. resulting
from manufacturing defects) and bit-flips are typically
used to model radiation-induced transient faults that
lead to soft errors [91].
In summary, experimental studies found different
determinants of neural network resilience, the most im-
portant being the number and type of errors, the data
representation of the neural network, the DNN type and
the location where the error occurs. However, while ex-
perimental evaluation is useful for an accurate a poste-
riori resilience determination of a given DNN on hard-
ware, it is cumbersome and provides only limited in-
sight into a priori design choices for DNN developers to
improve resilience at the algorithm level.
2.1.2 Theoretical analysis
A theory-guided resilience analysis offers the advantage
of being more directly interpretable and avoids lengthy
fault injection experiments. El Mhamdi and Guerraoui
[28] analytically derived easily computable bounds for
the forward error propagation of neurons that are stuck-
at-zero (crashed neurons) and for neurons that transmit
arbitrary values (Byzantine neurons). They found that
the choice of activation function and number of neurons
per layer are design choices that affect the forward er-
ror propagation. More precisely, an activation function
with a low Lipschitz constant as well as a high number
of neurons per layer can reduce forward error propaga-
tion.
A different analytical technique to derive neuron re-
silience prediction has been used in the context of ap-
proximate neural network computing. Backpropagation
of error gradients, comparable to the technique used to
determine weight updates during neural network train-
ing, has been used to estimate the average output sen-
sitivity to perturbations in individual neurons [93, 106].
Recently, Schorn et al. [80] showed that a technique
based on layer-wise relevance propagation (LRP) [4]
outperforms gradient-based resilience prediction. Con-
trarily to gradient methods, which determine the sensi-
tivity to small perturbations in neurons, LRP attributes
to each neuron its absolute contribution to the DNN
output [67], which can be interpreted as layerwise Tay-
lor decomposition [66]. A high neuron relevance, aver-
aged over a training set of input samples, corresponds
to a high sensitivity against errors [80].
2.2 Neural network resilience optimization
The optimization of neural network error resilience at
the algorithm level is an active field of research. A
number of publications simulate the effects of hardware
faults during neural network training to improve re-
silience [22, 102, 45, 100, 56]. Reference [22] considers
timing variations, [102, 45] static random-access mem-
ory (RAM) supply voltage scaling and [100, 56] hard
defects in memristors and resistive RAM respectively.
The drawback of these approaches is that they compli-
cate the training process, since fault injections have to
be performed by placing hardware in the training loop
or through realistic fault simulations. Common regu-
larizing techniques, such as dropout [44, 85] and weight
decay [50], also improve the general error resilience of
neurons [28].
A second approach is to adjust the mapping of the
algorithm to hardware for an optimized resilience. A
significance-driven mapping of network weight bits to
memory cells with different resilience has been sug-
gested in [84]. However, the authors did not follow an
analytical approach to determine weight resiliencies,
but relied on their experience. In contrast, the LRP-
based method in [80] gives a theoretically founded re-
silience mapping of neurons.
A third approach is to use modifications in hardware
that are tailored to exploit the algorithmic resilience
properties of neural networks. This can be zero-biased
[3] or selectively hardened [53] memory cells, optimized
data representations [96], masking techniques [71, 76],
anomaly detectors [53, 81] and relaxed versions of clas-
sical fault tolerance mechanisms, such as triple modular
redundancy (TMR) [61] and algorithm-based fault tol-
erance (ABFT) checksums [78].
Modifications of the neural architecture to increase
resilience have been proposed as well. Dias et al. [24]
suggest a resilience optimization procedure by replica-
tion of critical neurons and weights. However, they use
exhaustive simulation to determine criticality values,
which is infeasible for large-scale DNNs. Schorn et al.
[82] showed that critical layers can be identified using
LRP. Nevertheless, no automated neural architecture
design technique that jointly optimizes error resilience
as well as other desirable performance and efficiency
objectives of DNNs has been introduced so far.
2.3 Multi-objective optimization
In multi-objective optimization (see, e.g. [63]), one tries
to optimize multiple, complementary objective func-
tions f1, . . . , fk over a space N of feasible solutions (in
our case: a space of neural network architectures). Usu-
ally, there will be no N∗ ∈ N that minimizes all ob-
jectives f1, . . . , fk at the same time (as the objectives
are complementary). Instead, there are multiple Pareto-
optimal solutions meaning that one cannot reduce any
fi without increasing at least one fj . Formally, we say
that N1 dominates N2 iff fi(N1) ≤ fi(N2) for every
i ∈ {1, . . . , k} and fj(N1) < fj(N2) for at least one j.
N∗ is called Pareto-optimal iff N∗ is not dominated by
any other N ∈ N . The set of Pareto-optimal solutions
is the so-called Pareto front. Typically, multi-objective
optimization can only determine a subset P that ap-
proximates this Pareto front.
In order to rate the overall performance of a given
neural network N ∈ P across all objectives, the dis-
tance to the ideal point can be used as a metric [8]. The
(approximate) coordinates yi of the ideal point in each
objective dimension i ∈ {1, . . . , k} can be determined
by taking the componentwise minima of the objective
functions fi(N) over the (approximated) Pareto front
P [27]:

y_i = \min_{N \in P} f_i(N), \quad i \in \{1, \dots, k\}.    (1)
To enhance comparability, a normalized version of
the distance to the ideal point can be computed [8].
Therefore, the individual objective functions are first
normalized
\bar{f}_i(N) = \frac{f_i(N) - \min_{N \in P} f_i(N)}{\max_{N \in P} f_i(N) - \min_{N \in P} f_i(N)}    (2)

so that 0 ≤ \bar{f}_i(N) ≤ 1. Then, a norm on the vector \bar{f}(N) = (\bar{f}_1(N), \dots, \bar{f}_k(N))^\top \in \mathbb{R}^k is computed to measure the distance of N from the ideal point.
Blasco et al. [8] suggest taking the infinity norm for the purpose of trade-off analysis:

\| \bar{f}(N) \|_\infty = \max \{ \bar{f}_i(N) \}, \quad i \in \{1, \dots, k\}.    (3)

That way, a score between 0 and 1 is obtained, which supplies information about the worst objective value. For example, a value of 1 means that N has the worst observed performance in at least one of the objectives. We refer to ‖f̄(N)‖_∞ as the normalized worst objective value in the remainder of this paper.
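As a concrete illustration, the short sketch below (our own, written with NumPy; the function name and array layout are illustrative and not part of the paper) computes the normalized objectives of Eq. (2) and the normalized worst objective value of Eq. (3) for an approximated Pareto front:

```python
import numpy as np

def normalized_worst_objective(F):
    """F: array of shape (num_models, num_objectives) with the raw objective
    values f_i(N) of all models N in the approximated Pareto front P.
    Returns the normalized worst objective value (Eq. 3) for every model."""
    ideal = F.min(axis=0)                   # componentwise minima = ideal point (Eq. 1)
    worst = F.max(axis=0)
    F_norm = (F - ideal) / (worst - ideal)  # Eq. (2): every objective scaled to [0, 1]
    return F_norm.max(axis=1)               # infinity norm over the objectives (Eq. 3)

# Toy example with 3 models and 2 objectives: the balanced third model
# obtains the lowest (best) normalized worst objective value.
F = np.array([[1.0, 10.0],
              [5.0,  2.0],
              [2.0,  4.0]])
print(normalized_worst_objective(F))        # -> [1.0, 1.0, 0.25]
```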
2.4 Neural Architecture Search
One crucial aspect for the success of deep learning in
recent years was the design of novel neural network ar-
chitectures [35, 40, 77, 89]. However, manually design-
ing such architectures is a cumbersome trial-and-error
process. To overcome the need for architectural engi-
neering, neural architecture search (NAS) - the process
of automatically designing neural network architectures
- has arisen as a subfield of automated machine learn-
ing [41]. By now, architectures found by NAS have out-
performed human-designed architectures on a variety
of tasks such as image recognition [74], object detec-
tion [109] or dense prediction tasks [17, 75].
We briefly summarize related work here and refer to
the survey by Elsken et al. [31] for a more thorough lit-
erature overview. Reinforcement learning techniques [5,
108, 107, 109] or evolutionary methods [87, 64, 73, 74]
were employed to search for well performing architec-
tures. As early work required vast amounts of computa-
tional resources, often in the range of hundreds or even
thousands of GPU days [108, 109, 74], making NAS
more efficient was the focus of many researchers, e.g.
by employing network morphisms [9, 10, 29], by sharing
weights [79, 7, 69] or by performance prediction [6, 47].
A recent series of work [58, 101, 11, 103] employed a
real-valued relaxation of the discrete architecture search
space, enabling gradient-based optimization.
While the previously discussed approaches solely
optimize for a single objective, namely minimizing some
error rate, there has also been some work on multi-
objective neural architecture search [46, 25, 90, 60, 99,
16, 39, 30], optimizing other objectives such as network
size, latency or energy consumption concurrently. [25]
extend [57] by considering multiple objectives during
the model selection step. [60] employ NSGA-II [21],
a well known multi-objective optimization algorithm,
in the context of NAS. Instead of actually solving the
multi-objective problem, many researchers use scalar-
ization methods, such as the weighted product or sum
method [20], to obtain a single objective. This is then
optimized via, e.g. reinforcement learning [90, 39] or dif-
ferentiable NAS [99]. [12] use multi-objective Bayesian
optimization to search for convolutional cells [109]. In
this work, we build upon the multi-objective evolutionary method LEMONADE [30], which exploits cheap-to-evaluate objectives to make the search more efficient. This perfectly fits our application: as we will see later, our objectives are solely based on the neural network architecture (and not, e.g., on expensive simulations or trained neural network weights) and are thus cheap to compute. We discuss LEMONADE in more detail in Section 3.2.1.
3 Hardware-focused neural architecture design
In this section, we introduce our framework for the
automated design of error-resilient and hardware-
efficient DNN architectures. In a first step, we identify
optimization goals that typically appear in embedded
DNN hardware applications and derive corresponding
objective functions (Section 3.1). In the further course
of this paper, these functions serve as input to a
multi-objective neural architecture search algorithm
(Section 3.2). Fixed-point quantization is applied as
post-processing step after NAS to enable efficient
DNN execution on dedicated hardware accelerators
(Section 3.3).
3.1 Hardware-specific objectives
We consider four different objectives that are commonly
desirable in embedded DNN hardware applications,
namely high error resilience, low latency, high energy
efficiency and a low bandwidth requirement.
3.1.1 Error resilience
In the context of this paper, error resilience is regarded
as robustness of the neural network classifier against
perturbations in its neuron activation values. Such per-
turbations can be the result of random hardware faults,
such as radiation-induced bit-flips. We measure the de-
gree of perturbation using bit error rate (BER), i.e.
the fraction of flipped bits across all activations of the
DNN. We define architecture sensitivity at a given BER
as the probability that the predicted class output differs with and without bit errors. In order to maximize error
resilience, we want to minimize architecture sensitivity.
Following the approach in [80] and [82], we derive
an architecture-dependent error sensitivity metric us-
ing LRP. A key prerequisite in the mathematical frame-
work of LRP is the relevance conservation principle [67].
It ensures that the total amount of neuron relevance,
which is propagated backwards through the network af-
ter the forward pass of inference on an input sample,
is conserved in each layer. Consequently, for a group of
neurons k and its inputs j,

\sum_j r_j = \sum_j \sum_k r_{j \leftarrow k} = \sum_k \sum_j r_{j \leftarrow k} = \sum_k r_k,    (4)

where r_j and r_k are the relevance values attributed to neurons j and k, respectively, and r_{j \leftarrow k} is the amount of relevance propagated backwards from neuron k to
neuron j. The conservation principle is motivated by
the fact that an output activation of neuron k can be
completely decomposed into contributions of its input
neurons j.
The relevance distribution among the neurons in
each layer depends on their activations and the synap-
tic weights [67]. For the initial backpropagation step,
the final output neuron relevance of the DNN is prede-
termined by the one-hot encoded target vector belong-
ing to each input sample. This ensures that \sum_k r_k = 1 in each layer. Consequently, for a uniformly randomly drawn neuron in layer l, the expected relevance is

\mathbb{E}\left[ r_k^{(l)} \right] = \frac{1}{n_{\mathrm{outputs}}^{(l)}}, \quad k \sim \mathrm{unif}\{0, n_{\mathrm{outputs}}^{(l)} - 1\},    (5)

where n_{\mathrm{outputs}}^{(l)} is the total number of neurons in that
layer. The observation that a higher average relevance
corresponds to a more likely change of the DNN clas-
sification output suggests that layers with few neurons
are more sensitive to errors [80, 82].
Effect of max-pooling. Max-pooling is commonly used
in some layers of a DNN, in order to reduce the output
dimensions of that layer [51]. A max-pooling stage di-
vides the outputs of a layer into subsets and selects the
maximum output value out of each subset. We do not
regard max-pooling as a separate layer, but consider it
as attachment to a layer. If a layer l has max-pooling,
the reduced number of output values after the pooling
stage is taken to calculate n_{\mathrm{outputs}}^{(l)}.
Additionally, we observed an increased error sensi-
tivity of neurons in layer l if max-pooling is present in
the subsequent layer l + 1. We suppose that this is be-
cause information about the input sample is reduced by
the pooling stage, but critical errors, which are mostly
changes from a low to a high activation value [53], are
likely to propagate through. Thus, we obtain an effec-
tive error sensitivity of neurons in layer l by multiplica-
tion with the pooling factor of layer l+ 1. The pooling
factor is the fraction of input to output dimension of the
pooling stage and equals 4 for the max-pooling layers
that we use throughout our experiments.
Effect of merge layers. Skip connections, i.e. concurrent
paths through the network, can improve the training of
deep architectures and thus have become popular in
state-of-the-art DNNs [34]. At some point in the net-
work, the parallel paths have to be merged again, which
can be done by componentwise addition [35] or by fea-
ture concatenation [89]. While a concatenation does not
affect error propagation, an add layer increases error
sensitivity of the DNN. There are two reasons for this.
Firstly, an add layer involves additional (error prone)
load, accumulate and store operations, while concate-
nation only involves the change of the address range
from which data is loaded in the subsequent layers.
Secondly, the fraction of neurons affected by errors
is likely to increase through the add operation. If two
inputs with an equal and small fraction of erroneous
neurons are added, the resulting fraction of erroneous
neurons doubles as long as the error locations in the in-
puts do not coincide. This can be regarded as doubling
the effective error sensitivity of the neurons preceding
the add operation.
Architecture sensitivity index. The aforementioned in-
sights are now used to define a metric that estimates
the error sensitivity of a neural network N solely based
on the topology of its architecture. We call this metric
architecture sensitivity index (ASI). It is defined as sum
of the expected error sensitivities over all layers LN of
N,

f_{\mathrm{ASI}}(N) = \sum_{l \in L_N} \frac{\lambda^{(l)} \zeta^{(l)}}{n_{\mathrm{outputs}}^{(l)}},    (6)

where \lambda^{(l)} is the max-pooling factor of the succeeding layer l+1 (i.e. 1 for no pooling) and \zeta^{(l)} is 2 if l is connected to an add layer, and 1 otherwise.
Concatenations are not counted as extra layers in
this sum, while add layers are. Furthermore, supportive
functionalities, such as activation function, pooling and
batch normalization [42] are not considered as separate
layers, but included in the neuron layers.
We want to emphasize that fASI can be computed
very easily, since it only depends on the network topol-
ogy and does not require any training or other expensive
computations.
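To make the computation of Eq. (6) concrete, the following minimal sketch evaluates f_ASI from a purely topological layer description (the dictionary-based format is merely illustrative, not the exact interface of our implementation):

```python
def architecture_sensitivity_index(layers):
    """Compute f_ASI (Eq. 6) from a topological layer description.

    Each entry describes one neuron layer l with:
      n_outputs         -- number of output neurons (after pooling, if attached)
      pool_factor_next  -- max-pooling factor lambda of the succeeding layer l+1 (1 if none)
      feeds_add_layer   -- True if l is connected to an add layer (zeta = 2)
    """
    asi = 0.0
    for layer in layers:
        lam = layer.get("pool_factor_next", 1)
        zeta = 2.0 if layer.get("feeds_add_layer", False) else 1.0
        asi += lam * zeta / layer["n_outputs"]
    return asi

# Toy example: two convolutional layers; the second one has 2x2 max-pooling
# attached (its n_outputs is counted after pooling), so the sensitivity of the
# first layer is scaled by the pooling factor 4.
layers = [
    {"n_outputs": 32 * 32 * 16, "pool_factor_next": 4},
    {"n_outputs": 16 * 16 * 32, "pool_factor_next": 1},
]
print(architecture_sensitivity_index(layers))
```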
3.1.2 Latency
Aside from error resilience, real-time inference with low
latency is an additional necessity in many applications.
AVs, for instance, should be able to derive driving ac-
tions from sensory input in less than 100 ms, in order to
surpass human-level perception performance and pro-
vide a sufficient level of safety [55]. While low latency
can be achieved by employing a parallelized hardware
architecture and a high operating frequency, the perfor-
mance of a DNN accelerator is constrained by manu-
facturing, power consumption, reliability, and flexibility
requirements. Thus, a reduction of computational com-
plexity at the algorithm level is desirable.
The roofline model [98] is commonly used to
describe the attainable computational performance
of a DNN accelerator [104]. It defines two opera-
tional domains, which are entered depending on the
computational workload of the accelerator. In the
memory-bound domain, latency is determined by the
amount of data transfer to memory and the available
memory bandwidth. In the compute-bound domain,
latency can be regarded as being proportional to the
number of operations required by the algorithm.
Being compute-bound is preferable over memory-
bound operation, since it allows maximum utilization
of the available computational resources and highest
throughput. Thus, we assume an accelerator, whose
memory bandwidth is sufficiently large so that it will
predominantly operate in the compute-bound domain
for the workloads considered in this paper. We can
therefore take the number of operations of the DNN
as approximate determinant of latency. Furthermore,
we regard the number of operations as being solely de-
pendent on the neural network architecture, i.e. we do
not consider any data-dependent operation reductions.
Our objective function for latency reduction is given
by
f_{\mathrm{latency}}(N) = \sum_{l \in L_N} n_{\mathrm{op}}^{(l)},    (7)

where n_{\mathrm{op}}^{(l)} counts the number of operations of layer l.
3.1.3 Energy efficiency
A further frequent demand on embedded DNN accel-
erators is a low energy consumption per classification
inference. This can have mainly two reasons. Firstly,
mobile devices have a limited amount of energy stor-
age capacity and thus energy-efficient DNN accelerators
are required, for example to extend the battery life and
range of AVs. Secondly, embedded devices often have a
strict size limitation, which makes it difficult to realize
the necessary heat dissipation. As the thermal leakage
power of an accelerator directly depends on the number
of classifications per second and the energy per classi-
fication, energy efficiency is desirable to enable high
classification throughput.
Energy consumption of DNN accelerators is domi-
nated by data transfers to and from memory [88]. This
is due to the large amount of parameters and interme-
diate data outputs of typical large-scale DNNs.
According to Horowitz [38], energy consumption for
off-chip dynamic RAM access is about two orders of
magnitude higher than for internal cache accesses and
arithmetic operations. While some hardware designers
increase energy efficiency by integrating huge on-chip
static RAMs in their DNN accelerator (e.g. [97]), this
approach is not feasible in every case. In this paper, we
assume an accelerator with small on-chip buffer (such
as [15]), so that a layerwise data transfer to and from
off-chip memory is necessary, which dominates energy
consumption.
Consequently, to maximize energy efficiency, our ob-
jective is to minimize data transfer to and from memory
per inference. We neglect the number of operations in
this calculation because of its limited influence on en-
ergy consumption and since it is already part of the
latency minimization objective function. To determine
the data transfer of a layer, we assume that each in-
put and weight parameter of the layer is loaded once
from external memory and each output is written back
once. Furthermore, we assume that the same bit-width
is used to represent all activations and parameters of
the network.
Our objective function for minimizing energy con-
sumption is thus given by the sum of layerwise input,
output, and parameter data word transfers over the
whole network,
f_{\mathrm{energy}}(N) = \sum_{l \in L_N} \left( n_{\mathrm{inputs}}^{(l)} + n_{\mathrm{outputs}}^{(l)} + n_{\mathrm{params}}^{(l)} \right),    (8)

where n_{\mathrm{inputs}}^{(l)} and n_{\mathrm{outputs}}^{(l)} count the number of input neurons and output neurons, respectively, and n_{\mathrm{params}}^{(l)} counts the number of parameters of layer l.
3.1.4 Bandwidth requirement
As described in Section 3.1.2, we assume the accelera-
tor for which we optimize DNN architectures to oper-
ate predominantly in the compute-bound domain of the
roofline model. In order to guarantee compute-bound
operation, the accelerator has to provide a certain max-
imum bandwidth to memory. It is desirable to keep this
bandwidth requirement within bounds to simplify the
accelerator architecture.
The required memory bandwidth can vary for the
different layers of a DNN. We employ the ratio between
data transfers and operations of a layer as estimator for
its bandwidth requirement. The intuition behind this is
that a low number of operations is related to a short
processing time of the layer and consequently a high
bandwidth is required to be able to perform the neces-
sary data movements in that given time.
We define an overall objective function to optimize
neural architectures for a low bandwidth requirement
by adding up the data-computation ratios of all layers.
Thus our objective function for minimizing the band-
width requirement is given by the accumulated data-
computation ratio (ADCR):
f_{\mathrm{ADCR}}(N) = \sum_{l \in L_N} \frac{n_{\mathrm{inputs}}^{(l)} + n_{\mathrm{outputs}}^{(l)} + n_{\mathrm{params}}^{(l)}}{n_{\mathrm{op}}^{(l)}}.    (9)
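The three efficiency objectives can be evaluated from the same kind of topological layer description as f_ASI. The sketch below (illustrative field names, not the exact implementation) computes f_latency (Eq. 7), f_energy (Eq. 8) and f_ADCR (Eq. 9):

```python
def hardware_objectives(layers):
    """Compute f_latency (Eq. 7), f_energy (Eq. 8) and f_ADCR (Eq. 9).
    Every layer entry provides the counts n_op, n_inputs, n_outputs, n_params."""
    f_latency = sum(l["n_op"] for l in layers)
    f_energy = sum(l["n_inputs"] + l["n_outputs"] + l["n_params"] for l in layers)
    f_adcr = sum((l["n_inputs"] + l["n_outputs"] + l["n_params"]) / l["n_op"]
                 for l in layers)
    return f_latency, f_energy, f_adcr

# Example: a single 3x3 convolution with 16 input and 32 output channels on a
# 32x32 feature map (one multiply-accumulate counted as two operations, which
# is one possible convention).
conv = {
    "n_inputs": 32 * 32 * 16,
    "n_outputs": 32 * 32 * 32,
    "n_params": 3 * 3 * 16 * 32,
    "n_op": 2 * 3 * 3 * 16 * 32 * 32 * 32,
}
print(hardware_objectives([conv]))
```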
3.2 Multi-objective NAS
In the following, we introduce LEMONADE, a Lamarckian
Evolutionary algorithm for Multi-Objective Neural
Architecture DEsign [30], that we will use in our later
experiments to automatically design well-performing,
error-resilient, and hardware-efficient architectures.
3.2.1 LEMONADE
LEMONADE maintains a population P of neural networks
N . This population is improved over the course of
the algorithm with respect to the multi-objective optimization problem min_{N ∈ N} f(N), where N denotes a
suitable space of neural network architectures (see
Section 3.2.2) and the objective function
f(N) = \left( f_{\mathrm{exp}}(N), f_{\mathrm{cheap}}(N) \right)^\top \in \mathbb{R}^m \times \mathbb{R}^n    (10)

is split into expensive-to-evaluate objectives f_exp(N) ∈ R^m (in our case: the validation error, only obtainable by expensive training) and cheap-to-evaluate objectives f_cheap(N) ∈ R^n (in our case: the objectives defined in
Section 3.1). The population P is chosen to comprise
all non-dominated networks with respect to f, i.e. the
population approximates the Pareto front. LEMONADE
exploits that fcheap is cheap to evaluate in order to bias
the sampling of children towards areas of the Pareto
front of fcheap that are sparsely populated. While fcheap
is evaluated many times in LEMONADE, fexp is evaluated
only a few times for promising networks that are likely
to improve the approximation of the Pareto front.
In every iteration of LEMONADE, firstly parent net-
works are sampled with respect to some probability
distribution (discussed later) that is solely based on the
cheap objectives. By applying mutations to the parents
(such as adding or removing a layer, see Section 3.2.2
for a detailed description), children are generated. In a
second sampling stage, a subset of all generated children
is selected, again solely based on cheap objectives, and
solely this subset is evaluated on the expensive objec-
tives fexp. Lastly, LEMONADE computes the Pareto front
from the current generation and the subset of generated
children, yielding the next generation. The described
procedure is repeated for a pre-specified number of it-
erations.
The sampling distribution. The sampling distribution
is designed to only depend on the cheap objectives and
to guide the search towards sparsely crowded regions
in the current Pareto front. In order to achieve this,
LEMONADE computes a kernel density estimator pKDE
on the cheap objective values {fcheap(N)|N ∈ P} of
the current population. Then, for both sampling stages
(i.e. (i) the probability for choosing a network N as a
parent as well as (ii) the probability of a generated child
N being part of the subset), LEMONADE uses a sampling
distribution anti-proportional to p_KDE:

p(N) = \frac{c}{p_{\mathrm{KDE}}(f_{\mathrm{cheap}}(N))},    (11)
with a proper normalizing constant c. Therefore,
networks in sparsely populated regions of the Pareto
front are more likely to be chosen as parents and
generated children lying in sparsely populated regions
of the Pareto front are more likely to be evaluated
on f. The motivation behind also choosing parents in
less crowded regions is that mutations do not change
the network drastically, hence children are expected to
have similar objective values as their parents. By this
sampling distribution and the two-staged sampling
strategy, LEMONADE generates and evaluates more children that are likely to improve the current approximation of the Pareto front, rather than just evaluating the expensive objective f_exp(N) for all children, making it more efficient than off-the-shelf multi-
objective optimization algorithms. We highlight that
all objectives from Section 3.1 are cheap-to-evaluate as
they all solely depend on the neural network architec-
ture and not, e.g. on the weights of the network only
obtainable by expensive training. Hence, LEMONADE is
a perfect fit for our purpose. For more details, we refer
the reader to the original work [30].
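A minimal sketch of the density-based parent sampling of Eq. (11) could look as follows; it uses SciPy's Gaussian kernel density estimator as a stand-in for p_KDE and is not taken from the original LEMONADE implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def sample_parents(cheap_objectives, num_parents, rng=None):
    """cheap_objectives: array of shape (num_networks, num_cheap_objectives)
    holding f_cheap(N) for every network N in the current population P.
    Returns indices of the sampled parents, favouring sparsely populated
    regions of the cheap-objective space."""
    rng = rng or np.random.default_rng()
    X = np.asarray(cheap_objectives)
    kde = gaussian_kde(X.T)     # kernel density estimate over the cheap objectives
    density = kde(X.T)          # p_KDE(f_cheap(N)) for every N
    weights = 1.0 / density     # anti-proportional to the density (Eq. 11)
    weights /= weights.sum()    # normalizing constant c
    return rng.choice(len(X), size=num_parents, p=weights)
```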
3.2.2 Search space and mutations within LEMONADE
In this work, we focus on NAS for image classification
tasks. Convolutional neural networks (CNNs) are the
predominantly used type of DNN in this domain [51].
However, in recent years, the number of variations and design choices for CNN architectures has grown significantly (see e.g. [34] for an overview). We limit
the search space of LEMONADE to a number of prede-
fined building blocks, hyperparameters and allowed mu-
tations for two reasons. Firstly, support for a limited set
of building blocks requires less flexibility of the under-
lying hardware. This enables the use of more efficient
dedicated DNN accelerators instead of general purpose
hardware. Secondly, the space of feasible architectures
N rapidly grows with each additional variation that is
allowed. This combinatorial explosion slows down the
convergence of NAS, which is why a reasonable limita-
tion of the search space has to be chosen.
We now describe the set of mutations that are used
by LEMONADE in our experiments to generate child net-
works.
1. Insert a convolutional layer with batch normaliza-
tion [42] and rectified linear unit (ReLU) activation
[32]. The layer is inserted at a random position and
its number of filters is chosen to match the number
of filters of the preceding layer. The kernel height h
and width w of the convolutional filter are randomly
sampled: (h,w) ∈ {(3, 3), (5, 5), (7, 7), (9, 9)}.
2. Increase the number of filters of a randomly chosen
convolution by a randomly chosen factor ∈ {2, 4}. A
maximum of 1100 filters is allowed.
3. Add a skip connection. We allow skip connection
either by concatenation [89] or by addition [35].
4. Remove a randomly chosen layer or a skip connec-
tion.
5. Prune a randomly chosen convolutional layer (i.e.
remove 1/2 or 1/4 of its filters). A minimum of 15
filters is allowed.
6. Replace a randomly chosen convolution by a depth-
wise separable convolution [18].
Note that by random we always mean uniformly at ran-
dom. We highlight that the first three operations generally increase objectives such as the network's size or energy consumption, but likely also decrease objectives such as the error, while the last three operations generally decrease the former objectives but increase the latter. Consequently, these mutations are
suitable for multiple, opposing objectives.
To further speed up NAS, the authors of LEMONADE
propose to apply these mutations as network morphisms
[14, 95]. Network morphisms are function-preserving
operators on neural networks, i.e. a network morphism
maps a neural network N^w with weights w to another neural network Ñ^w̃ with weights w̃ so that for every input x to the network, N^w(x) = Ñ^w̃(x). Effectively
this means that, when utilizing network morphisms as
mutations to generate children, children do not need to
be trained from scratch but rather just fine-tuned as
children by design have the same error as their parent.
This can be interpreted as Lamarckian inheritance in
the context of evolutionary algorithms, where Lamar-
ckism refers to a mechanism which allows passing skills
acquired during an individual’s lifetime (e.g. by means
of learning), on to children by means of inheritance. The
equality N^w(x) = Ñ^w̃(x) can be achieved by properly choosing w̃. For example, if one wants to insert a linear
layer at an arbitrary position in a network, equality can
be achieved by simply initializing the linear layer as an
identity mapping. Mutations 1–3 from above can all be
formulated as a network morphism (see [30] for details).
Mutations 4–6, on the other hand, cannot be framed as
network morphisms, as they all generally decrease the
network’s capacity and equality cannot be guaranteed.
Instead, Elsken et al. [30] propose approximate network
morphisms to find proper initialization for these cases.
Approximate network morphisms essentially copy the
weights of layers not affected by structural changes and
train affected layers via knowledge distillation [37].
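As a simple illustration of such a function-preserving initialization (a sketch under the assumption of 'same' padding, zero bias and non-negative inputs, e.g. outputs of a preceding ReLU), a newly inserted convolution can be initialized as an identity mapping:

```python
import numpy as np

def identity_conv_kernel(channels, kernel_size=3):
    """Kernel of shape (k, k, channels, channels) that implements the identity
    mapping: a one at the spatial centre for every matching channel pair and
    zeros elsewhere."""
    w = np.zeros((kernel_size, kernel_size, channels, channels), dtype=np.float32)
    centre = kernel_size // 2
    for c in range(channels):
        w[centre, centre, c, c] = 1.0
    return w

# Inserting a convolution initialized this way (zero bias, 'same' padding) does
# not change the network function, so the child starts with its parent's error;
# a subsequent ReLU also preserves the outputs as long as they are non-negative.
```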
3.3 Fixed-point quantization
Neural network training algorithms usually rely on
data representations and computations with high
numerical precision, for example a 32-bit floating-point
format, typically used in graphics processing units
(GPUs). However, after training, a reduced-precision
number format can be used for inference on a dedicated
DNN accelerator to reduce energy consumption and
bandwidth [54]. In this context, an 8-bit fixed-point
format is a common choice in embedded and mobile
devices [43]. Hence, to deploy a DNN on an embedded
device after training on a GPU, weights, biases and
activations need to be transformed from a floating-
point to a fixed-point number format. This procedure
is denoted by network quantization. We apply network
quantization as post-processing step after neural
architecture search with LEMONADE.
To quantize a real value χ to a signed fixed-point value χ_q using B bits, we determine

\chi_q = \mathrm{clip}\left( \mathrm{round}\left( \frac{\chi}{\Delta} \right), -2^{B-1}, 2^{B-1} \right) \Delta,    (12)
where ∆ denotes the step size, i. e. the smallest dis-
tance between two quantization sampling points of χ.
In other words, ∆ corresponds to the value of the least
significant bit (LSB).
In [92], a simple method to find a suitable step size
for a given data distribution in DNNs with sigmoid ac-
tivations is introduced. It determines the step size ∆
based on the maximum range of a distribution accord-
ing to
\Delta = \frac{\max(|\chi|)}{2^{B-1} - 1}.    (13)
In the following, we refer to this quantization method as
MaxRange. However, modern DNNs commonly use un-
bounded activation functions, such as ReLU, and thus
may entail data distributions with far outliers. Since the
quantization range is adapted to the maximum value,
the step size ∆ is maximal and consequently leads to
a coarse sampling of smaller values. Moreover, as data
distributions in DNNs typically follow a Gaussian dis-
tribution, (13) leads to a coarse sampling of a large
number of values.
A quantization method which specifically targets
this problem has been introduced in [94]. Here, param-
eters and activations are quantized by minimizing the
effect of the quantization error δ = χ − χq in the net-
work. In a neural network, the output value y of a neu-
ron with a rectifying unit Φ(·), bias b, weights w and
input values x is determined by
y = \Phi\left( b + \sum w x \right).    (14)
For the purpose of measuring the influence of the quantization errors of inputs (δ_x), weights (δ_w) and biases (δ_b), we define ỹ as the resulting neuron output when quantities of (14) are quantized. More precisely, ỹ_w is defined as the neuron output determined with quantized weights w_q, where activations and biases remain in a 32-bit floating-point number format; ỹ_x and ỹ_b are defined accordingly. The step sizes Δ^{(l)} are then individually determined for each layer by

\Delta_w^{(l)} = \arg\min_{\Delta_w^{(l)}} \left| y^{(l)} - \tilde{y}_w^{(l)} \right|^2,
\Delta_x^{(l)} = \arg\min_{\Delta_x^{(l)}} \left| y^{(l)} - \tilde{y}_x^{(l)} \right|^2  and
\Delta_b^{(l)} = \arg\min_{\Delta_b^{(l)}} \left| y^{(l)} - \tilde{y}_b^{(l)} \right|^2.    (15)
We additionally constrain the step sizes to power-of-two values, i.e. Δ ∈ {2^z | z ∈ ℤ}, to enable a direct fixed-
point operation in a hardware accelerator. In the rest
of the paper, this quantization method is referred to as
minimal propagated quantization error (MinPQE).
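A minimal sketch of the fixed-point quantizer of Eq. (12) and the MaxRange step size of Eq. (13) is shown below; MinPQE would instead search, per layer, over candidate power-of-two step sizes and keep the one minimizing the output error of Eq. (15). Variable names are illustrative:

```python
import numpy as np

def quantize(x, delta, bits=8):
    """Signed fixed-point quantization of Eq. (12): scale by the step size delta,
    round, clip to the representable range and scale back.  The clip bounds follow
    Eq. (12); a two's-complement implementation would use 2**(bits-1) - 1 as the
    upper bound."""
    return np.clip(np.round(x / delta), -2 ** (bits - 1), 2 ** (bits - 1)) * delta

def max_range_step_size(x, bits=8):
    """MaxRange step size (Eq. 13): adapt the quantization range to the maximum
    absolute value of the data distribution."""
    return np.max(np.abs(x)) / (2 ** (bits - 1) - 1)

def to_power_of_two(delta):
    """Constrain a step size to the nearest power of two so that scaling can be
    realized by bit shifts in hardware."""
    return 2.0 ** np.round(np.log2(delta))
```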
4 Experiments
4.1 Experimental setup
To evaluate our methods, we use two common image
classification benchmarks. Firstly, CIFAR-10 [49] is
used, which consists of 32 × 32 pixel RGB images
divided into ten distinct classes. The samples are
divided into 50 000 training and 10 000 test samples.
Out of the training set, 5000 samples are used for
validation during neural architecture search.
Secondly, GTSRB [86] is used, which contains RGB
images of 43 different types of traffic signs. The images
of this benchmark are scaled to a resolution of 48 ×
48 pixels before they are fed into the classifier. The
dataset has 39 210 training samples, out of which 4010
are separated for classification validation. An additional
set of 12 630 images is used for measuring final test error
rates.
Unless otherwise noted, we use the same hyperpa-
rameter setup for both benchmarks. We run LEMONADE
for 300 evolutionary iterations. The algorithm is ini-
tialized with a population of 15 manually chosen trivial
network architectures with different numbers of convo-
lutional layers and kernel shapes. For DNN training,
we use stochastic gradient descent (SGD) with cosine
annealing [59], momentum of 0.9 and a weight decay of
0.0005. The learning rate for each training phase during
architecture search is initialized with 0.01. The train-
ing batch size is set to 64 throughout our experiments.
Furthermore, we apply commonly used data augmenta-
tions during training [59]. However, we leave out hori-
zontal image flips for GTSRB, since they would change
the meaning of some traffic signs. In addition, we use
mixup [105] and cutout [23] for further training data
augmentation.
The final population sizes of the CIFAR-10 and GT-
SRB models are 439 and 238, respectively. From each
of these, the 50 architectures with best validation er-
ror rates are selected and each of these is trained from
scratch on the set of training and validation images for
200 epochs. The learning rate is initialized with 0.025
in this case and all other hyperparameters stay the
same. Classification error is evaluated on the separate
test set after the training. Subsequently, we quantize
the networks’ weights and activations to an 8-bit fixed-
point representation using the MaxRange and MinPQE
methods described in Section 3.3 for further evalua-
tions.
4.2 Error simulations
Random bit-flip error simulations are used to evalu-
ate the actual resilience of the obtained set of neural
networks. For this purpose, we use the fault simula-
tion framework that has been previously described in
[82]. The framework builds upon the Keras [19] DNN
library with TensorFlow back-end [1]. This allows for
performing fast bit-level fault injections in the neuron
activation outputs (feature maps) of a CNN. Most of
the computation workload required for the simulation
can be efficiently computed on a GPU. The framework
automatically adds some operations behind each neu-
ron output stage of a given CNN, which emulate a fixed-
point format and allow for a bit-wise fault injection
in the neuron output memory by applying a definable
Boolean fault mask (see Fig. 1).
Fig. 1: Steps performed by the fault injection framework between the computation of two neural network layers [82]: the feature maps are converted from float to fixed point, XORed with a fault mask, and converted back to float, yielding feature maps with errors.
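The following sketch re-implements the basic injection step of Fig. 1 in NumPy (a simplified stand-in, not the Keras/TensorFlow framework of [82]): feature maps are converted to signed fixed-point integers, each bit is flipped with probability BER via an XOR mask, and the result is converted back to floating point:

```python
import numpy as np

def inject_bit_flips(feature_maps, delta, ber, bits=8, rng=None):
    """Emulate random memory bit-flips in quantized feature maps."""
    rng = rng or np.random.default_rng()
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    q = np.clip(np.round(feature_maps / delta), qmin, qmax).astype(np.int64)
    pattern = q & ((1 << bits) - 1)                    # two's-complement bit pattern
    mask_bits = rng.random((*q.shape, bits)) < ber     # one Bernoulli draw per bit
    mask = np.zeros_like(pattern)
    for b in range(bits):
        mask |= mask_bits[..., b].astype(np.int64) << b
    pattern ^= mask                                    # XOR with the fault mask
    q = np.where(pattern >= 2 ** (bits - 1), pattern - (1 << bits), pattern)
    return q.astype(feature_maps.dtype) * delta        # back to floating point
```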
4.3 Results
4.3.1 Trade-off analysis between objectives
Table 1 lists the properties of certain DNN architectures N obtained for both benchmarks, CIFAR-10 and GTSRB. The selected models are those that each minimize an individual objective function f_i(N) (BestASI, BestValErr, BestEfficiency and BestADCR), the model with maximum error sensitivity (WorstASI), as well as the model with the lowest normalized worst objective value ‖f̄(N)‖_∞ (see Section 2.3), i.e. the balanced optimizer of all objectives (BalOpt). The BestEfficiency models actually minimize both f_latency(N) (i.e. operations) and f_energy(N) (i.e. data transfer). This indicates a correlation between the two quantities. The respective models are also the smallest in terms of weight parameters.
It can be seen in Table 1 that choosing a DNN with
minimal cost in one objective often leads to the out-
come that at least one other objective is close to its
worst value. This is especially the case for CIFAR-10,
where ‖f̄(N)‖_∞ is 1 or close to 1 for all single-objective
optimizers, BestASI, BestValErr, BestEfficiency, and
BestADCR. The optimal trade-off models (BalOpt),
however, come quite close to the ideal point, with nor-
malized distances of 0.371 (CIFAR-10) and 0.267 (GT-
SRB).
Another aspect visible in Table 1 is that 8-bit quan-
tization does not significantly increase test set clas-
sification error rates of the models in comparison to
the 32-bit float case (in some cases the error is even
smaller after quantization). The differences between the
MaxRange and MinPQE quantization methods with re-
spect to test error rate are marginal.
The resulting distributions of objective values for
all 50 models that were selected after the optimiza-
tion with LEMONADE are shown in Fig. 2 and Fig. 3 for
CIFAR-10 and GTSRB, respectively. The sub-figures
(a)–(d) each depict the outcomes of fASI(N) versus one
of the other objective functions. It can be seen that the
WorstASI models have comparatively few operations
and data transfers. However, the reverse is not always
true, since there are models with few operations and
data transfers as well as low ASI. In other words, it is
possible to have high efficiency and high error resilience
at the same time.
Another interesting aspect visible in Fig. 2 (d) and
Fig. 3 (d) is a correlation between ADCR and ASI. Con-
sequently, a low ratio of data transfers to operations is
not only beneficial for limiting the required bandwidth
of the DNN accelerator, but also helps to reduce er-
ror sensitivity. This aspect also becomes apparent in Fig. 4. It can be seen that models with more operations
typically also require more data transfers. However, the
BestASI models have a relatively high number of oper-
ations in comparison to their data transfers, as they are
located offside the main trend in the scatter plot.
4.3.2 Evaluation of resilience prediction
We now evaluate the predictive performance of our ASI
metric by performing bit-flip fault injections using the
framework described in Section 4.2. Bit-flips are ran-
domly injected in all convolutional layer feature map
outputs (after ReLU activation and pooling, where ap-
plicable) that are written to memory. MinPQE quanti-
zation with 8 bits is used, except where otherwise speci-
fied. The value of each bit in the feature map outputs is
toggled with a probability given by a defined BER. To
get statistically meaningful results [52], random fault
Table 1: Properties of models that optimize a respective cost dimension. E.g. model BestASI (second row) denotes
the optimizer of the ASI objective. Bold numbers indicate minimal values among the selected 50 models with best
validation error. Data transfer and accumulated data-computation ratio are calculated taking 8-bit fixed-point
quantization of activations and weights into account.
Columns: (1) Architecture Sensitivity Index (×10⁻³), (2) Validation Set Error Rate (%), (3) Operations (GOp/Frame), (4) Data Transfer (MB/Frame), (5) Acc. Data-Computation Ratio (B/Op), (6) Normalized Worst Objective Value, (7) Number of Parameters (×10⁶), (8) Test Set Error Rate (%, 32b float), (9) Test Set Error Rate (%, 8b MaxRange), (10) Test Set Error Rate (%, 8b MinPQE). Columns (1)-(5) are the optimized quantities; columns (6)-(10) are other quantities.

Model           (1)    (2)   (3)    (4)    (5)     (6)    (7)    (8)   (9)   (10)

CIFAR-10
WorstASI        8.891  9.20  0.050  0.672   7.279  1.000  0.344  7.31  7.58  7.52
BestASI         0.336  9.16  0.420  2.112   1.422  0.959  1.645  6.95  6.91  6.87
BestValErr      4.267  6.52  0.186  2.381  10.230  0.996  1.489  5.48  5.33  5.41
BestEfficiency  1.750  9.18  0.049  0.665  10.264  1.000  0.337  6.54  6.68  6.61
BestADCR        0.336  9.30  0.429  2.122   1.150  0.993  1.654  6.42  6.57  6.47
BalOpt          0.970  7.56  0.127  1.668   4.241  0.371  1.330  5.72  5.66  5.63

GTSRB
WorstASI        8.120  0.45  0.045  0.478  10.218  1.000  0.101  2.53  2.66  2.64
BestASI         0.109  0.30  0.490  1.220   1.058  0.501  0.865  2.60  2.64  2.60
BestValErr      0.217  0.00  0.966  4.629   4.081  1.000  3.200  0.90  1.08  0.99
BestEfficiency  0.651  0.45  0.012  0.181   1.166  0.600  0.041  1.32  1.41  1.41
BestADCR        0.145  0.12  0.600  3.161   1.048  0.670  2.833  2.50  2.61  2.62
BalOpt          0.326  0.20  0.126  0.676   1.057  0.267  0.513  2.78  2.84  2.81
Fig. 2: Pairwise comparison of ASI with each of the other objective function outcomes for 50 Pareto-optimal architectures on CIFAR-10. The panels plot the Architecture Sensitivity Index (×10⁻²) against (a) test classification error (%) at 8-bit quantization, (b) operations (GOp/frame), (c) data transfer (MB/frame) and (d) accumulated data-computation ratio (B/Op); the BalOpt, BestValErr, BestASI and WorstASI models are highlighted.
locations are sampled n = 200 times and for each trial
the effect on the classification output of the network is
measured using the complete test set of the respective
benchmark. For this purpose, the classification change
rate (CCR), i.e. the fraction of images in the test set
that are classified differently after the fault injection, is
calculated. The sample mean of CCR over all n = 200
trials is reported. This can be interpreted as the expected probability of SDC at the given BER.
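A sketch of this measurement loop is given below; model_predict is a hypothetical interface that returns predicted class labels with bit-flips injected at the given BER (a BER of 0 disables injection):

```python
import numpy as np

def classification_change_rate(clean_labels, faulty_labels):
    """Fraction of test samples whose predicted class changes after fault injection."""
    return float(np.mean(np.asarray(clean_labels) != np.asarray(faulty_labels)))

def mean_ccr(model_predict, test_inputs, ber, num_trials=200):
    """Sample mean of the CCR over num_trials random draws of fault locations."""
    clean = model_predict(test_inputs, ber=0.0)
    ccrs = [classification_change_rate(clean, model_predict(test_inputs, ber=ber))
            for _ in range(num_trials)]
    return float(np.mean(ccrs))
```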
The results of a linear least-squares regression on the
ASI versus CCR value pairs of the 50 optimized models
for each benchmark are shown in Fig. 5. A BER of 0.003
was used for bit-flip injections. A correlation coefficient
R = 0.741 is achieved for CIFAR-10 and R = 0.898 for
GTSRB. While this indicates that the prediction is not
100% accurate, the correlation is relatively strong. This
is especially surprising, considering the fact that ASI is
completely determined by the architecture of the neural
network and does not require any cumbersome measure-
Fig. 3: Pairwise comparison of ASI with each of the other objective function outcomes for 50 Pareto-optimal architectures on GTSRB. The panels plot the Architecture Sensitivity Index (×10⁻²) against (a) test classification error (%) at 8-bit quantization, (b) operations (GOp/frame), (c) data transfer (MB/frame) and (d) accumulated data-computation ratio (B/Op); the BalOpt, BestValErr, BestASI and WorstASI models are highlighted.
Fig. 4: Data transfer (MB/frame) vs. number of operations (GOp/frame) for Pareto-optimal architectures on (a) CIFAR-10 and (b) GTSRB. BestASI models are located offside the main trend.
Fig. 5: Correlation of ASI and CCR at a BER of 0.003, with fitted regression lines, for (a) CIFAR-10 and (b) GTSRB. A correlation coefficient R = 0.741 is achieved for CIFAR-10 and R = 0.898 for GTSRB.
ments based on test data or weight parameters. Thus,
we argue that ASI is an efficient and useful metric to
guide NAS towards more resilient DNN architectures.
We also evaluate CCRs for varying BERs for a sub-
set of models. The results for CIFAR-10 and GTSRB
are plotted in Fig. 6 and Fig. 7, respectively. An ap-
proximately linear dependency between BER and CCR
can be observed at very low bit error rates. At higher BERs, a transition to rapid growth of the CCR is visible (note the log scales), and the CCR then saturates at a value corresponding to the chance probability of choosing the same label after fault injection.
Fig. 6: Resulting CCR for different obtained optimizers (WorstASI, BestValErr, BalOpt, BestASI) on CIFAR-10 over a range of BERs (both axes in log scale).
Fig. 7: Resulting CCR for different obtained optimizers (WorstASI, BestValErr, BalOpt, BestASI) on GTSRB over a range of BERs (both axes in log scale).

An interesting finding observable in Fig. 6 and Fig. 7 is that the BestValErr models exhibit an unexpectedly low CCR at low BERs, while they degrade less gracefully (much steeper increase in CCR) at high BERs. In the case of GTSRB, BestValErr is actually, despite its higher ASI, much more resilient than BestASI at low BERs. An explanation might be that a good baseline classification performance adds an extra degree of error resilience, which is not captured by ASI. The steeper increase, on the other hand, could be due to overfitting to the task (i.e. a weaker ability to generalize).
4.3.3 Comparison of quantization methods
Finally, we compare the MaxRange and MinPQE quan-
tization methods (see Section 3.3), with respect to re-
sulting CCRs after bit-flip fault injections with a BER
of 0.005. Results are shown in Fig. 8 and Fig. 9. The
models are sorted in ascending order of CCR after Min-
PQE quantization in these figures.
Fig. 8: Comparison of CCR at a bit error rate of 0.005 for CIFAR-10 models quantized with the MaxRange and MinPQE quantization methods. Models are sorted by the CCR observed with MinPQE quantization; the BestASI and WorstASI models are marked.
Fig. 9: Comparison of CCR at a bit error rate of 0.005 for GTSRB models quantized with the MaxRange and MinPQE quantization methods. Models are sorted by the CCR observed with MinPQE quantization; the BestASI and WorstASI models are marked.

It can be seen that MaxRange results in a significantly worse CCR in most cases. This can be explained by the fact that MaxRange quantizes values to a larger range, which is determined by far outliers, whereas MinPQE ignores (i.e. clips) these outliers. Consequently, MaxRange leads to a weaker signal-to-noise ratio than MinPQE in the presence of bit-flip errors. We thus argue that MinPQE is the preferable method, since it achieves both low baseline classification error rates and high error resilience.
5 Conclusions
We have introduced a method for hardware-focused and automated neural architecture design. Our proposed hardware-specific objective functions, which require only network topology information for their evaluation, enable fast design space exploration and allow the NAS algorithm to find Pareto-optimal solutions. This makes our method efficient and applicable also to classification benchmarks more complex than the ones considered in this paper. We verified the resilience prediction with memory bit-flip simulations and found it sufficiently accurate to guide our NAS algorithm towards architectural resilience optimization. Joint optimization of resilience, efficiency, and performance has not been considered in the context of NAS before. Finally, our findings on the influence of different quantization techniques on DNN error resilience highlight the importance of choosing a quantization scheme that fosters a high signal-to-noise ratio in order to limit the influence of bit-flip errors.
References
1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen
Z, Citro C, Corrado GS, Davis A, Dean J, Devin
M, Ghemawat S, Goodfellow I, Harp A, Irving
G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kud-
lur M, Levenberg J, Mane D, Monga R, Moore
S, Murray D, Olah C, Schuster M, Shlens J,
Steiner B, Sutskever I, Talwar K, Tucker P, Van-
houcke V, Vasudevan V, Viegas F, Vinyals O,
Warden P, Wattenberg M, Wicke M, Yu Y, Zheng
X (2015) Tensorflow: Large-scale machine learn-
ing on heterogeneous distributed systems. URL
https://www.tensorflow.org/
2. Aitken R, Cannon EH, Pant M, Tahoori MB
(2015) Resiliency challenges in sub-10nm tech-
nologies. In: IEEE 33rd VLSI Test Symposium
(VTS), pp 1–4
3. Azizimazreah A, Gu Y, Gu X, Chen L (2018) Tol-
erating soft errors in deep learning accelerators
with reliable on-chip memory designs. In: IEEE
International Conference on Networking, Archi-
tecture and Storage (NAS), pp 1–10
4. Bach S, Binder A, Montavon G, Klauschen F,
Müller KR, Samek W (2015) On pixel-wise ex-
planations for non-linear classifier decisions by
layer-wise relevance propagation. PLOS ONE
10(7):e0130140
5. Baker B, Gupta O, Naik N, Raskar R (2017) De-
signing neural network architectures using rein-
forcement learning. In: International Conference
on Learning Representations
6. Baker B, Gupta O, Raskar R, Naik N (2017)
Accelerating Neural Architecture Search using
Performance Prediction. In: NIPS Workshop on
Meta-Learning
7. Bender G, Kindermans PJ, Zoph B, Vasudevan V,
Le Q (2018) Understanding and simplifying one-
shot architecture search. In: International Confer-
ence on Machine Learning
8. Blasco X, Herrero JM, Sanchis J, Martínez
M (2008) A new graphical visualization of n-
dimensional Pareto front for decision-making in
multiobjective optimization. Information Sciences
178(20):3908–3924
9. Cai H, Chen T, Zhang W, Yu Y, Wang J (2018)
Efficient architecture search by network transfor-
mation. In: AAAI
10. Cai H, Yang J, Zhang W, Han S, Yu Y (2018)
Path-Level Network Transformation for Efficient
Architecture Search. In: International Conference
on Machine Learning
11. Cai H, Zhu L, Han S (2019) ProxylessNAS: Di-
rect neural architecture search on target task and
hardware. In: International Conference on Learn-
ing Representations
12. Cai L, Barneche AM, Herbout A, Sheng Foo C,
Lin J, Ramaseshan Chandrasekhar V, M Sabry
M (2018) TEA-DNN: the quest for time-energy-
accuracy co-optimized deep neural networks.
arXiv preprint
13. Carter NP, Naeimi H, Gardner DS (2010) Design
techniques for cross-layer resilience. In: Design,
Automation & Test in Europe Conference & Ex-
hibition (DATE), pp 1023–1028
14. Chen T, Goodfellow IJ, Shlens J (2016) Net2Net:
Accelerating learning via knowledge transfer. In:
International Conference on Learning Representa-
tions
15. Chen YH, Krishna T, Emer JS, Sze V (2017) Ey-
eriss: An energy-efficient reconfigurable accelera-
tor for deep convolutional neural networks. IEEE
Journal of Solid-State Circuits 52(1):127–138
16. Cheng AC, Dong JD, Hsu CH, Chang SH, Sun M,
Chang SC, Pan JY, Chen YT, Wei W, Juan DC
(2018) Searching toward Pareto-optimal device-
aware neural architectures. In: Proceedings of the
International Conference on Computer-Aided De-
sign, ICCAD ’18
17. Liu C, Chen LC, Schroff F, Adam H, Hua W, Yuille AL, Fei-Fei L (2019) Auto-DeepLab: Hi-
erarchical neural architecture search for semantic
image segmentation. In: Conference on Computer
Vision and Pattern Recognition
18. Chollet F (2017) Xception: Deep learning with
depthwise separable convolutions. In: IEEE Con-
ference on Computer Vision and Pattern Recog-
nition (CVPR), pp 1800–1807
19. Chollet F, et al. (2015) Keras. URL
https://keras.io
20. Deb K, Kalyanmoy D (2001) Multi-Objective Op-
timization Using Evolutionary Algorithms. John
Wiley & Sons, Inc., New York, NY, USA
21. Deb K, Agrawal S, Pratap A, Meyarivan T (2000)
A fast elitist non-dominated sorting genetic algo-
rithm for multi-objective optimization: NSGA-II. In:
Schoenauer M, Deb K, Rudolph G, Yao X, Lutton
E, Merelo JJ, Schwefel HP (eds) Parallel Problem
Solving from Nature PPSN VI, Springer Berlin
Heidelberg, Berlin, Heidelberg, pp 849–858
22. Deng J, Fang Y, Du Z, Wang Y, Li H, Temam
O, Ienne P, Novo D, Li X, Chen Y, Wu C (2015)
Retraining-based timing error mitigation for hard-
ware neural networks. In: Design, Automation &
Test in Europe Conference & Exhibition (DATE),
pp 593–596
23. DeVries T, Taylor GW (2017) Improved regu-
larization of convolutional neural networks with
cutout. eprint arXiv:1708.04552
24. Dias FM, Borralho R, Fontes P, Antunes A (2010)
FTSET: A software tool for fault tolerance eval-
uation and improvement. Neural Computing and
Applications 19(5):701–712
25. Dong JD, Cheng AC, Juan DC, Wei W, Sun M
(2018) DPP-Net: Device-aware progressive search
for pareto-optimal neural architectures. In: Fer-
rari V, Hebert M, Sminchisescu C, Weiss Y (eds)
Computer Vision – ECCV 2018
26. Dreslinski RG, Wieckowski M, Blaauw D,
Sylvester D, Mudge T (2010) Near-threshold com-
puting: Reclaiming Moore’s law through energy
efficient integrated circuits. Proceedings of the
IEEE 98(2):253–266
27. Ehrgott M, Tenfelde-Podehl D (2003) Computa-
tion of ideal and Nadir values and implications for
their use in MCDM methods. European Journal of
Operational Research 151(1):119–139
28. El Mhamdi EM, Guerraoui R (2017) When neu-
rons fail. In: IEEE International Parallel and
Distributed Processing Symposium (IPDPS), pp
1028–1037
29. Elsken T, Metzen JH, Hutter F (2017) Simple
And Efficient Architecture Search for Convolu-
tional Neural Networks. In: NIPS Workshop on
Meta-Learning
30. Elsken T, Metzen JH, Hutter F (2019) Effi-
cient multi-objective neural architecture search
via Lamarckian evolution. In: International Con-
ference on Learning Representations
31. Elsken T, Metzen JH, Hutter F (2019) Neural ar-
chitecture search: A survey. Journal of Machine
Learning Research 20(55):1–21
32. Glorot X, Bordes A, Bengio Y (2011) Deep sparse
rectifier neural networks. In: International Confer-
ence on Artificial Intelligence and Statistics (AIS-
TATS), vol 15
33. Gomez LB, Cappello F, Carro L, DeBardeleben
N, Fang B, Gurumurthi S, Pattabiraman K, Rech
P, Reorda MS (2014) GPGPUs: How to combine
high computational power with high reliability. In:
Design, Automation & Test in Europe Conference
& Exhibition (DATE)
34. Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai
B, Liu T, Wang X, Wang G, Cai J, Chen T (2018)
Recent advances in convolutional neural networks.
Pattern Recognition 77:354–377
35. He K, Zhang X, Ren S, Sun J (2016) Deep residual
learning for image recognition. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition
(CVPR), pp 770–778
36. Henkel J, Bauer L, Dutt N, Gupta P, Nassif S,
Shafique M, Tahoori M, Wehn N (2013) Reliable
on-chip systems in the nano-era. In: 50th Annual
Design Automation Conference (DAC), pp 695–
704
37. Hinton G, Vinyals O, Dean J (2015) Dis-
tilling the knowledge in a neural net-
work. arXiv preprint arXiv:1503.02531, URL https://arxiv.org/abs/1503.02531
38. Horowitz M (2014) Computing’s energy problem
(and what we can do about it). In: IEEE Interna-
tional Solid- State Circuits Conference (ISSCC),
pp 10–14
39. Hsu CH, Chang SH, Juan DC, Pan JY, Chen YT,
Wei W, Chang SC (2018) Monas: Multi-objective
neural architecture search. arXiv preprint
40. Huang G, Liu Z, van der Maaten L, Weinberger
KQ (2017) Densely connected convolutional net-
works. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp 4700–4708
41. Hutter F, Kotthoff L, Vanschoren J (eds)
(2019) Automated Machine Learning: Meth-
ods, Systems, Challenges. Springer, available at
http://automl.org/book.
42. Ioffe S, Szegedy C (2015) Batch normalization:
Accelerating deep network training by reducing
internal covariate shift. In: Proceedings of the
32nd International Conference on Machine Learn-
ing
43. Jacob B, Kligys S, Chen B, Zhu M, Tang M,
Howard AG, Adam H, Kalenichenko D (2018)
Quantization and training of neural networks
for efficient integer-arithmetic-only inference. In:
IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR)
44. Kerlirzin P, Vallet F (1993) Robustness in multi-
layer perceptrons. Neural Computation 5(3):473–
482
45. Kim S, Howe P, Moreau T, Alaghi A, Ceze L,
Visvesh S (2018) MATIC: Learning around errors
for efficient low-voltage neural network accelera-
tors. In: Design, Automation & Test in Europe
Conference & Exhibition (DATE)
46. Kim YH, Reddy B, Yun S, Seo C (2017) NEMO:
Neuro-evolution with multiobjective optimization
of deep neural network for speed and accuracy. In:
ICML’17 AutoML Workshop
47. Klein A, Falkner S, Springenberg JT, Hutter F
(2017) Learning curve prediction with Bayesian
neural networks. In: International Conference on
Learning Representations
48. Koopman P, Wagner M (2016) Challenges in auto-
nomous vehicle testing and validation. SAE Inter-
national Journal of Transportation Safety 4(1):15–
24
49. Krizhevsky A (2009) Learning multiple layers of
features from tiny images. Master Thesis, Univer-
sity of Toronto
50. Krogh A, Hertz JA (1991) A simple weight decay
can improve generalization. In: Advances in Neu-
ral Information Processing Systems
51. LeCun Y, Bengio Y, Hinton G (2015) Deep learn-
ing. Nature 521(7553):436–444
52. Leveugle R, Calvez A, Maistri P, Vanhauwaert P
(2009) Statistical fault injection: Quantified error
and confidence. In: Design, Automation & Test
in Europe Conference & Exhibition (DATE), pp
502–506
53. Li G, Hari SKS, Sullivan M, Tsai T, Pattabiraman
K, Emer J, Keckler SW (2017) Understanding er-
ror propagation in deep learning neural network
(DNN) accelerators and applications. In: Proceed-
ings of the International Conference for High Per-
formance Computing, Networking, Storage and
Analysis
54. Lin DD, Talathi SS, Annapureddy VS (2016)
Fixed point quantization of deep convolutional
networks. In: Proceedings of the 33rd Interna-
tional Conference on Machine Learning, vol 48,
pp 2849–2858
55. Lin SC, Zhang Y, Hsu CH, Skach M, Haque ME,
Tang L, Mars J (2018) The architectural impli-
cations of autonomous driving: Constraints and
acceleration. In: International Conference on Ar-
chitectural Support for Programming Languages
and Operating Systems, pp 751–766
56. Liu C, Hu M, Strachan JP, Li H (2017) Rescuing
memristor-based neuromorphic design with high
defects. In: 54th Annual Design Automation Con-
ference (DAC), pp 1–6
57. Liu C, Zoph B, Neumann M, Shlens J, Hua W,
Li LJ, Fei-Fei L, Yuille A, Huang J, Murphy K
(2018) Progressive Neural Architecture Search. In:
European Conference on Computer Vision
58. Liu H, Simonyan K, Yang Y (2019) DARTS: Dif-
ferentiable architecture search. In: International
Conference on Learning Representations
59. Loshchilov I, Hutter F (2017) SGDR: Stochastic
gradient descent with warm restarts. In: Inter-
national Conference on Learning Representations
(ICLR)
60. Lu Z, Whalen I, Boddeti V, Dhebar Y, Deb K,
Goodman E, Banzhaf W (2019) NSGA-net: A
multi-objective genetic algorithm for neural archi-
tecture search
61. Mahdiani HR, Fakhraie SM, Lucas C (2012) Re-
laxed fault-tolerant hardware implementation of
neural networks in the presence of multiple tran-
sient errors. IEEE Transactions on Neural Net-
works and Learning Systems 23(8):1215–1228
62. Marques J, Andrade J, Falcao G (2017) Unreliable
memory operation on a convolutional neural net-
work processor. In: IEEE International Workshop
on Signal Processing Systems (SiPS)
63. Miettinen K (1999) Nonlinear Multiobjective Op-
timization. Springer Science & Business Media
64. Miikkulainen R, Liang J, Meyerson E, Rawal
A, Fink D, Francon O, Raju B, Shahrzad H,
Navruzyan A, Duffy N, Hodjat B (2017) Evolv-
ing Deep Neural Networks. arXiv preprint arXiv:1703.00548
65. Mittal S (2016) A survey of techniques for ap-
proximate computing. ACM Computing Surveys
48(4):1–33
66. Montavon G, Lapuschkin S, Binder A, Samek W,
Müller KR (2017) Explaining nonlinear classifi-
cation decisions with deep Taylor decomposition.
Pattern Recognition 65:211–222
67. Montavon G, Samek W, Müller KR (2018) Meth-
ods for interpreting and understanding deep neu-
ral networks. Digital Signal Processing 73:1–15
68. Mutlu O (2017) The RowHammer problem and
other issues we may face as memory becomes
denser. In: Design, Automation & Test in Europe
Conference & Exhibition (DATE)
69. Pham H, Guan MY, Zoph B, Le QV, Dean J
(2018) Efficient neural architecture search via pa-
rameter sharing. In: International Conference on
Machine Learning
70. Piuri V (2001) Analysis of fault tolerance in ar-
tificial neural networks. Journal of Parallel and
Distributed Computing 61(1):18–48
71. Reagen B, Whatmough P, Adolf R, Rama S,
Lee H, Lee SK, Hernandez-Lobato JM, Wei GY,
Brooks D (2016) Minerva: Enabling low-power,
highly-accurate deep neural network accelerators.
In: ACM/IEEE 43rd Annual International Sym-
posium on Computer Architecture (ISCA), pp
267–278
72. Reagen B, Gupta U, Pentecost L, Whatmough P,
Lee SK, Mulholland N, Brooks D, Wei GY (2018)
Ares: A framework for quantifying the resilience
of deep neural networks. In: 55th Annual Design
Automation Conference (DAC)
73. Real E, Moore S, Selle A, Saxena S, Suematsu
YL, Tan J, Le QV, Kurakin A (2017) Large-scale
evolution of image classifiers. In: Precup D, Teh
YW (eds) Proceedings of the 34th International
Conference on Machine Learning, PMLR, Interna-
tional Convention Centre, Sydney, Australia, Pro-
ceedings of Machine Learning Research, vol 70, pp
2902–2911
74. Real E, Aggarwal A, Huang Y, Le QV (2019)
Aging Evolution for Image Classifier Architecture
Search. In: AAAI
75. Saikia T, Marrakchi Y, Zela A, Hutter F, Brox T
(2019) AutoDispNet: Improving disparity estimation with AutoML
76. Salami B, Unsal OS, Kestelman AC (2018) On
the resilience of RTL NN accelerators: Fault char-
acterization and mitigation. In: 30th International
Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp 322–
329
77. Sandler M, Howard A, Zhu M, Zhmoginov A,
Chen LC (2018) MobileNetV2: Inverted residu-
als and linear bottlenecks. In: IEEE Conference
on Computer Vision and Pattern Recognition
(CVPR)
78. Santos FFd, Pimenta PF, Lunardi C, Draghetti L,
Carro L, Kaeli D, Rech P (2019) Analyzing and
increasing the reliability of convolutional neural
networks on GPUs. IEEE Transactions on Relia-
bility 68(2):663–677
79. Saxena S, Verbeek J (2016) Convolutional neural
fabrics. In: Lee DD, Sugiyama M, Luxburg UV,
Guyon I, Garnett R (eds) Advances in Neural In-
formation Processing Systems 29, Curran Asso-
ciates, Inc., pp 4053–4061
80. Schorn C, Guntoro A, Ascheid G (2018) Accurate
neuron resilience prediction for a flexible reliabil-
ity management in neural network accelerators.
In: Design, Automation & Test in Europe Confer-
ence & Exhibition (DATE)
81. Schorn C, Guntoro A, Ascheid G (2018) Effi-
cient on-line error detection and mitigation for
deep neural network accelerators. In: Gallina B,
Skavhaug A, Bitsch F (eds) Computer Safety, Re-
liability, and Security (SAFECOMP), Springer,
LNCS, vol 11093
82. Schorn C, Guntoro A, Ascheid G (2019) An ef-
ficient bit-flip resilience optimization method for
deep neural networks. In: Design, Automation &
Test in Europe Conference & Exhibition (DATE),
pp 1486–1491
83. Sridharan V, DeBardeleben N, Blanchard S, Fer-
reira KB, Stearley J, Shalf J, Gurumurthi S (2015)
Memory errors in modern systems: The good, the
bad, and the ugly. In: Twentieth International
Conference on Architectural Support for Program-
ming Languages and Operating Systems (ASP-
LOS), pp 297–310
84. Srinivasan G, Wijesinghe P, Sarwar SS, Jaiswal
A, Roy K (2016) Significance driven hybrid 8T-6T
SRAM for energy-efficient synaptic storage in ar-
tificial neural networks. In: Design, Automation &
Test in Europe Conference & Exhibition (DATE)
85. Srivastava N, Hinton GE, Krizhevsky A, Sutskever
I, Salakhutdinov RR (2014) Dropout: A simple
way to prevent neural networks from overfitting.
Journal of Machine Learning Research 15:1929–
1958
86. Stallkamp J, Schlipsing M, Salmen J, Igel C (2012)
Man vs. computer: Benchmarking machine learn-
ing algorithms for traffic sign recognition. Neural
Networks 32:323–332
87. Stanley KO, Miikkulainen R (2002) Evolving neu-
ral networks through augmenting topologies. Evo-
lutionary Computation 10:99–127
88. Sze V, Chen YH, Yang TJ, Emer JS (2017)
Efficient processing of deep neural networks: A
tutorial and survey. Proceedings of the IEEE
105(12):2295–2329
89. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna
Z (2016) Rethinking the inception architecture for
computer vision. In: IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR),
pp 2818–2826
90. Tan M, Chen B, Pang R, Vasudevan V, Le QV
(2018) Mnasnet: Platform-aware neural architec-
ture search for mobile. arXiv preprint
91. Torres-Huitzil C, Girau B (2017) Fault and error
tolerance in neural networks: A review. IEEE Ac-
cess 5:17322–17341
92. Vanhoucke V, Senior A, Mao MZ (2011) Improv-
ing the speed of neural networks on CPUs. In:
Deep Learning and Unsupervised Feature Learn-
ing Workshop, NIPS 2011
93. Venkataramani S, Ranjan A, Roy K, Raghu-
nathan A (2014) AxNN: Energy-efficient neuro-
morphic systems using approximate computing.
In: IEEE/ACM International Symposium on Low
Power Electronics and Design (ISLPED), pp 27–
32
94. Vogel S, Springer J, Guntoro A, Ascheid G (2019)
Self-supervised quantization of pre-trained neural
networks for multiplierless acceleration. In: De-
sign, Automation & Test in Europe Conference
& Exhibition (DATE), pp 1088–1093
95. Wei T, Wang C, Rui Y, Chen CW (2016) Net-
work morphism. In: Balcan MF, Weinberger KQ
(eds) Proceedings of The 33rd International Con-
ference on Machine Learning, PMLR, New York,
New York, USA, Proceedings of Machine Learning
Research, vol 48, pp 564–572
96. Whatmough PN, Lee SK, Brooks D, Wei GY
(2018) DNN Engine: A 28-nm timing-error toler-
ant sparse deep neural network processor for IoT
applications. IEEE Journal of Solid-State Circuits
53(9):2722–2731
97. WikiChip (2019) FSD Chip - Tesla. URL
https://en.wikichip.org/wiki/fsd_chip
98. Williams S, Waterman A, Patterson D (2009)
Roofline: An insightful visual performance model
for multicore architectures. Communications of
the ACM 52(4):65–76
99. Wu B, Dai X, Zhang P, Wang Y, Sun F, Wu Y,
Tian Y, Vajda P, Jia Y, Keutzer K (2019) FBNet:
Hardware-aware efficient convnet design via differ-
entiable neural architecture search. arXiv preprint
100. Xia L, Liu M, Ning X, Chakrabarty K, Wang Y
(2017) Fault-tolerant training with on-line fault
detection for RRAM-based neural computing sys-
tems. In: 54th Annual Design Automation Con-
ference (DAC)
101. Xie S, Zheng H, Liu C, Lin L (2019) SNAS:
stochastic neural architecture search. In: Interna-
tional Conference on Learning Representations
102. Yang L, Murmann B (2017) SRAM voltage scal-
ing for energy-efficient convolutional neural net-
works. In: 18th International Symposium on Qual-
ity Electronic Design (ISQED), pp 7–12
103. Zela A, Elsken T, Saikia T, Marrakchi Y, Brox T,
Hutter F (2019) Understanding and Robustifying
Differentiable Architecture Search. arXiv preprint
104. Zhang C, Sun G, Fang Z, Zhou P, Pan P, Cong
J (2018) Caffeine: Towards uniformed representa-
tion and acceleration for deep convolutional neural
networks. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems
105. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D
(2018) mixup: Beyond empirical risk minimiza-
tion. In: International Conference on Learning
Representations (ICLR)
106. Zhang Q, Wang T, Tian Y, Yuan F, Xu Q (2015)
ApproxANN: An approximate computing frame-
work for artificial neural network. In: Design, Au-
tomation & Test in Europe Conference & Exhibi-
tion (DATE), pp 701–706
107. Zhong Z, Yang Z, Deng B, Yan J, Wu W, Shao
J, Liu CL (2018) BlockQNN: Efficient block-wise
neural network architecture generation. arXiv
preprint
108. Zoph B, Le QV (2017) Neural architecture search
with reinforcement learning. In: International
Conference on Learning Representations
109. Zoph B, Vasudevan V, Shlens J, Le QV (2018)
Learning transferable architectures for scalable
image recognition. In: Conference on Computer
Vision and Pattern Recognition
