Applying deep neural networks (DNNs) in mobile and safety-critical systems, such as autonomous vehicles, demands reliable and efficient execution on hardware. Optimized dedicated hardware accelerators are being developed to achieve this. However, the design of efficient and reliable hardware has become increasingly difficult, due to the increased complexity of modern integrated circuit technology and its sensitivity to hardware faults, such as random bit-flips. It is thus desirable to exploit optimization potential for error resilience and efficiency also at the algorithmic side, e.g. by optimizing the architecture of the DNN. Since there are numerous design choices for the architecture of DNNs, with partially opposing effects on the preferred characteristics (such as small error rates at low latency), multi-objective optimization strategies are necessary. In this paper, we develop an evolutionary optimization technique for the automated design of hardware-optimized DNN architectures. For this purpose, we derive a set of easily computable objective functions, which enable the fast evaluation of DNN architectures with respect to their hardware efficiency and error resilience solely based on the network topology. We observe a strong correlation between predicted error resilience and actual measurements obtained from fault injection simulations. Furthermore, we analyze two different quantization schemes for efficient DNN computation and find significant differences regarding their effect on error resilience.
Introduction
The application of deep neural networks (DNNs) in safety-critical perception systems, for example autonomous vehicles (AVs), poses challenges for the design of the underlying hardware platforms. On the one hand, efficient and fast accelerators are needed, since DNNs for computer vision exhibit massive computational requirements [55]. On the other hand, resilience against random hardware faults has to be ensured. In many driving scenarios, entering a fail-safe state is not sufficient; instead, fail-operational behavior and fault tolerance are required [48]. However, fault tolerance techniques at the hardware level often entail large redundancy overheads in silicon area, latency, and power consumption. These overheads stand in contrast to the low-power and low-latency requirements of embedded real-time DNN accelerators. Reliability concerns in nanoscale integrated circuits, for instance soft errors in memory and logic, represent an additional challenge for the realization of fault tolerance mechanisms at the hardware level [2, 33, 36, 68, 83]. Moreover, techniques such as near-threshold computing [26] and approximate computing [65] are desirable to meet power constraints, but can further increase error rates.
To overcome these challenges, one option is to exploit error resilience at the algorithm level and allow for a certain degree of inaccuracy at the hardware level.
arXiv:1909.13844v1 [cs.LG] 30 Sep 2019
This is referred to as cross-layer resilience [13]. Due to their implicit information redundancy, neural networks offer some robustness against random internal perturbations, which can be exploited in a cross-layer resilience approach. Nevertheless, error resilience is strongly influenced by the architectural design of the DNN [82] as well as its internal data representations [53]. These design choices, in turn, also influence hardware efficiency and classification performance of the network. Taking these multiple, partially opposing objectives into account in a manual DNN design procedure is non-trivial and cumbersome.
As a solution, we develop and evaluate an efficient, automated, multi-objective neural architecture search (NAS) technique in this paper, which holistically takes classification performance as well as hardware-specific objective functions into account. In detail, our contributions are the following:
1. We derive a set of objective functions for the prediction of error resilience, energy consumption, latency and required bandwidth of DNNs on hardware, solely based on the topology of their neural architecture, allowing a fast evaluation of these objectives by avoiding the need for expensive simulations or training of the neural network.
2. We integrate these functions in an efficient, evolutionary, multi-objective NAS algorithm that uses (approximate) network morphisms for a fast Pareto optimization of DNNs.
3. We evaluate our methods and obtained Pareto trade-offs on two popular image classification benchmarks, namely CIFAR-10 and German Traffic Sign Recognition Benchmark (GTSRB). In particular, we test the predictive performance of our fast error resilience prediction metric by taking silent data corruption (SDC) measurements, employing a memory bit-flip fault injection framework.
4. We compare two recently introduced quantization techniques for hardware-efficient DNN inference with respect to resulting classification performance and error resilience characteristics of the neural networks.
To the best of our knowledge, this is the first paper combining error resilience and hardware efficiency optimization in the context of neural architecture search. The remainder of this paper is structured as follows. In Section 2, we give an overview of related work. In Section 3, we introduce our methodology. This includes the derivation of hardware-specific objective functions, neural network quantization techniques and the multi-objective optimization algorithm used in this paper. In Section 4, we evaluate the outcome of our methods on two image classification benchmarks. We analyze the trade-offs between Pareto-optimal solutions, perform fault injections to compare predicted and measured resilience, and evaluate the characteristics of two different DNN quantization methods. We close our paper with a summary and conclusions in Section 5.
Background and related work
We now give an overview of related error resilience analysis (Section 2.1), resilience optimization techniques for neural networks (Section 2.2) as well as preliminaries on multi-objective optimization (Section 2.3) and neural architecture search (NAS) (Section 2.4).
Neural network resilience analysis
Understanding a neural network's resilience against erroneous perturbations in its internal computations has been a topic of interest for decades already. Here, we give an overview of the most recent studies that target error resilience analysis of modern DNNs. An in-depth review of previous literature has been recently given by Torres-Huitzil and Girau [91] .
Experimental analysis
The majority of studies on error resilience in neural networks have been experimental. They range from physical fault induction experiments in real hardware devices [78, 96], through fault injections in (virtual) hardware models [3, 53, 76, 78], to error simulations at the algorithmic behavior level [62, 72, 80]. Behavioral analysis can be connected to realistic hardware faults in a second step, by mapping the effect of these faults to error models in the algorithm domain [70]. For the model-based analysis, stuck-at-zero, stuck-at-one and random bit-flips of memory cells are commonly used. Stuck-at types are used to model permanent faults (e.g. resulting from manufacturing defects) and bit-flips are typically used to model radiation-induced transient faults that lead to soft errors [91].
In summary, experimental studies found different determinants of neural network resilience, the most important being the number and type of errors, the data representation of the neural network, the DNN type and the location where the error occurs. However, while experimental evaluation is useful for an accurate a posteriori resilience determination of a given DNN on hardware, it is cumbersome and provides only limited insight into a priori design choices for DNN developers to improve resilience at the algorithm level.
Theoretical analysis
A theory-guided resilience analysis offers the advantage of being more directly interpretable and avoids lengthy fault injection experiments. El Mhamdi and Guerraoui [28] analytically derived easily computable bounds for the forward error propagation of neurons that are stuck-at-zero (crashed neurons) and for neurons that transmit arbitrary values (Byzantine neurons). They found that the choice of activation function and the number of neurons per layer both affect the forward error propagation. More precisely, an activation function with a low Lipschitz constant as well as a high number of neurons per layer can reduce forward error propagation.
A different analytical technique to derive neuron resilience prediction has been used in the context of approximate neural network computing. Backpropagation of error gradients, comparable to the technique used to determine weight updates during neural network training, has been used to estimate the average output sensitivity to perturbations in individual neurons [93, 106] .
Recently, Schorn et al. [80] showed that a technique based on layer-wise relevance propagation (LRP) [4] outperforms gradient-based resilience prediction. In contrast to gradient methods, which determine the sensitivity to small perturbations in neurons, LRP attributes to each neuron its absolute contribution to the DNN output [67], which can be interpreted as a layer-wise Taylor decomposition [66]. A high neuron relevance, averaged over a training set of input samples, corresponds to a high sensitivity to errors [80].
Neural network resilience optimization
The optimization of neural network error resilience at the algorithm level is an active field of research. A number of publications simulate the effects of hardware faults during neural network training to improve resilience [22, 102, 45, 100, 56] . Reference [22] considers timing variations, [102, 45] static random-access memory (RAM) supply voltage scaling and [100, 56] hard defects in memristors and resistive RAM respectively. The drawback of these approaches is that they complicate the training process, since fault injections have to be performed by placing hardware in the training loop or through realistic fault simulations. Common regularizing techniques, such as dropout [44, 85] and weight decay [50] , also improve the general error resilience of neurons [28] .
A second approach is to adjust the mapping of the algorithm to hardware for optimized resilience. A significance-driven mapping of network weight bits to memory cells with different resilience has been suggested in [84]. However, the authors did not follow an analytical approach to determine weight resiliencies, but relied on their experience. In contrast, the LRP-based method in [80] gives a theoretically founded resilience mapping of neurons.
A third approach is to use modifications in hardware that are tailored to exploit the algorithmic resilience properties of neural networks. This can be zero-biased [3] or selectively hardened [53] memory cells, optimized data representations [96] , masking techniques [71, 76] , anomaly detectors [53, 81] and relaxed versions of classical fault tolerance mechanisms, such as triple modular redundancy (TMR) [61] and algorithm-based fault tolerance (ABFT) checksums [78] .
Modifications of the neural architecture to increase resilience have been proposed as well. Dias et al. [24] suggest a resilience optimization procedure by replication of critical neurons and weights. However, they use exhaustive simulation to determine criticality values, which is infeasible for large-scale DNNs. Schorn et al. [82] showed that critical layers can be identified using LRP. Nevertheless, no automated neural architecture design technique that jointly optimizes error resilience as well as other desirable performance and efficiency objectives of DNNs has been introduced so far.
Multi-objective optimization
In multi-objective optimization (see, e.g. [63]), one tries to optimize multiple, competing objective functions $f_1, \dots, f_k$ over a space $\mathcal{N}$ of feasible solutions (in our case: a space of neural network architectures). Usually, there will be no $N^* \in \mathcal{N}$ that minimizes all objectives $f_1, \dots, f_k$ at the same time (as the objectives compete with each other). Instead, there are multiple Pareto-optimal solutions, meaning that one cannot reduce any $f_i$ without increasing at least one $f_j$. Formally, we say that $N_1$ dominates $N_2$ iff $f_i(N_1) \le f_i(N_2)$ for all $i \in \{1, \dots, k\}$ and $f_j(N_1) < f_j(N_2)$ for at least one $j$. $N^*$ is called Pareto-optimal iff $N^*$ is not dominated by any other $N \in \mathcal{N}$. The set of Pareto-optimal solutions is the so-called Pareto front. Typically, multi-objective optimization can only determine a subset $\mathcal{P}$ that approximates this Pareto front.
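The dominance relation and the resulting set of non-dominated solutions can be illustrated with a minimal Python sketch. The function names and the four candidate objective vectors are hypothetical and chosen only for illustration; all objectives are assumed to be minimized.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector a dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: Sequence[Sequence[float]]) -> List[Sequence[float]]:
    """Keep every point that is not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error rate, sensitivity) pairs for four candidate networks.
candidates = [(0.10, 0.9), (0.20, 0.4), (0.15, 0.95), (0.30, 0.2)]
front = pareto_front(candidates)
```

Here the third candidate is removed because the first one is at least as good in both objectives and strictly better in one, while the remaining three represent different trade-offs.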
In order to rate the overall performance of a given neural network $N \in \mathcal{P}$ across all objectives, the distance to the ideal point can be used as a metric [8]. The (approximate) coordinates $y_i$ of the ideal point in each objective dimension $i \in \{1, \dots, k\}$ can be determined by taking the componentwise minima of the objective functions $f_i(N)$ over the (approximated) Pareto front $\mathcal{P}$ [27]:
$$y_i = \min_{N \in \mathcal{P}} f_i(N). \qquad (1)$$
To enhance comparability, a normalized version of the distance to the ideal point can be computed [8]. To this end, the individual objective functions are first normalized,
$$\tilde{f}_i(N) = \frac{f_i(N) - y_i}{\max_{N' \in \mathcal{P}} f_i(N') - y_i},$$
so that $0 \le \tilde{f}_i(N) \le 1$. Then, a norm on the vector $\tilde{\mathbf{f}}(N) = (\tilde{f}_1(N), \dots, \tilde{f}_k(N)) \in \mathbb{R}^k$ is computed to measure the distance of $N$ from the ideal point. Blasco et al. [8] suggest to take the infinity norm for the purpose of trade-off analysis:
$$\|\tilde{\mathbf{f}}(N)\|_\infty = \max_{i \in \{1, \dots, k\}} \tilde{f}_i(N).$$
That way, a score between 0 and 1 is obtained, which supplies information about the worst objective value. For example, a value of 1 means that $N$ has the worst observed performance in at least one of the objectives.
We refer to $\|\tilde{\mathbf{f}}(N)\|_\infty$ as normalized worst objective value in the remainder of this paper.
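As an illustration, the following Python sketch computes this score for every member of a small front. It assumes min-max normalization between the ideal point (componentwise minimum) and the componentwise maximum over the approximated front; the function name and the example values are hypothetical.

```python
from typing import List, Sequence


def normalized_worst_objective(front: Sequence[Sequence[float]]) -> List[float]:
    """Infinity norm of the min-max normalized objective vector of each point."""
    k = len(front[0])
    ideal = [min(p[i] for p in front) for i in range(k)]  # ideal point y_i
    worst = [max(p[i] for p in front) for i in range(k)]  # worst observed value
    scores = []
    for p in front:
        normalized = [
            # Assumption: an objective that is constant over the front maps to 0.
            (p[i] - ideal[i]) / (worst[i] - ideal[i]) if worst[i] > ideal[i] else 0.0
            for i in range(k)
        ]
        # All entries are non-negative, so the infinity norm is the maximum.
        scores.append(max(normalized))
    return scores


# Hypothetical front with two objectives per network.
front = [(0.10, 0.9), (0.20, 0.4), (0.30, 0.2)]
scores = normalized_worst_objective(front)
```

The two extreme points of the front obtain a score of 1, since each of them is worst in one objective, whereas the balanced middle point obtains a score of about 0.5.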
Neural Architecture Search
One crucial aspect for the success of deep learning in recent years was the design of novel neural network architectures [35, 40, 77, 89]. However, manually designing such architectures is a cumbersome trial-and-error process. To overcome the need for architectural engineering, neural architecture search (NAS), the process of automatically designing neural network architectures, has arisen as a subfield of automated machine learning [41]. By now, architectures found by NAS have outperformed human-designed architectures on a variety of tasks such as image recognition [74], object detection [109] or dense prediction tasks [17, 75]. We briefly summarize related work here and refer to the survey by Elsken et al. [31] for a more thorough literature overview. Reinforcement learning techniques [5, 108, 107, 109] or evolutionary methods [87, 64, 73, 74] were employed to search for well performing architectures. As early work required vast amounts of computational resources, often in the range of hundreds or even thousands of GPU days [108, 109, 74], making NAS more efficient was the focus of many researchers, e.g. by employing network morphisms [9, 10, 29], by sharing weights [79, 7, 69] or by performance prediction [6, 47]. A recent series of works [58, 101, 11, 103] employed a real-valued relaxation of the discrete architecture search space, enabling gradient-based optimization.
While the previously discussed approaches solely optimize for a single objective, namely minimizing some error rate, there has also been some work on multi-objective neural architecture search [46, 25, 90, 60, 99, 16, 39, 30], optimizing other objectives such as network size, latency or energy consumption concurrently. [25] extend [57] by considering multiple objectives during the model selection step. [60] employ NSGA-II [21], a well known multi-objective optimization algorithm, in the context of NAS. Instead of actually solving the multi-objective problem, many researchers use scalarization methods, such as the weighted product or sum method [20], to obtain a single objective. This is then optimized via, e.g. reinforcement learning [90, 39] or differentiable NAS [99]. [12] use multi-objective Bayesian optimization to search for convolutional cells [109]. In this work, we build upon the multi-objective evolutionary method LEMONADE [30], which exploits cheap-to-evaluate objectives to make the search more efficient. This fits our application well, since our objectives are solely based on the neural network architecture (and not, e.g., on expensive simulations or trained neural network weights) and are thus cheap to compute. We discuss LEMONADE in more detail in Section 3.2.1.
Hardware-focused neural architecture design
In this section, we introduce our framework for the automated design of error-resilient and hardware-efficient DNN architectures. In a first step, we identify optimization goals that typically appear in embedded DNN hardware applications and derive corresponding objective functions (Section 3.1). These functions then serve as input to a multi-objective neural architecture search algorithm (Section 3.2). Fixed-point quantization is applied as a post-processing step after NAS to enable efficient DNN execution on dedicated hardware accelerators (Section 3.3).
Hardware-specific objectives
We consider four different objectives that are commonly desirable in embedded DNN hardware applications, namely high error resilience, low latency, high energy efficiency and a low bandwidth requirement.
Error resilience
In the context of this paper, error resilience is regarded as robustness of the neural network classifier against perturbations in its neuron activation values. Such perturbations can be the result of random hardware faults, such as radiation-induced bit-flips. We measure the degree of perturbation using bit error rate (BER), i.e. the fraction of flipped bits across all activations of the DNN. We define architecture sensitivity at a given BER as probability for the predicted class output to differ, with and without bit errors. In order to maximize error resilience, we want to minimize architecture sensitivity.
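The perturbation model behind the BER can be sketched in Python as follows. This is a minimal illustration, not the fault injection framework used in the paper: it assumes that activations are stored as unsigned 8-bit words and that every bit flips independently with probability equal to the BER. The function names are hypothetical.

```python
import random
from typing import List


def inject_bit_flips(words: List[int], ber: float, bits: int = 8, seed: int = 0) -> List[int]:
    """Flip each bit of each word independently with probability `ber`."""
    rng = random.Random(seed)  # fixed seed for a reproducible experiment
    corrupted = []
    for word in words:
        mask = 0
        for bit in range(bits):
            if rng.random() < ber:
                mask |= 1 << bit
        corrupted.append(word ^ mask)  # XOR flips exactly the masked bits
    return corrupted


def measured_ber(clean: List[int], corrupted: List[int], bits: int = 8) -> float:
    """Fraction of bits that differ between the clean and corrupted words."""
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(clean, corrupted))
    return flipped / (len(clean) * bits)


# Hypothetical 8-bit activation values of one layer.
activations = [0, 17, 128, 255] * 250
noisy = inject_bit_flips(activations, ber=0.01)
```

Estimating the architecture sensitivity itself would additionally require running a trained classifier on clean and corrupted activations and counting how often the predicted class differs.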
Following the approach in [80] and [82], we derive an architecture-dependent error sensitivity metric using LRP. A key prerequisite in the mathematical framework of LRP is the relevance conservation principle [67]. It ensures that the total amount of neuron relevance, which is propagated backwards through the network after the forward pass of inference on an input sample, is conserved in each layer. Consequently, for a group of neurons $k$ and its inputs $j$,
$$\sum_j r_j = \sum_j \sum_k r_{j \leftarrow k} = \sum_k r_k,$$
where $r_j$ and $r_k$ are the relevance values attributed to neurons $j$ and $k$, respectively, and $r_{j \leftarrow k}$ is the amount of relevance propagated backwards from neuron $k$ to neuron $j$. The conservation principle is motivated by the fact that an output activation of neuron $k$ can be completely decomposed into contributions of its input neurons $j$.
The relevance distribution among the neurons in each layer depends on their activations and the synaptic weights [67]. For the initial backpropagation step, the final output neuron relevance of the DNN is predetermined by the one-hot encoded target vector belonging to each input sample. This ensures that $\sum_k r_k = 1$ in each layer. Consequently, for a uniformly randomly drawn neuron in a layer $l$, the expected relevance is
$$\mathbb{E}\left[r^{(l)}\right] = \frac{1}{n^{(l)}_{\text{outputs}}},$$
where $n^{(l)}_{\text{outputs}}$ is the total number of neurons in that layer. The observation that a higher average relevance corresponds to a more likely change of the DNN classification output suggests that layers with few neurons are more sensitive to errors [80, 82].
Effect of max-pooling. Max-pooling is commonly used in some layers of a DNN, in order to reduce the output dimensions of that layer [51]. A max-pooling stage divides the outputs of a layer into subsets and selects the maximum output value out of each subset. We do not regard max-pooling as a separate layer, but consider it as an attachment to a layer. If a layer $l$ has max-pooling, the reduced number of output values after the pooling stage is taken to calculate $n^{(l)}_{\text{outputs}}$. Additionally, we observed an increased error sensitivity of neurons in layer $l$ if max-pooling is present in the subsequent layer $l + 1$. We suppose that this is because information about the input sample is reduced by the pooling stage, but critical errors, which are mostly changes from a low to a high activation value [53], are likely to propagate through. Thus, we obtain an effective error sensitivity of neurons in layer $l$ by multiplication with the pooling factor of layer $l + 1$. The pooling factor is the ratio of input to output dimension of the pooling stage and equals 4 for the max-pooling layers that we use throughout our experiments.
Effect of merge layers. Skip connections, i.e. concurrent paths through the network, can improve the training of deep architectures and thus have become popular in state-of-the-art DNNs [34] . At some point in the network, the parallel paths have to be merged again, which can be done by componentwise addition [35] or by feature concatenation [89] . While a concatenation does not affect error propagation, an add layer increases error sensitivity of the DNN. There are two reasons for this. Firstly, an add layer involves additional (error prone) load, accumulate and store operations, while concatenation only involves the change of the address range from which data is loaded in the subsequent layers.
Secondly, the fraction of neurons affected by errors is likely to increase through the add operation. If two inputs with an equal and small fraction of erroneous neurons are added, the resulting fraction of erroneous neurons doubles as long as the error locations in the inputs do not coincide. This can be regarded as doubling the effective error sensitivity of the neurons preceding the add operation.
Architecture sensitivity index. The aforementioned insights are now used to define a metric that estimates the error sensitivity of a neural network $N$ solely based on the topology of its architecture. We call this metric architecture sensitivity index (ASI). It is defined as the sum of the expected error sensitivities over all layers $L_N$ of $N$,
$$f_{\text{ASI}}(N) = \sum_{l \in L_N} \frac{\lambda^{(l)} \, \zeta^{(l)}}{n^{(l)}_{\text{outputs}}},$$
where $\lambda^{(l)}$ is the max-pooling factor of the succeeding layer $l + 1$ (i.e. 1 for no pooling) and $\zeta^{(l)}$ is 2 if $l$ is connected to an add layer, and 1 otherwise.
Concatenations are not counted as extra layers in this sum, while add layers are. Furthermore, supportive functionalities, such as activation function, pooling and batch normalization [42] are not considered as separate layers, but included in the neuron layers.
We want to emphasize that f ASI can be computed very easily, since it only depends on the network topology and does not require any training or other expensive computations.
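A minimal Python sketch of such a topology-only computation is given below. It assumes that each layer contributes its expected relevance $1/n^{(l)}_{\text{outputs}}$, scaled by the pooling factor of the succeeding layer and by a factor of 2 if the layer feeds an add layer, as described above. The `Layer` descriptor and the example network are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    n_outputs: int             # neurons after the layer's own pooling stage
    next_pool_factor: int = 1  # lambda: pooling factor of layer l + 1 (1 = no pooling)
    feeds_add: bool = False    # zeta = 2 if the layer is connected to an add layer


def architecture_sensitivity_index(layers: List[Layer]) -> float:
    """Sum of effective expected error sensitivities over all layers."""
    return sum(
        layer.next_pool_factor * (2 if layer.feeds_add else 1) / layer.n_outputs
        for layer in layers
    )


# Hypothetical four-layer network; the last layer has 10 class outputs.
net = [
    Layer(n_outputs=1024, next_pool_factor=4),
    Layer(n_outputs=256),
    Layer(n_outputs=64, feeds_add=True),
    Layer(n_outputs=10),
]
asi = architecture_sensitivity_index(net)
```

In this example the small output layer dominates the index (0.1 out of roughly 0.139), which matches the observation that layers with few neurons are the most sensitive.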
Latency
Aside from error resilience, real-time inference with low latency is an additional necessity in many applications. AVs, for instance, should be able to derive driving actions from sensory input in less than 100 ms, in order to surpass human-level perception performance and provide a sufficient level of safety [55] . While low latency can be achieved by employing a parallelized hardware architecture and a high operating frequency, the performance of a DNN accelerator is constrained by manufacturing, power consumption, reliability, and flexibility requirements. Thus, a reduction of computational complexity at the algorithm level is desirable.
The roofline model [98] is commonly used to describe the attainable computational performance of a DNN accelerator [104] . It defines two operational domains, which are entered depending on the computational workload of the accelerator. In the memory-bound domain, latency is determined by the amount of data transfer to memory and the available memory bandwidth. In the compute-bound domain, latency can be regarded as being proportional to the number of operations required by the algorithm.
Being compute-bound is preferable over memorybound operation, since it allows maximum utilization of the available computational resources and highest throughput. Thus, we assume an accelerator, whose memory bandwidth is sufficiently large so that it will predominantly operate in the compute-bound domain for the workloads considered in this paper. We can therefore take the number of operations of the DNN as approximate determinant of latency. Furthermore, we regard the number of operations as being solely dependent on the neural network architecture, i.e. we do not consider any data-dependent operation reductions.
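The two domains of the roofline model can be made concrete with a short Python sketch: attainable performance is the minimum of the peak compute performance and the product of memory bandwidth and operational intensity (operations per byte transferred). The accelerator figures below are hypothetical and serve only as an example.

```python
def attainable_performance(peak_ops_per_s: float,
                           bandwidth_bytes_per_s: float,
                           ops_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def is_compute_bound(peak_ops_per_s: float,
                     bandwidth_bytes_per_s: float,
                     ops_per_byte: float) -> bool:
    """True if the memory system can feed the compute units at peak rate."""
    return bandwidth_bytes_per_s * ops_per_byte >= peak_ops_per_s


# Hypothetical accelerator: 1 Tera-op/s peak, 10 GB/s memory bandwidth.
PEAK = 1.0e12
BANDWIDTH = 1.0e10

high_intensity = attainable_performance(PEAK, BANDWIDTH, 200.0)  # compute-bound
low_intensity = attainable_performance(PEAK, BANDWIDTH, 20.0)    # memory-bound
```

For this hypothetical accelerator, a workload needs at least 100 operations per transferred byte to stay in the compute-bound domain, where latency scales with the number of operations.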
Our objective function for latency reduction is given by
$$f_{\text{NOPS}}(N) = \sum_{l \in L_N} n^{(l)}_{\text{op}},$$
where $n^{(l)}_{\text{op}}$ counts the number of operations of layer $l$.
Energy efficiency
A further frequent demand on embedded DNN accelerators is a low energy consumption per classification inference. There are two main reasons for this. Firstly, mobile devices have a limited amount of energy storage capacity and thus energy-efficient DNN accelerators are required, for example to extend the battery life and range of AVs. Secondly, embedded devices often have a strict size limitation, which makes it difficult to realize the necessary heat dissipation. As the thermal leakage power of an accelerator directly depends on the number of classifications per second and the energy per classification, energy efficiency is desirable to enable high classification throughput. Energy consumption of DNN accelerators is dominated by data transfers to and from memory [88]. This is due to the large number of parameters and intermediate data outputs of typical large-scale DNNs.
According to Horowitz [38] , energy consumption for off-chip dynamic RAM access is about two orders of magnitude higher than for internal cache accesses and arithmetic operations. While some hardware designers increase energy efficiency by integrating huge on-chip static RAMs in their DNN accelerator (e.g. [97] ), this approach is not feasible in every case. In this paper, we assume an accelerator with small on-chip buffer (such as [15] ), so that a layerwise data transfer to and from off-chip memory is necessary, which dominates energy consumption.
Consequently, to maximize energy efficiency, our objective is to minimize data transfer to and from memory per inference. We neglect the number of operations in this calculation because of its limited influence on energy consumption and since it is already part of the latency minimization objective function. To determine the data transfer of a layer, we assume that each input and weight parameter of the layer is loaded once from external memory and each output is written back once. Furthermore, we assume that the same bit-width is used to represent all activations and parameters of the network.
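Our objective function for minimizing energy consumption is thus given by the sum of layerwise input, output, and parameter data word transfers over the whole network,
$$f_{\text{DT}}(N) = \sum_{l \in L_N} \left( n^{(l)}_{\text{inputs}} + n^{(l)}_{\text{outputs}} + n^{(l)}_{\text{params}} \right),$$
where $n^{(l)}_{\text{inputs}}$ and $n^{(l)}_{\text{outputs}}$ count the number of input neurons and output neurons, respectively, and $n^{(l)}_{\text{params}}$ counts the number of parameters of layer $l$.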
Bandwidth requirement
As described in Section 3.1.2, we assume the accelerator for which we optimize DNN architectures to operate predominantly in the compute-bound domain of the roofline model. In order to guarantee compute-bound operation, the accelerator has to provide a certain maximum bandwidth to memory. It is desirable to keep this bandwidth requirement within bounds to simplify the accelerator architecture.
The required memory bandwidth can vary for the different layers of a DNN. We employ the ratio between data transfers and operations of a layer as estimator for its bandwidth requirement. The intuition behind this is that a low number of operations is related to a short processing time of the layer and consequently a high bandwidth is required to be able to perform the necessary data movements in that given time.
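We define an overall objective function to optimize neural architectures for a low bandwidth requirement by adding up the data-computation ratios of all layers. Thus, our objective function for minimizing the bandwidth requirement is given by the accumulated data-computation ratio (ADCR):
$$f_{\text{ADCR}}(N) = \sum_{l \in L_N} \frac{n^{(l)}_{\text{inputs}} + n^{(l)}_{\text{outputs}} + n^{(l)}_{\text{params}}}{n^{(l)}_{\text{op}}}.$$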
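The three efficiency objectives (number of operations, data transfer, and ADCR) can all be evaluated from per-layer counts, as the following Python sketch shows. The helper `conv_layer_stats` is hypothetical and rests on assumptions not stated in the text: stride 1, "same" padding, no bias terms, and one multiply-accumulate counted as one operation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayerStats:
    n_inputs: int
    n_outputs: int
    n_params: int
    n_op: int

    @property
    def data_words(self) -> int:
        # Each input and weight is loaded once; each output is written once.
        return self.n_inputs + self.n_outputs + self.n_params


def conv_layer_stats(h: int, w: int, c_in: int, c_out: int, k: int) -> LayerStats:
    """Counts for a k x k convolution (assumed: stride 1, same padding, no bias)."""
    return LayerStats(
        n_inputs=h * w * c_in,
        n_outputs=h * w * c_out,
        n_params=k * k * c_in * c_out,
        n_op=h * w * c_out * k * k * c_in,  # one MAC counted as one operation
    )


def num_operations(layers: List[LayerStats]) -> int:
    return sum(layer.n_op for layer in layers)


def data_transfer(layers: List[LayerStats]) -> int:
    return sum(layer.data_words for layer in layers)


def accumulated_data_computation_ratio(layers: List[LayerStats]) -> float:
    return sum(layer.data_words / layer.n_op for layer in layers)


# Hypothetical two-layer network on 32 x 32 RGB inputs.
net = [conv_layer_stats(32, 32, 3, 16, 3), conv_layer_stats(32, 32, 16, 32, 3)]
nops = num_operations(net)
dt = data_transfer(net)
adcr = accumulated_data_computation_ratio(net)
```

In this example the first layer has far fewer operations per transferred data word than the second, so it contributes most of the ADCR even though it accounts for the smaller share of the data transfer.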
Multi-objective NAS
In the following, we introduce LEMONADE, a Lamarckian Evolutionary algorithm for Multi-Objective Neural Architecture DEsign [30] , that we will use in our later experiments to automatically design well-performing, error-resilient, and hardware-efficient architectures.
LEMONADE
LEMONADE maintains a population $\mathcal{P}$ of neural networks $N$. This population is improved over the course of the algorithm with respect to the multi-objective optimization problem $\min_{N \in \mathcal{N}} \mathbf{f}(N)$, where $\mathcal{N}$ denotes a suitable space of neural network architectures (see Section 3.2.2) and the objective function
$$\mathbf{f}(N) = \begin{pmatrix} \mathbf{f}_{\exp}(N) \\ \mathbf{f}_{\text{cheap}}(N) \end{pmatrix} \in \mathbb{R}^m \times \mathbb{R}^n$$
is split into expensive-to-evaluate objectives $\mathbf{f}_{\exp}(N) \in \mathbb{R}^m$ (in our case: the validation error, only obtainable by expensive training) and cheap-to-evaluate objectives $\mathbf{f}_{\text{cheap}}(N) \in \mathbb{R}^n$ (in our case: the objectives defined in Section 3.1). The population $\mathcal{P}$ is chosen to comprise all non-dominated networks with respect to $\mathbf{f}$, i.e. the population approximates the Pareto front. LEMONADE exploits the fact that $\mathbf{f}_{\text{cheap}}$ is cheap to evaluate in order to bias the sampling of children towards areas of the Pareto front of $\mathbf{f}_{\text{cheap}}$ that are sparsely populated. While $\mathbf{f}_{\text{cheap}}$ is evaluated many times in LEMONADE, $\mathbf{f}_{\exp}$ is evaluated only a few times for promising networks that are likely to improve the approximation of the Pareto front.
In every iteration of LEMONADE, firstly parent networks are sampled with respect to some probability distribution (discussed later) that is solely based on the cheap objectives. By applying mutations to the parents (such as adding or removing a layer, see Section 3.2.2 for a detailed description), children are generated. In a second sampling stage, a subset of all generated children is selected, again solely based on cheap objectives, and solely this subset is evaluated on the expensive objectives f exp . Lastly, LEMONADE computes the Pareto front from the current generation and the subset of generated children, yielding the next generation. The described procedure is repeated for a pre-specified number of iterations.
The sampling distribution. The sampling distribution is designed to only depend on the cheap objectives and to guide the search towards sparsely crowded regions in the current Pareto front. In order to achieve this, LEMONADE computes a kernel density estimator $p_{\text{KDE}}$ on the cheap objective values $\{\mathbf{f}_{\text{cheap}}(N) \mid N \in \mathcal{P}\}$ of the current population. Then, for both sampling stages (i.e. (i) the probability of choosing a network $N$ as a parent as well as (ii) the probability of a generated child $N$ being part of the subset), LEMONADE uses a sampling distribution inversely proportional to $p_{\text{KDE}}$:
$$p_{\mathcal{P}}(N) = \frac{c}{p_{\text{KDE}}(\mathbf{f}_{\text{cheap}}(N))},$$
with a proper normalizing constant $c$. Therefore, networks in sparsely populated regions of the Pareto front are more likely to be chosen as parents, and generated children lying in sparsely populated regions of the Pareto front are more likely to be evaluated on $\mathbf{f}_{\exp}$. The motivation behind also choosing parents in less crowded regions is that mutations do not change the network drastically, hence children are expected to have similar objective values as their parents. By this sampling distribution and the two-staged sampling strategy, LEMONADE generates and evaluates more children that are likely to improve the current approximation of the Pareto front, rather than evaluating the expensive objective $\mathbf{f}_{\exp}(N)$ for all children, making it more efficient than off-the-shelf multi-objective optimization algorithms. We highlight that all objectives from Section 3.1 are cheap to evaluate, as they solely depend on the neural network architecture and not, e.g., on the weights of the network, which are only obtainable by expensive training. Hence, LEMONADE is well suited for our purpose. For more details, we refer the reader to the original work [30].
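The density-based sampling can be sketched in Python as follows. This is an illustrative simplification and not the LEMONADE implementation: it assumes a Gaussian kernel with a fixed, hand-chosen bandwidth and cheap objective values that are already scaled to comparable ranges. The function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def gaussian_kde(points: Sequence[Sequence[float]],
                 bandwidth: float) -> Callable[[Sequence[float]], float]:
    """Unnormalized Gaussian kernel density estimate over objective vectors."""
    def density(x: Sequence[float]) -> float:
        total = 0.0
        for p in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return total / len(points)
    return density


def inverse_density_probabilities(candidates: Sequence[Sequence[float]],
                                  population: Sequence[Sequence[float]],
                                  bandwidth: float = 0.1) -> List[float]:
    """Sampling probabilities inversely proportional to the estimated density."""
    density = gaussian_kde(population, bandwidth)
    # The small constant guards against division by zero far from all points.
    weights = [1.0 / (density(c) + 1e-12) for c in candidates]
    total = sum(weights)  # normalization, playing the role of the constant c
    return [w / total for w in weights]


# Hypothetical one-dimensional cheap objective values of the population:
# three networks are crowded together, one lies in a sparse region.
population = [(0.10,), (0.12,), (0.11,), (0.90,)]
probs = inverse_density_probabilities(population, population)
parents = random.Random(0).choices(population, weights=probs, k=2)
```

The isolated network at 0.90 receives the highest probability, so the search is steered towards the sparsely populated part of the front.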
Search space and mutations within LEMONADE
In this work, we focus on NAS for image classification tasks. Convolutional neural networks (CNNs) are the predominantly used type of DNN in this domain [51]. However, in recent years, the number of variations and design choices for CNN architectures has grown significantly (see e.g. [34] for an overview). We limit the search space of LEMONADE to a number of predefined building blocks, hyperparameters and allowed mutations for two reasons. Firstly, support for a limited set of building blocks requires less flexibility of the underlying hardware. This enables the use of more efficient dedicated DNN accelerators instead of general purpose hardware. Secondly, the space of feasible architectures $\mathcal{N}$ grows rapidly with each additional variation that is allowed. This combinatorial explosion slows down the convergence of NAS, which is why a reasonable limitation of the search space has to be chosen. We now describe the set of mutations that are used by LEMONADE in our experiments to generate child networks.
1. Insert a convolutional layer with batch normalization [42] and rectified linear unit (ReLU) activation [32]. The layer is inserted at a random position and its number of filters is chosen to match the number of filters of the preceding layer. The kernel height h and width w of the convolutional filter are randomly sampled: (h, w) ∈ {(3, 3), (5, 5), (7, 7), (9, 9)}.
2. Increase the number of filters of a randomly chosen convolution by a randomly chosen factor ∈ {2, 4}. A maximum of 1100 filters is allowed.
3. Add a skip connection. We allow skip connections either by concatenation [89] or by addition [35].
4. Remove a randomly chosen layer or a skip connection.
5. Prune a randomly chosen convolutional layer (i.e. remove 1/2 or 1/4 of its filters). A minimum of 15 filters is allowed.
6. Replace a randomly chosen convolution by a depthwise separable convolution [18].
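Three of the mutations listed above (1, 2 and 5) can be sketched on a simplified architecture description. The list-of-dicts representation, the function names, and the choice to clamp filter counts at the stated limits are assumptions made for illustration; the text does not specify how the limits are enforced.

```python
import copy
import random

MAX_FILTERS = 1100  # upper limit stated for mutation 2
MIN_FILTERS = 15    # lower limit stated for mutation 5
KERNELS = [(3, 3), (5, 5), (7, 7), (9, 9)]


def insert_conv(arch, rng):
    """Mutation 1: insert a conv layer matching the filter count of its predecessor."""
    child = copy.deepcopy(arch)
    pos = rng.randint(1, len(child))  # assumption: always keep one preceding layer
    child.insert(pos, {"op": "conv_bn_relu",
                       "filters": child[pos - 1]["filters"],
                       "kernel": rng.choice(KERNELS)})
    return child


def widen(arch, rng):
    """Mutation 2: multiply the filters of a random layer by 2 or 4 (clamped)."""
    child = copy.deepcopy(arch)
    layer = rng.choice(child)
    layer["filters"] = min(MAX_FILTERS, layer["filters"] * rng.choice([2, 4]))
    return child


def prune(arch, rng):
    """Mutation 5: remove 1/2 or 1/4 of the filters of a random layer (clamped)."""
    child = copy.deepcopy(arch)
    layer = rng.choice(child)
    removed_fraction = rng.choice([0.5, 0.25])
    layer["filters"] = max(MIN_FILTERS, int(layer["filters"] * (1 - removed_fraction)))
    return child


def mutate(arch, rng):
    """Pick one of the sketched mutations uniformly at random."""
    return rng.choice([insert_conv, widen, prune])(arch, rng)


rng = random.Random(0)
parent_arch = [{"op": "conv_bn_relu", "filters": f, "kernel": (3, 3)}
               for f in (32, 64, 128)]
child_arch = mutate(parent_arch, rng)
```

Each operator returns a modified copy, so the parent stays in the population unchanged while the child is evaluated.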
Note that by random we always mean uniformly at random. We highlight that the first three operations in general increase objectives such as the network's size or energy consumption, but likely also decrease objectives such as the error, while the last three operations in general decrease the former objectives, but increase the latter. Consequently, these mutations are suitable for multiple, opposing objectives. To further speed up NAS, the authors of LEMONADE propose to apply these mutations as network morphisms [14, 95]. Network morphisms are function-preserving operators on neural networks, i.e. a network morphism maps a neural network $N_w$ with weights $w$ to another neural network $\tilde{N}_{\tilde{w}}$ with weights $\tilde{w}$ so that $N_w(x) = \tilde{N}_{\tilde{w}}(x)$ for every input $x$. Effectively, this means that when network morphisms are used as mutations to generate children, the children do not need to be trained from scratch but only fine-tuned, since by design they have the same error as their parent. This can be interpreted as Lamarckian inheritance in the context of evolutionary algorithms, where Lamarckism refers to a mechanism which allows passing skills acquired during an individual's lifetime (e.g. by means of learning) on to children by means of inheritance. The equality $N_w(x) = \tilde{N}_{\tilde{w}}(x)$ can be achieved by properly choosing $\tilde{w}$. For example, if one wants to insert a linear layer at an arbitrary position in a network, equality can be achieved by simply initializing the linear layer as an identity mapping. Mutations 1-3 from above can all be formulated as network morphisms (see [30] for details). Mutations 4-6, on the other hand, cannot be framed as network morphisms, as they all generally decrease the network's capacity and equality cannot be guaranteed. Instead, Elsken et al. [30] propose approximate network morphisms to find a proper initialization for these cases.
Approximate network morphisms essentially copy the weights of layers not affected by structural changes and train affected layers via knowledge distillation [37] .
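The identity-initialization example for an exact network morphism can be checked numerically. The following sketch uses a toy fully connected network with ReLU activations; the weights and helper names are hypothetical and chosen only to show that the inserted layer leaves the network function unchanged.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def forward(layers, x):
    """Apply each weight matrix followed by a ReLU."""
    for weights in layers:
        x = relu(matvec(weights, x))
    return x


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical parent network: 2 inputs -> 3 hidden units -> 2 outputs.
parent = [
    [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]],
    [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]],
]

# Child: a new 3x3 layer initialized as the identity is inserted after the
# first layer. Its inputs are already non-negative (they come out of a ReLU),
# so the additional ReLU does not change them either.
child = [parent[0], identity(3), parent[1]]

inputs = [[1.0, 2.0], [-0.5, 0.3], [0.0, 0.0], [3.0, -4.0]]
outputs_match = all(forward(parent, x) == forward(child, x) for x in inputs)
```

Because the child starts from the parent's function, subsequent training only has to fine-tune the new layer rather than learn the whole network again.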
Fixed-point quantization
Neural network training algorithms usually rely on data representations and computations with high numerical precision, for example a 32-bit floating-point format, typically used in graphics processing units (GPUs). However, after training, a reduced-precision number format can be used for inference on a dedicated DNN accelerator to reduce energy consumption and bandwidth [54]. In this context, an 8-bit fixed-point format is a common choice in embedded and mobile devices [43]. Hence, to deploy a DNN on an embedded device after training on a GPU, weights, biases and activations need to be transformed from a floating-point to a fixed-point number format. This procedure is referred to as network quantization. We apply network quantization as a post-processing step after neural architecture search with LEMONADE.
To quantize a real value $\chi$ to a signed fixed-point value $\chi_q$ using $B$ bits, we determine
where $\Delta$ denotes the step size, i.e. the smallest distance between two quantization sampling points of $\chi$.
In other words, ∆ corresponds to the value of the least significant bit (LSB). In [92] , a simple method to find a suitable step size for a given data distribution in DNNs with sigmoid activations is introduced. It determines the step size ∆ based on the maximum range of a distribution according to
In the following, we refer to this quantization method as MaxRange. However, modern DNNs commonly use unbounded activation functions, such as ReLU, and thus may entail data distributions with far outliers. Since the quantization range is adapted to the maximum value, the step size $\Delta$ becomes large and consequently leads to a coarse sampling of smaller values. Moreover, as data in DNNs typically follow an approximately Gaussian distribution, in which most values are small compared to the maximum, (13) leads to a coarse sampling of a large number of values. A quantization method which specifically targets this problem has been introduced in [94]. Here, parameters and activations are quantized by minimizing the effect of the quantization error $\delta = \chi - \chi_q$ in the network. In a neural network, the output value $y$ of a neuron with a rectifying unit $\Phi(\cdot)$, bias $b$, weights $w$ and input values $x$ is determined by

$$y = \Phi\Big(\sum_i w_i x_i + b\Big). \qquad (14)$$
For the purpose of measuring the influence of the quantization error of inputs ($\delta_x$), weights ($\delta_w$) and biases ($\delta_b$), we define $\tilde{y}$ as the resulting neuron output when quantities of (14) are quantized. More precisely, $\tilde{y}_w$ is defined as the neuron output determined with quantized weights $w_q$ where activations and biases remain in a 32-bit floating-point number format. $\tilde{y}_x$ and $\tilde{y}_b$ are defined accordingly. The step sizes $\Delta^{(l)}$ are then individually determined for each layer by
We additionally constrain the step sizes to a power-of-two value, i.e. $\Delta \in \{2^z \mid z \in \mathbb{Z}\}$, to enable a direct fixed-point operation in a hardware accelerator. In the rest of the paper, this quantization method is referred to as minimal propagated quantization error (MinPQE).
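As an illustration of signed fixed-point quantization with a power-of-two step size, the sketch below rounds a value to the nearest multiple of the step and clips it to the representable range of B bits. The exact quantization and MaxRange formulas are not reproduced in this text, so both functions are assumptions: `quantize` uses the common round-and-clip form, and `max_range_step` picks the smallest power-of-two step whose range still covers the largest absolute value.

```python
import math


def quantize(value, step, bits=8):
    """Round `value` to a multiple of `step` and clip to the signed `bits`-bit range.

    Note: Python's round() rounds ties to the nearest even integer.
    """
    q_min = -(2 ** (bits - 1))
    q_max = 2 ** (bits - 1) - 1
    code = max(q_min, min(q_max, round(value / step)))
    return code * step


def max_range_step(values, bits=8):
    """Assumed MaxRange-style rule: smallest power-of-two step covering max |value|."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0  # arbitrary choice for an all-zero tensor
    q_max = 2 ** (bits - 1) - 1
    return 2.0 ** math.ceil(math.log2(max_abs / q_max))


# Hypothetical activations: mostly small values plus one far outlier.
activations = [0.1, -0.4, 0.3, 16.0]
step = max_range_step(activations)
quantized = [quantize(a, step) for a in activations]
```

With the outlier at 16.0 the step becomes 0.25, so the small activations are sampled coarsely. This is the effect the text describes for range-based step sizes in the presence of outliers.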
Experiments

Experimental setup
To evaluate our methods, we use two common image classification benchmarks. Firstly, CIFAR-10 [49] is used, which consists of 32 × 32 pixel RGB images divided into ten distinct classes. The samples are divided into 50 000 training and 10 000 test samples. Out of the training set, 5000 samples are used for validation during neural architecture search. Secondly, GTSRB [86] is used, which contains RGB images of 43 different types of traffic signs. The images of this benchmark are scaled to a resolution of 48 × 48 pixels before they are fed into the classifier. The dataset has 39 210 training samples, out of which 4010 are separated for classification validation. An additional set of 12 630 images is used for measuring final test error rates.
Unless otherwise noted, we use the same hyperparameter setup for both benchmarks. We run LEMONADE for 300 evolutionary iterations. The algorithm is initialized with a population of 15 manually chosen trivial network architectures with different numbers of convolutional layers and kernel shapes. For DNN training, we use stochastic gradient descent (SGD) with cosine annealing [59] , momentum of 0.9 and a weight decay of 0.0005. The learning rate for each training phase during architecture search is initialized with 0.01. The training batch size is set to 64 throughout our experiments. Furthermore, we apply commonly used data augmentations during training [59] . However, we leave out horizontal image flips for GTSRB, since they would change the meaning of some traffic signs. In addition, we use mixup [105] and cutout [23] for further training data augmentation.
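For reference, a cosine-annealed learning rate can be computed as sketched below. The single decay cycle without warm restarts and the minimum learning rate of zero are assumptions; the text only states that cosine annealing [59] is used with an initial learning rate of 0.01 during architecture search.

```python
import math


def cosine_annealing_lr(step, total_steps, initial_lr=0.01, min_lr=0.0):
    """Learning rate decayed from `initial_lr` to `min_lr` along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rates at the start, middle and end of a hypothetical 100-step phase.
schedule = [cosine_annealing_lr(s, 100) for s in (0, 50, 100)]
```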
The final population sizes of the CIFAR-10 and GTSRB models are 439 and 238, respectively. From each of these, the 50 architectures with the best validation error rates are selected and each of these is trained from scratch on the set of training and validation images for 200 epochs. The learning rate is initialized with 0.025 in this case and all other hyperparameters stay the same. Classification error is evaluated on the separate test set after the training. Subsequently, we quantize the networks' weights and activations to an 8-bit fixed-point representation using the MaxRange and MinPQE methods described in Section 3.3 for further evaluations.
Error simulations
Random bit-flip error simulations are used to evaluate the actual resilience of the obtained set of neural networks. For this purpose, we use the fault simulation framework that has been previously described in [82]. The framework builds on the Keras [19] DNN library with TensorFlow back-end [1]. This allows for performing fast bit-level fault injections in the neuron activation outputs (feature maps) of a CNN. Most of the computation workload required for the simulation can be efficiently computed on a GPU. The framework automatically adds some operations behind each neuron output stage of a given CNN, which emulate a fixed-point format and allow for a bit-wise fault injection in the neuron output memory by applying a definable Boolean fault mask (see Fig. 1).
Results
Trade-off analysis between objectives
It can be seen in Table 1 that choosing a DNN with minimal cost in one objective often leads to the outcome that at least one other objective is close to its worst value. This is especially the case for CIFAR-10, where $\|f(N)\|_\infty$ is 1 or close to 1 for all single-objective optimizers, BestASI, BestValErr, BestEfficiency, and BestADCR. The optimal trade-off models (BalOpt), however, come quite close to the ideal point, with normalized distances of 0.371 (CIFAR-10) and 0.267 (GTSRB).
Another aspect visible in Table 1 is that 8-bit quantization does not significantly increase test set classification error rates of the models in comparison to the 32-bit float case (in some cases the error is even smaller after quantization). The differences between the MaxRange and MinPQE quantization methods with respect to test error rate are marginal.
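The normalized distance to the ideal point reported with Table 1 can be illustrated with a small sketch. The text does not spell out the normalization or the selection rule for BalOpt, so the following is an assumption: each objective is min-max normalized over the selected models, the distance of a model is the infinity norm of its normalized objective vector, and the balanced model is the one with the smallest distance. All objective values are hypothetical.

```python
def normalize_objectives(objectives):
    """Min-max normalize each objective column to [0, 1] (0 = best, 1 = worst)."""
    columns = list(zip(*objectives))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in objectives]


def infinity_norm(normalized_row):
    """Distance to the ideal point (all zeros) under the infinity norm."""
    return max(normalized_row)


def balanced_optimum(objectives):
    """Index and distance of the model closest to the ideal point."""
    distances = [infinity_norm(row) for row in normalize_objectives(objectives)]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical (error, operations, ASI) values for four models; lower is better.
models = [(0.05, 9.0, 0.9), (0.09, 2.0, 0.2), (0.06, 4.0, 0.4), (0.12, 1.0, 0.1)]
best_index, best_distance = balanced_optimum(models)
```

Under this reading, a model that is optimal in one objective but worst in another has distance 1, which matches the observation made for the single-objective optimizers.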
The resulting distributions of objective values for all 50 models that were selected after the optimization with LEMONADE are shown in Fig. 2 and Fig. 3 for CIFAR-10 and GTSRB, respectively. The sub-figures (a)-(d) each depict the outcomes of f ASI (N ) versus one of the other objective functions. It can be seen that the WorstASI models have comparatively few operations and data transfers. However, the reverse is not always true, since there are models with few operations and data transfers as well as low ASI. In other words, it is possible to have high efficiency and high error resilience at the same time.
Another interesting aspect visible in Fig. 2 (d) and Fig. 3 (d) is a correlation between ADCR and ASI. Consequently, a low ratio of data transfers to operations is not only beneficial for limiting the required bandwidth of the DNN accelerator, but also helps to reduce error sensitivity. This aspect also becomes apparent in Fig. 4. It can be seen that models with more operations typically also require more data transfers. However, the BestASI models have a relatively high number of operations in comparison to their data transfers, as they are located off the main trend in the scatter plot.
Evaluation of resilience prediction
We now evaluate the predictive performance of our ASI metric by performing bit-flip fault injections using the framework described in Section 4.2. Bit-flips are randomly injected in all convolutional layer feature map outputs (after ReLU activation and pooling, where applicable) that are written to memory. MinPQE quantization with 8 bits is used, except where otherwise specified. The value of each bit in the feature map outputs is toggled with a probability given by a defined BER. To get statistically meaningful results [52] , random fault locations are sampled n = 200 times and for each trial the effect on the classification output of the network is measured using the complete test set of the respective benchmark. For this purpose, the classification change rate (CCR), i.e. the fraction of images in the test set that are classified differently after the fault injection, is calculated. The sample mean of CCR over all n = 200 trials is reported. This can be interpreted as expected probability of SDC at the given BER.
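The fault injection and CCR computation described above can be sketched in a few lines. This is not the framework from [82]: the function names are hypothetical, feature map values are modelled as raw 8-bit memory words, and the network inference that turns faulty feature maps into labels is left out.

```python
import random


def flip_bits(word, bit_error_rate, bits, rng):
    """Toggle each bit of a memory word independently with probability BER."""
    mask = 0
    for position in range(bits):
        if rng.random() < bit_error_rate:
            mask |= 1 << position
    return word ^ mask  # XOR with the Boolean fault mask


def to_signed(word, bits=8):
    """Interpret a raw memory word as a two's complement integer."""
    return word - (1 << bits) if word >= 1 << (bits - 1) else word


def inject_faults(feature_map, bit_error_rate, rng, bits=8):
    """Apply independent bit-flips to every word of a flattened feature map."""
    return [flip_bits(w, bit_error_rate, bits, rng) for w in feature_map]


def classification_change_rate(clean_labels, faulty_labels):
    """Fraction of test images whose predicted label changed after fault injection."""
    changed = sum(1 for a, b in zip(clean_labels, faulty_labels) if a != b)
    return changed / len(clean_labels)


def mean_ccr(run_trial, n_trials=200):
    """Sample mean of CCR over repeated trials with fresh random fault locations."""
    return sum(run_trial() for _ in range(n_trials)) / n_trials


rng = random.Random(0)
faulty_map = inject_faults([0, 17, 127, 255], bit_error_rate=0.003, rng=rng)
example_ccr = classification_change_rate([0, 1, 2, 3], [0, 1, 0, 3])
```

Here `run_trial` stands for one complete pass over the test set with newly sampled fault locations, returning the CCR of that pass.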
The results of a linear least-squares regression on the ASI versus CCR value pairs of the 50 optimized models for each benchmark are shown in Fig. 5. A BER of 0.003 was used for bit-flip injections. A correlation coefficient R = 0.741 is achieved for CIFAR-10 and R = 0.898 for GTSRB. While this indicates that the prediction is not perfectly accurate, the correlation is relatively strong. This is especially surprising, considering the fact that ASI is completely determined by the architecture of the neural network and does not require any cumbersome measurements based on test data or weight parameters. Thus, we argue that ASI is an efficient and useful metric to guide NAS towards more resilient DNN architectures.
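A linear least-squares fit and the associated correlation coefficient R can be computed as in the sketch below. The function name and the data points are hypothetical and only show the calculation; they are not the measured ASI and CCR values.

```python
import math


def linear_regression(x, y):
    """Ordinary least-squares line y = slope * x + intercept and Pearson's R."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical (ASI, CCR) pairs for a handful of models.
asi_values = [0.10, 0.20, 0.35, 0.50, 0.80]
ccr_values = [0.02, 0.03, 0.06, 0.07, 0.12]
slope, intercept, r = linear_regression(asi_values, ccr_values)
```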
We also evaluate CCRs for varying BERs for a subset of models. The results for CIFAR-10 and GTSRB are plotted in Fig. 6 and Fig. 7, respectively. An approximately linear dependency between BER and CCR can be observed at very low bit error rates. At higher BERs, the CCR first grows rapidly (note the log scales) and then saturates at a value corresponding to the chance probability of choosing the same label after fault injection. An interesting finding observable in Fig. 6 and Fig. 7 is that the BestValErr models exhibit an unexpectedly low CCR at low BERs, while they degrade less gracefully (much steeper increase in CCR) at high BERs. In the case of GTSRB, BestValErr is actually, despite its higher ASI, much more resilient than BestASI at low BERs. An explanation might be that a good baseline classification performance adds an extra degree of error resilience, which is not captured by ASI. The steeper increase, on the other hand, could be due to overfitting to the task (i.e. a weaker ability to generalize).
Comparison of quantization methods
Finally, we compare the MaxRange and MinPQE quantization methods (see Section 3.3) with respect to the resulting CCRs after bit-flip fault injections with a BER of 0.005. Results are shown in Fig. 8 and Fig. 9. The models are sorted in ascending order of CCR after MinPQE quantization in these figures. It can be seen that MaxRange results in a significantly worse CCR in most of the cases. This can be explained by the fact that MaxRange tends to quantize values to a larger range, which is determined by far outliers, while these outliers are ignored (i.e. clipped) by MinPQE. Consequently, MaxRange leads to a weaker signal-to-noise ratio compared to MinPQE in the case of bit-flip errors. We thus argue that MinPQE is the preferable method, since it achieves both low baseline classification error rates and high error resilience.
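The signal-to-noise argument can be made concrete with a small calculation. In a fixed-point word with step size ∆, flipping bit k changes the stored value by 2^k · ∆, so a larger step size directly scales the error caused by a bit-flip. The step sizes and the typical activation magnitude below are hypothetical, and the clipping-based step is only a stand-in for the effect of MinPQE, not its actual optimization.

```python
def bit_flip_error(bit_position, step):
    """Absolute change of a fixed-point value when one magnitude bit is flipped."""
    return (2 ** bit_position) * step


typical_activation = 0.5     # hypothetical magnitude of most activations
range_based_step = 2.0 ** -2  # step forced up by a far outlier
clipped_step = 2.0 ** -6      # step when outliers are clipped instead

# Error of flipping bit 6 (the highest magnitude bit of an 8-bit signed word),
# expressed relative to the typical activation magnitude.
relative_error_range = bit_flip_error(6, range_based_step) / typical_activation
relative_error_clipped = bit_flip_error(6, clipped_step) / typical_activation
```

In this example the same bit-flip disturbs the range-based representation 16 times more strongly than the clipped one, relative to the typical signal magnitude.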
Conclusions
We have introduced a method for hardware-focused and automated neural architecture design. Our proposed hardware-specific objective functions, which only require network topology information for their evaluation, enable fast design space exploration and allow the NAS algorithm to find Pareto-optimal solutions. This makes our method efficient and also applicable to more complex classification benchmarks than the ones considered in this paper. We verified the accuracy of the resilience prediction with memory bit-flip simulations and found it to be sufficiently accurate to guide our NAS algorithm towards architectural resilience optimization. Joint resilience, efficiency, and performance optimization has not been considered in the context of NAS before. Finally, our findings about the influence of different quantization techniques on DNN error resilience highlight the importance of choosing a quantization technique that fosters a high signal-to-noise ratio to limit the influence of bit-flip errors.
