FTT-NAS: Discovering Fault-Tolerant Neural Architecture by Ning, Xuefei et al.
FTT-NAS: Discovering Fault-Tolerant Neural
Architecture
Xuefei Ning, Guangjun Ge, Wenshuo Li, Zhenhua Zhu, Yin Zheng, Xiaoming Chen, Member, IEEE, Zhen Gao,
Yu Wang, Senior Member, IEEE, and Huazhong Yang, Fellow, IEEE
Abstract—With the rapid evolution of embedded deep-learning computing systems, applications powered by deep learning are moving from the cloud to the edge. When neural networks (NNs) are deployed onto devices in complex environments, various types of faults can occur: soft errors caused by cosmic radiation and radioactive impurities, voltage instability, aging, temperature variations, and malicious attacks. Thus the safety risk of deploying NNs is now drawing much attention. In this paper, after analyzing the possible faults in various types of NN accelerators, we formalize and implement various fault models from the algorithmic perspective. We propose Fault-Tolerant Neural Architecture Search (FT-NAS) to automatically discover convolutional neural network (CNN) architectures that are resilient to various faults in today's devices. Then we incorporate fault-tolerant training (FTT) in the search process to achieve better results, which is referred to as FTT-NAS. Experiments on CIFAR-10 show that the discovered architectures outperform other manually designed baseline architectures significantly, with comparable or fewer floating-point operations (FLOPs) and parameters. Specifically, with the same fault settings, F-FTT-Net discovered under the feature fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6% (vs. 60.8% achieved by ResNet-20). By inspecting the discovered architectures, we find that the operation primitives, the weight quantization range, the capacity of the model, and the connection pattern influence the fault resilience capability of NN models.
I. INTRODUCTION
CONVOLUTIONAL Neural Networks (CNNs) have achieved breakthroughs in various tasks, including classification [1], detection [2], and segmentation [3]. Due to their promising performance, CNNs have been utilized in various safety-critical applications, such as autonomous driving, intelligent surveillance, and identification. Meanwhile, driven by recent academic and industrial efforts, neural network accelerators based on various hardware platforms (e.g., Application Specific Integrated Circuits (ASIC) [4], Field Programmable Gate Array (FPGA) [5], Resistive Random-Access Memory (RRAM) [6]) have been rapidly evolving. The robustness and reliability issues of deploying neural networks onto various embedded devices for safety-critical applications are attracting more and more attention.
This work was supported by National Key R&D Program of China (No.
2016YFB0800900), a 973 project; National Natural Science Foundation of
China (No. 61532017, 61621091); Xilinx and Beijing Innovation Center
for Future Chips; the project of Tsinghua University and Toyota Joint
Research Center for AI Technology of Automated Vehicle (TT2019-01);
Beijing Academy of Artificial Intelligence (No. BAAI2019QN0402).
X. Ning, G. Ge, W. Li, Z. Zhu, Y. Wang, H. Yang are with Department
of Electronic Engineering, Tsinghua University, Beijing, China, and Beijing
National Research Center for Information Science and Technology (BNRist)
(e-mail: yu-wang@tsinghua.edu.cn).
Y. Zheng is with Weixin Group, Tencent, Beijing, China.
X. Chen is with State Key Laboratory of Computer Architecture, Institute
of Computing Technology, Chinese Academy of Sciences, Beijing, China.
Z. Gao is with School of Electrical and Information Engineering, Tianjin
University, Tianjin, China.
There is a large stream of algorithmic studies on various
robustness-related characteristics of NNs, e.g., adversarial ro-
bustness [7], data poisoning [8], interpretability [9] and so
on. However, no hardware models are taken into consideration in these studies. Besides the issues from the purely algorithmic perspective, there exist hardware-related reliability issues when deploying NNs onto today's embedded devices. With the down-scaling of CMOS technology, circuits
become more sensitive to cosmic radiation and radioactive
impurities [10]. Voltage instability, aging, and temperature
variations are also common effects that could lead to errors.
As for the emerging metal-oxide RRAM devices, due to the immature technology, they suffer from many types of device faults [11], among which hard faults such as Stuck-at-Faults (SAFs) damage the computing accuracy severely and cannot be easily mitigated [12]. Moreover, malicious attackers
can attack the edge devices by embedding hardware Trojans,
manipulating back-doors, and doing memory injection [13].
Recently, some studies [14], [15], [16] analyzed the sensi-
tivity of NN models. They proposed to predict whether a layer
or a neuron is sensitive to faults and protect the sensitive ones.
For fault tolerance, a straightforward way is to introduce re-
dundancy in the hardware. Triple Modular Redundancy (TMR)
is a commonly used but expensive method to tolerate a single
fault [17], [18], [19]. Studies [12], [14] proposed various re-
dundancy schemes for Stuck-at-Faults tolerance in the RRAM-
based Computing Systems. For increasing the algorithmic fault
resilience capability, studies [20], [21] proposed to use fault-
tolerant training (FTT), in which random faults are injected in
the training process.
Although redesigning the hardware for reliability is effective, it is not flexible and inevitably introduces large overhead. It would be better if the issues could be mitigated as far as possible from the algorithmic perspective. Existing methods mainly focus on designing training methods and analyzing the weight distribution [16], [20], [21]. Intuitively, the neural architecture might also be important for the fault tolerance characteristics [22], [23], since it determines the “path” of fault propagation. To verify this intuition, the accuracies of baselines under a random bit-bias feature fault model¹ are shown in Table I, and the results under the SAF weight fault model² are shown in Table II. These preliminary experiments on the CIFAR-10 dataset show that the fault tolerance characteristics vary among neural architectures, which motivates the use of the neural architecture search (NAS) technique in the design of fault-tolerant neural architectures. We emphasize that our work is orthogonal to most of the previous methods based on hardware or mapping strategy design. To the best of our knowledge, our work is the first to increase the algorithmic fault resilience capability by optimizing the NN architecture.
arXiv:2003.10375v1 [eess.SP] 20 Mar 2020
TABLE I: Performance of the baseline models with random bit-bias feature faults. 0/10^-5/10^-4 denotes the per-MAC fault rate.
Model          Acc (0 / 10^-5 / 10^-4)   #Params   #FLOPs
ResNet-20      94.7 / 63.4 / 10.0        11.2M     1110M
VGG-16†        93.1 / 21.4 / 10.0        14.7M     626M
MobileNet-V2   92.3 / 10.0 / 10.0        2.3M      182M
†: For simplicity, we only keep one fully-connected layer of VGG-16.
TABLE II: Performance of the baseline models with SAF weight faults. 0/4%/8% denotes the sum of the SAF1 and SAF0 rates.
Model          Acc (0 / 4% / 8%)         #Params   #FLOPs
ResNet-20      94.7 / 64.8 / 17.8        11.2M     1110M
VGG-16         93.1 / 45.7 / 14.3        14.7M     626M
MobileNet-V2   92.3 / 26.2 / 11.7        2.3M      182M
In this paper, we employ NAS to discover fault-tolerant
neural network architectures against feature faults and weight
faults, and demonstrate the effectiveness by experiments. The
main contributions of this paper are as follows.
• We analyze the possible faults in various types of
NN accelerators (ASIC-based, FPGA-based, and RRAM-
based), and formalize the statistical fault models from
the algorithmic perspective. After the analysis, we adopt
the Multiply-Accumulate (MAC)-i.i.d Bit-Bias (MiBB)
model and the arbitrary-distributed Stuck-at-Fault (ad-
SAF) model in the neural architecture search for toler-
ating feature faults and weight faults, respectively.
• We establish a multi-objective neural architecture search
framework. On top of this framework, we propose two
methods to discover neural architectures with better
reliability: FT-NAS (NAS with a fault-tolerant multi-
objective), and FTT-NAS (NAS with a fault-tolerant
multi-objective and fault-tolerant training (FTT)).
• We employ FT-NAS and FTT-NAS to discover architectures for tolerating feature faults and weight faults. The discovered architectures, F-FTT-Net and W-FTT-Net, have comparable or fewer floating-point operations (FLOPs) and parameters, and achieve better fault resilience capabilities than the baselines. With the same fault settings, F-FTT-Net discovered under the feature fault model achieves an accuracy of 86.2% (vs. 68.1% achieved by MobileNet-V2), and W-FTT-Net discovered under the weight fault model achieves an accuracy of 69.6% (vs. 60.8% achieved by ResNet-20). The ability of W-FTT-Net to defend against several other types of weight faults is also illustrated by experiments.
• We analyze the discovered architectures, and discuss how the weight quantization range, the capacity of the model, and the connection pattern influence the fault resilience capability of a neural network.
¹ The random bit-bias feature fault model is formalized in Sec. III-D.
² The SAF weight fault model is formalized in Sec. III-E.
The rest of this paper is organized as follows. The related
studies and the preliminaries are introduced in Section II. In
Section III, we conduct comprehensive analysis on the possible
faults and formalize the fault models. In Section IV, we
elaborate on the design of the fault-tolerant NAS system. Then
in Section V, the effectiveness of our method is illustrated by
experiments, and the insights are also presented. Finally, we
discuss and conclude our work in Section VI and Section VII.
II. RELATED WORK AND PRELIMINARY
A. Convolutional Neural Network
Usually, a convolutional neural network is constructed by stacking multiple convolution layers and optional pooling layers, followed by fully-connected layers. Denoting the input feature map (IFM), before-activation output feature map, output feature map (OFM, i.e., activations), weights, and bias of the i-th convolution layer as $x_i$, $f_i$, $y_i$, $W_i$, $b_i$, the computation can be written as:
$$
\begin{aligned}
f_i &= W_i \circledast x_i + b_i \\
y_i &= g(f_i)
\end{aligned}
\tag{1}
$$
where $\circledast$ is the convolution operator and $g(\cdot)$ is the activation function, for which ReLU, $g(x) = \max(x, 0)$, is the most common choice. From now on, we omit the subscript $i$ for simplicity.
B. NN Accelerators and Fixed-point Arithmetic
With dedicated data flow design for efficient neural net-
work processing, FPGA-based NN accelerators could achieve
at least 10x better energy efficiency than GPUs [5], [24].
And ASIC-based accelerators could achieve even higher ef-
ficiency [4]. Besides, RRAM-based Computing Systems (RC-
Ses) are promising solutions for energy-efficient brain-inspired
computing [6], due to their capability of performing matrix-
vector-multiplications (MVMs) in memory. Existing studies
have shown RRAM-based Processing-In-Memory (PIM) ar-
chitectures can enhance the energy efficiency by over 100×
compared with both GPU and ASIC solutions, as they can
eliminate the large data movements of bandwidth-bounded
NN applications [6]. For the detailed and formal hardware
architecture descriptions, we refer the readers to the references
listed above.
Currently, fixed-point arithmetic units are implemented by
most of the NN accelerators, as 1) they consume much fewer
resources and are much more efficient than the floating-point
ones [24]; 2) NN models are proven to be insensitive to
quantization [5], [25]. Consequently, quantization is usually
applied before a neural network model is deployed onto the
edge devices. To keep consistent with the actual deploying
scenario, our simulation incorporates 8-bit dynamic fixed-point
quantization for the weights and activations. More specifically,
independent step sizes are used for the weights and activations
of different layers. Denoting the fraction length and bit-width of a tensor as $l$ and $Q$, the step size (resolution) of the representation is $2^{-l}$. For common CMOS platforms, in which two's complement representation is used for numbers, the representation range of both weights and features is
$$[-2^{Q-l},\; 2^{-l}(2^{Q} - 1)] \tag{2}$$
As for RRAM-based NN platforms, two separate crossbars are used for storing positive and negative weights [6]. Thus the representation range of the weights is
$$[-R_w, R_w] = [-2^{-l}(2^{Q+1} - 1),\; 2^{-l}(2^{Q+1} - 1)] \tag{3}$$
For the feature representation in RRAM-based platforms, by assuming that the Analog to Digital Converters (ADCs) and Digital to Analog Converters (DACs) have enough precision, and the CMOS bit-width is Q-bit, the representation range of features in CMOS circuits is
$$[-R_f, R_f] = [-2^{Q-l},\; 2^{-l}(2^{Q} - 1)] \tag{4}$$
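As an illustration of the dynamic fixed-point scheme described above, the following minimal sketch rounds a value to the step size $2^{-l}$ and clips it to the range of Eq. 2, taken literally as printed. The function names and the rule for picking a per-tensor fraction length (`choose_fraction_length`) are assumptions made for this example and are not taken from the paper.

```python
import math

def fixed_point_range(Q, l):
    """Bounds of Eq. 2 as printed: [-2^(Q-l), 2^-l * (2^Q - 1)]."""
    step = 2.0 ** (-l)
    return -(2 ** Q) * step, (2 ** Q - 1) * step

def quantize(x, Q, l):
    """Round x to the nearest multiple of 2^-l, then clip to the Eq. 2 range."""
    step = 2.0 ** (-l)
    lo, hi = fixed_point_range(Q, l)
    rounded = math.floor(x / step + 0.5) * step  # round half up
    return min(max(rounded, lo), hi)

def choose_fraction_length(values, Q):
    """Hypothetical per-tensor rule: the largest l whose upper bound still
    covers the largest magnitude in the tensor (an assumption of this sketch)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0
    return math.floor(math.log2((2 ** Q - 1) / max_abs))

# Each layer would call choose_fraction_length on its own weights or
# activations, which gives the independent step sizes mentioned in the text.
layer_weights = [0.5, -1.75, 0.3]
l = choose_fraction_length(layer_weights, Q=3)
quantized = [quantize(w, Q=3, l=l) for w in layer_weights]
```

A small `Q` is used here only to keep the numbers easy to check by hand; the simulation in the paper uses 8-bit quantization.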
C. Fault Resilience for CMOS-based Accelerators
[10], [26], [27] revealed that advanced nanotechnology
makes circuits more vulnerable to soft errors. Unlike hard
errors, soft errors do not damage the underlying circuits, but
instead trigger an upset of the logic state. The dominant cause of soft errors in CMOS circuits is radiation events, in which a single particle strikes an electronic device. [22], [28]
explored how the Single-Event Upset (SEU) faults impact the
FPGA-based CNN computation system.
TMR is a commonly used approach to mitigate SEUs [17],
[18], [19]. Traditional TMR methods are agnostic of the NN
applications and introduce large overhead. To exploit the NN
applications’ characteristics to reduce the overhead, one should
understand the behavior of NN models with computational
faults. [15] analyzed the layer-wise sensitivity of NN models
under two hypothetical feature fault models. [28] proposed to
only triplicate the vulnerable layers after layer-wise sensitivity
analysis and reduced the LUTs overhead for an NN model on
Iris Flower from about 200% (TMR) to 50%. [16] conducted
sensitivity analysis on the individual neuron level. [23] found
that the impacts and propagation of computational faults in an
NN computation system depend on the hardware data path,
the model topology, and the type of layers. These methods
analyzed the sensitivity of existing NN models at different
granularities and exploited the resilience characteristics to
reduce the hardware overhead for reliability. Our methods
are complementary and discover NN architectures with better
algorithmic resilience capability.
To avoid the accumulation of the persistent soft errors
in FPGA configuration registers, the scrubbing technique is
applied by checking and partially reloading the configuration
bits [17], [29]. From the algorithmic perspective, [21] demon-
strated the effectiveness of fault-tolerant training (FTT) in the
presence of SRAM bit failures.
D. Fault Resilience for RRAM-based Accelerators
RRAM devices suffer from many device faults [11], among which the commonly occurring SAFs are shown to cause severe degradation in the performance of mapped neural networks [12]. RRAM cells containing SAFs get stuck at the high-resistance state (SAF0) or the low-resistance state (SAF1), thereby causing the weight to be stuck at the lowest or highest magnitudes of the representation range, respectively. Besides the hard errors, resistance programming variation [30] is another source of faults for NN applications [31].
For the detection of SAFs, [32], [33] proposed fault detection methods that can provide high fault coverage, and [34] proposed an on-line fault detection method that can periodically detect the current distribution of faults.
Most of the existing studies on improving the fault resilience
ability of RRAM-based neural computation system focus on
designing the mapping and retraining methods. [12], [14],
[34], [35] proposed different mapping strategies and the corre-
sponding hardware redundancy design. After the distribution
detection of the faults and variations, they proposed to retrain (i.e., finetune) the NN model to tolerate the detected faults, which exploits the intrinsic fault resilience capability of NN models. To overcome the programming variations, [31]
calculated the calibrated programming target weights with the
log-normal resistance variation model, and proposed to map
sensitive synapses onto cells with small variations. From the
algorithmic perspective, [36] proposed to use error-correcting
output codes (ECOC) to improve the NN’s resilience capability
for tolerating resistance variations and SAFs.
E. Neural Architecture Search
Neural Architecture Search, as an automatic neural network
architecture design method, has been recently applied to design
model architectures for image classification and language
models [37], [38], [39]. The architectures discovered by NAS techniques have outperformed the manually designed ones. NASNet [37] used a recurrent neural network (RNN) controller to sample architectures,
trained them, and used the final validation accuracy to instruct
the learning of the controller. Instead of using reinforcement
learning (RL)-learned RNN as the controller, [39] used a
relaxed differentiable formulation of the neural architecture
search problem, and applied gradient-based optimizer for
optimizing the architecture parameters; [40] used evolutionary-
based methods for sampling new architectures, by mutating
the architectures in the population. Although NASNet [37] is powerful, the search process is extremely slow and computationally expensive. To address this pitfall, many methods have been proposed to speed up the performance evaluation in NAS. [41] incorporated learning curve extrapolation to predict the final performance after a few epochs of training; [40] sampled architectures using mutation on existing models and initialized the weights of the sampled architectures by inheriting from the parent model; [38] shared the weights among different sampled architectures, and used the shared weights to evaluate each sampled architecture.
The goal of the NAS problem is to discover the architecture
that maximizes some predefined objectives. The process of the
original NAS algorithm goes as follows. At each iteration, α
is sampled from the architecture search space A. This archi-
tecture is then assembled as a candidate network Net(α,w),
where w is the weights to be trained. After training the weights
w on the training data split Dt, the evaluated reward of the
candidate network on the validation data split Dv will be used
to instruct the sampling process. In its purest form, the NAS
problem can be formalized as:
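$$
\begin{aligned}
\max_{\alpha \in \mathcal{A}} \quad & E_{x_v \sim D_v}\left[ R(x_v, \mathrm{Net}(\alpha, w^*(\alpha))) \right] \\
\text{s.t.} \quad & w^*(\alpha) = \operatorname*{argmin}_{w} E_{x_t \sim D_t}\left[ L(x_t, \mathrm{Net}(\alpha, w)) \right]
\end{aligned}
\tag{5}
$$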
where ∼ is the sampling operator, Ex∼D[·] denotes the expec-
tation with respect to the data distribution D, R denotes the
evaluated reward used to instruct the sampling process, and
L denotes the loss criterion for back propagation during the
training of the weights w.
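The bi-level structure of Eq. 5 can be illustrated with a deliberately tiny toy problem. In the sketch below, everything is hypothetical and chosen only so that the code runs on its own: the "architecture" is the exponent of a one-parameter model, the inner training step is a closed-form least-squares fit, the reward is the negative validation error, and uniform random sampling without replacement stands in for a learned controller.

```python
import random

def net(alpha, w, x):
    # Hypothetical candidate network Net(alpha, w): y = w * x^alpha.
    return w * x ** alpha

def train_weights(alpha, train_split):
    # Inner problem: w*(alpha) = argmin_w of the squared loss (closed form).
    num = sum(y * x ** alpha for x, y in train_split)
    den = sum((x ** alpha) ** 2 for x, _ in train_split)
    return num / den

def reward(alpha, w, val_split):
    # Outer objective R: negative mean squared error on the validation split.
    errors = [(net(alpha, w, x) - y) ** 2 for x, y in val_split]
    return -sum(errors) / len(errors)

def search(space, train_split, val_split, seed=0):
    rng = random.Random(seed)
    best = None
    # Stand-in for the controller: visit every candidate once, in random order.
    for alpha in rng.sample(space, k=len(space)):
        w = train_weights(alpha, train_split)   # train the candidate
        r = reward(alpha, w, val_split)         # evaluate it on D_v
        if best is None or r > best[1]:
            best = (alpha, r, w)
    return best

# Toy data generated by y = 2 * x^2, so alpha = 2 is the best architecture.
train_split = [(x, 2.0 * x ** 2) for x in (0.5, 1.0, 1.5, 2.0)]
val_split = [(x, 2.0 * x ** 2) for x in (0.75, 1.25, 1.75)]
best_alpha, best_reward, best_w = search([1, 2, 3], train_split, val_split)
```

In a real NAS system the reward would be fed back to update the sampler rather than only tracking the best candidate, and the inner training step is the expensive part that motivates the acceleration techniques discussed in this section.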
Originally, for the performance evaluation of each sampled
architecture α, one needs to find the corresponding w∗(α)
by fully training the candidate network from scratch. This
process is extremely slow, and shared weights evaluation is
commonly used for accelerating the evaluation. In shared
weights evaluation, each candidate architecture α is a subgraph
of a super network and is evaluated using a subset of the super
network weights. The shared weights of the super network are
updated along the search process.
III. FAULT MODELS
In Sec. III-A, we motivate and discuss the formalization of application-level statistical fault models. Platform-specific analyses are conducted in Sec. III-B and Sec. III-C. Finally, the MAC-i.i.d Bit-Bias (MiBB) feature fault model and the arbitrary-distributed Stuck-at-Fault (adSAF) weight fault model, which are used in the neural architecture search process, are described in Sec. III-D and Sec. III-E. The analyses in this part are summarized in Fig. 4 (a) and Table III.
A. Application-Level Modeling of Computational Faults
Computational faults do not necessarily result in functional errors [10], [23]. For example, a neural network for classification tasks usually outputs a class probability vector, and our work regards it as a functional error if and only if the top-1 decision becomes different from the golden result. Due to the complexity of the NN computations and the different functional error definition, it is very inefficient to incorporate gate-level fault injection or propagation analysis into the training or architecture search process. Therefore, to evaluate and further boost the algorithmic resilience of neural networks to computational faults, application-level fault models should be formalized.
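The functional-error criterion stated above (a fault only counts when it changes the top-1 decision) can be written down directly. The helper names below are illustrative and are not taken from the paper.

```python
def top1(probs):
    # Index of the largest class probability.
    return max(range(len(probs)), key=lambda i: probs[i])

def is_functional_error(golden_probs, faulty_probs):
    # A computational fault is a functional error iff the top-1 class changes.
    return top1(golden_probs) != top1(faulty_probs)

def functional_error_rate(golden_batch, faulty_batch):
    # Fraction of samples whose top-1 decision differs from the golden result.
    pairs = list(zip(golden_batch, faulty_batch))
    return sum(is_functional_error(g, f) for g, f in pairs) / len(pairs)

golden = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
faulty = [[0.5, 0.4, 0.1],   # probabilities perturbed, decision unchanged
          [0.3, 0.6, 0.1]]   # decision flipped: a functional error
rate = functional_error_rate(golden, faulty)
```

The first faulty output shows why many computational faults are benign under this definition: the probabilities move, but the predicted class does not.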
From the algorithmic perspective, the faults fall into two
categories: weight faults and feature faults. In this section, we
analyze the possible faults in various types of NN accelerators,
and formalize the statistical feature and weight fault models.
A summary of these fault models is shown in Table III.
Note that we focus on the computational faults along the
datapath inside the NN accelerator that could be modeled
and mitigated from the algorithmic perspective. Faults in the
control units and other chips in the system are not considered.
See more discussion in the “limitation of application-level fault
models” section in Sec. VI.
B. Analysis of CMOS-based Platforms: ASIC and FPGA
[Figure 1: block diagram of the ASIC and FPGA components (control logic, on-chip memory used as weight/feature buffer, gate-based computational logic, LUT-based computational logic, configuration register, routing). Legend: bit-flip errors, modeled as i.i.d.-bit-flip (iBF); accumulated bias errors, modeled as i.i.d.-bit-bias (iBB); parts that are hard to model and mitigate from the algorithmic perspective are not considered in this paper.]
Fig. 1: The possible error positions in CMOS-based platforms.
The possible errors in CMOS-based platforms are illustrated
in Fig. 1. Soft errors that happen in the memory elements
or the logic elements could lead to transient faulty outputs
in ASICs. Compared with logic elements (e.g., combinational
logic gates, flip-flops), memory elements are more susceptible
to soft errors [27]. An unprotected SRAM cell usually has
a larger bit soft error rate (SER) than flip-flops. Since the probability of hard errors occurring is much smaller than that of soft errors, we focus on the analysis of soft errors, even though hard errors lead to permanent failures.
The soft errors in the weight buffer could be modeled as i.i.d. random bit-flips of the weights. Given the original value $x_0$, the distribution of a faulty value $x'$ under the random bit-flip (BF) model could be written as
$$
\begin{aligned}
& x' \sim \mathrm{BF}(x_0; p) \\
\text{indicates}\quad & x' = 2^{-l}\,(2^{l} x_0 \oplus e), \quad e = \sum_{q=1}^{Q} e_q 2^{q-1} \\
& e_q \sim \mathrm{Bernoulli}(p), \quad q = 1, \cdots, Q
\end{aligned}
\tag{6}
$$
where $e_q$ denotes whether a bit-flip occurs at bit position $q$, and $\oplus$ is the XOR operator.
By assuming that an error occurs at each bit with an i.i.d. bit SER of $r_s$, we know that each Q-bit weight has an i.i.d. probability $p_w$ of encountering an error, with $p_w = 1 - (1 - r_s)^Q \approx r_s \times Q$, as $r_s \times Q \ll 1$. It is worth noting that throughout the analysis, we assume that the SERs of all components are $\ll 1$, hence the error rate at each level is approximated as the sum of the error rates of the independent sub-components. As each weight encounters errors independently, a weight tensor is distributed as i.i.d. random bit-flip (iBF): $w \sim \mathrm{iBF}(w_0; r_s)$, where $w_0$ is the golden weights. [44] showed that the iBF model could capture the bit error behavior exhibited by real SRAM hardware.
TABLE III: Summary of the NN application-level statistical fault models, due to various types of errors on different platforms.
Headers: H/S refers to Hard/Soft errors; P/T refers to Persistent/Transient influences; F/W refers to Feature/Weight faults.
Platform        Error source     Error position        Logic component  H/S  P/T  Common mitigation              F/W  Simplified statistical model
RRAM            SAF              SB-cell               Crossbar         H    P    detection+3R [34], [14], [35]  W    w ∼ 1bit-adSAF(w_0; p_0, p_1)
RRAM            SAF              MB-cell               Crossbar         H    P    detection+3R [34], [14], [35]  W    w ∼ Qbit-adSAF(w_0; p_0, p_1)
RRAM            variations       MB-cell               Crossbar         S    P    PS loop [31], [42]             W    w ∼ LogNormal(w_0; σ), w ∼ ReciprocalNormal(w_0; σ)
FPGA/ASIC       SEE, overstress  SRAM                  Weight buffer    H    P    ECC                            W    w ∼ iBF(w_0; r_h × M_p(t))
FPGA/ASIC       SEE, VS          SRAM                  Weight buffer    S    T    ECC                            W    w ∼ iBF(w_0; r_s)
FPGA            SEE, overstress  LUTs                  PE               H    P    TMR [17], [18], [19]           F    f ∼ iBB(f_0; r_h × M_l × M_p(t))
FPGA            SEE, VS          LUTs                  PE               S    P    TMR, Scrubbing [29]            F    f ∼ iBB(f_0; r_s × M_l × M_p(t))
FPGA/ASIC/RRAM  SEE, overstress  SRAM                  Feature buffer   H    P    ECC                            F    y ∼ iBF(y_0; r_h × M_p(t))
FPGA/ASIC/RRAM  SEE, VS          SRAM                  Feature buffer   S    T    ECC                            F    y ∼ iBF(y_0; r_s)
ASIC            SEE, overstress  CL gates, flip-flops  PE               H    P    TMR, DICE [43]                 F    f ∼ iBB(f_0; r_lh × M_l × M_p(t))
ASIC            SEE, VS          CL gates, flip-flops  PE               S    T    TMR, DICE [43]                 F    f ∼ iBB(f_0; r_ls × M_l)
Notations: w, f, y refer to the weights, before-activation features, and after-activation features of a convolution; p_0, p_1 refer to the SAF0 and SAF1 rates of RRAM cells; σ refers to the standard deviation of RRAM programming variations; r_s, r_h refer to the soft and hard error rates of memory elements, respectively; r_ls, r_lh refer to the soft and hard error rates of logic elements, respectively; M_l is an amplifying coefficient for feature error rate due to multiple involved computational components; M_p(t) > 1 is a coefficient that abstracts the error accumulation effects over time.
Abbreviations: SEE refers to Single-Event Errors, e.g. Single-Event Burnout (SEB), Single-Event Upset (SEU), etc.; “overstress” includes conditions such as high temperature, voltage or physical stress; VS refers to voltage (down)scaling that is used for energy efficiency; SB-cell and MB-cell refer to single-bit and multi-bit memristor cells, respectively; CL gates refer to combinational logic gates; 3R refers to various Redundancy schemes and corresponding Remapping/Retraining techniques; PS loop refers to the programming-sensing loop during memristor programming; TMR refers to Triple Modular Redundancy; DICE refers to Dual Interlocked Cell.
The soft errors in the feature buffer are modeled similarly as i.i.d. random bit-flips, with a fault probability of approximately $r_s \times Q$ for Q-bit feature values. The distribution of the output feature map (OFM) values could be written as $y \sim \mathrm{iBF}(y_0; r_s)$, where $y_0$ is the golden results.
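The random bit-flip model of Eq. 6 and its i.i.d. extension to a whole buffer can be sketched as follows. This is a minimal illustration that assumes the stored code is a Q-bit two's-complement integer; the function names are chosen for this example and are not from the paper.

```python
import random

def bit_flip(x0, p, Q, l, rng):
    """Sample x' ~ BF(x0; p): XOR the Q-bit code of x0 with a random mask e."""
    mask = (1 << Q) - 1
    code = round(x0 * 2 ** l) & mask       # assumed Q-bit two's-complement code
    e = 0
    for q in range(1, Q + 1):
        if rng.random() < p:               # e_q ~ Bernoulli(p)
            e |= 1 << (q - 1)
    code ^= e
    if code >= 1 << (Q - 1):               # reinterpret the code as signed
        code -= 1 << Q
    return code * 2.0 ** (-l)

def ibf(values, r_s, Q, l, rng):
    """Sample a buffer from iBF(values; r_s): independent bit-flips per value."""
    return [bit_flip(v, r_s, Q, l, rng) for v in values]

def word_error_prob(r_s, Q):
    """Exact probability that a Q-bit word has at least one flipped bit."""
    return 1.0 - (1.0 - r_s) ** Q

rng = random.Random(0)
faulty_weights = ibf([0.75, -1.0, 0.25], r_s=1e-3, Q=8, l=2, rng=rng)
```

For small rates the exact word error probability is very close to the approximation $r_s \times Q$ used in the text, for example `word_error_prob(1e-6, 8)` differs from `8e-6` by far less than one part in a thousand.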
FPGA-based implementations are often more vulnerable to soft errors than their ASIC counterparts [45]. Since the majority of the space on an FPGA chip is filled with memory cells, the overall SER is much higher. Moreover, the soft errors occurring in logic configuration bits would lead to persistent faulty computation, rather than transient faults as in ASIC logic. Persistent errors cannot be mitigated by simple retry methods and would lead to statistically significant performance degradation. Moreover, since the persistent errors would accumulate if no correction is made, the equivalent error rate would keep increasing as time goes on. We abstract this effect with a monotonically increasing function $M_p(t) \geq 1$, where the subscript $p$ denotes “persistent”, and $t$ denotes the time.
Let us recap how one convolution is mapped onto the FPGA-based accelerator, to see what the configuration bit errors could cause on the OFM values. If the dimension of the convolution kernel is $(c, k, k)$ (channel, kernel height, kernel width, respectively), there are $ck^2 - 1 \approx ck^2$ additions needed for the computation of a feature value. We assume that the add operations are spatially expanded onto adder trees constructed by LUTs, i.e., no temporal reusing of adders is used for computing one feature value. That is to say, the add operations are mapped onto different hardware adders³, and encounter errors independently. The per-feature error rate could be approximated by the adder-wise SER times $M_l$, where $M_l \approx ck^2$. Now, let us dive into the adder-level computation. In a 1-bit adder with scale $s$, a bit-flip in one LUT bit would add a bias $\pm 2^s$ to the output value, if the input bit signals match the address of this LUT bit. If each LUT cell has an i.i.d. SER of $r_s$, in a $Q'$-bit adder, denoting the fraction length of the operands and result as $l'$, the distribution of the faulty output $x'$ with the random bit-bias (BB) faults could be written as
$$
\begin{aligned}
& x' \sim \mathrm{BB}(x_0; p, Q', l') \\
\text{indicates}\quad & x' = x_0 + e, \quad e = 2^{-l'} \sum_{q=1}^{Q'} (-1)^{\beta_q} 2^{q-1} e_q \\
& e_q \sim \mathrm{Bernoulli}(p) \\
& \beta_q \sim \mathrm{Bernoulli}(0.5), \quad q = 1, \cdots, Q'
\end{aligned}
\tag{7}
$$
As for the result of the adder tree constructed by multiple LUT-based adders, since the probability that multiple bit-bias errors co-occur is orders of magnitude smaller, we ignore the accumulation of the biases that are smaller than the OFM quantization resolution $2^{-l}$. Consequently, the OFM feature values before the activation function follow the i.i.d. Random Bit-Bias distribution $f \sim \mathrm{iBB}(f_0; r_s \times M_l \times M_p(t), Q, l)$, where $Q$ and $l$ are the bit-width and fraction length of the OFM values, respectively.
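The random bit-bias model of Eq. 7 differs from a bit-flip in that each faulty bit position adds a signed power-of-two bias instead of toggling a stored bit. A minimal sketch follows; the function names are illustrative, and the per-feature rate $r_s \times M_l \times M_p(t)$ is passed in as a single number `rate`.

```python
import random

def bit_bias(x0, p, Q, l, rng):
    """Sample x' ~ BB(x0; p, Q, l) following Eq. 7."""
    e = 0
    for q in range(1, Q + 1):
        e_q = rng.random() < p            # e_q ~ Bernoulli(p)
        beta_q = rng.random() < 0.5       # beta_q ~ Bernoulli(0.5)
        if e_q:
            e += (-1 if beta_q else 1) * 2 ** (q - 1)
    return x0 + e * 2.0 ** (-l)

def ibb(features, rate, Q, l, rng):
    """Sample before-activation features from iBB(features; rate, Q, l)."""
    return [bit_bias(f, rate, Q, l, rng) for f in features]

rng = random.Random(0)
# Example rate for a (c, k, k) = (16, 3, 3) kernel with r_s = 1e-6 and M_p(t) = 1.
rate = 1e-6 * (16 * 3 * 3) * 1.0
faulty_features = ibb([1.5, -0.25, 3.0], rate, Q=8, l=2, rng=rng)
```

Unlike a bit-flip, the injected bias does not depend on the original value, which is why the same LUT fault can push a feature either up or down.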
³ See more discussion in the “hardware” section in Sec. VI.
We can make an intuitive comparison of the equivalent feature error rates induced by LUT soft errors and feature buffer soft errors. As the majority of FPGAs are SRAM-based, considering the bit SER $r_s$ of a LUT cell and a BRAM cell to be close, we can see that the feature error rate induced by LUT errors is amplified by $M_l \times M_p(t)$. Since, as we have discussed, $M_p(t) \geq 1$ and $M_l = ck^2 > 1$, the performance degradation induced by LUT errors could be significantly larger than that induced by feature buffer errors.
C. Analysis of PIM-based Platforms: RRAM as an example
In an RRAM-based Computing System (RCS), compared with the accompanying CMOS circuits, the RRAM crossbar is much more vulnerable to various non-ideal factors. In multi-bit RRAM cells, studies have shown that the distribution of the resistance due to programming variance is either Gaussian or Log-Normal [30]. As each weight is programmed as the conductance of the memristor cell, the weight could be seen as being distributed as Reciprocal-Normal or Log-Normal. Besides the soft errors, common hard errors such as SAFs, caused by fabrication defects or limited endurance, could result in severe performance degradation [12]. SAFs occur frequently in today's RRAM crossbars: as reported by [11], the overall SAF ratio could be larger than 10% ($p_1 = 9.04\%$ for SAF1 and $p_0 = 1.75\%$ for SAF0) in a fabricated RRAM device. The statistical models of SAFs in single-bit and multi-bit RRAM devices are formalized in Sec. III-E.
As the RRAM crossbars also serve as the computation units,
some non-ideal factors (e.g., IR-drop, wire resistance) could
be abstracted as feature faults. They are not considered in this
work since the modeling of these effects highly depends on the
implementation (e.g., crossbar dimension, mapping strategy)
and hardware-in-the-loop testing [20].
D. Feature Fault Model
As analyzed in Sec. III-B, the soft errors in LUTs are the more pernicious source of feature faults, as 1) the SER is usually much higher than the hard error rate: $r_s \gg r_h$, 2) these errors are persistent if no correction is made, and 3) the per-feature equivalent error rate is amplified as multiple adders are involved. Therefore, we use the iBB fault model in our exploration of mitigating feature faults.
We have $f \sim \mathrm{iBB}(f_0; r_s M_l M_p(t))$, where $M_l = ck^2$, and the probability of an error occurring at every position in the OFM is $p = r_s M_l M_p(t) Q = p_m M_l$, where $p_m = r_s Q M_p(t)$ is defined as the per-MAC error rate. Denoting the dimension of the OFM as $(C_o, H, W)$ (channel, height, and width, respectively) and the dimension of each convolution kernel as $(c, k, k)$, the computation of a convolution layer under this fault model could be written as
$$
\begin{aligned}
y &= g\left(W \circledast x + b + \theta \cdot 2^{\alpha - l} \cdot (-1)^{\beta}\right) \\
\text{s.t.}\quad \theta &\sim \mathrm{Bernoulli}(p)^{C_o \times H \times W} \\
\alpha &\sim U\{0, \cdots, Q - 1\}^{C_o \times H \times W} \\
\beta &\sim U\{0, 1\}^{C_o \times H \times W}
\end{aligned}
\tag{8}
$$
where $\circledast$ denotes convolution, $\theta$ is the mask indicating whether an error occurs at each feature map position, $\alpha$ represents the bit position of the bias, and $\beta$ represents the bias sign. Note that this formulation is not equivalent to the random bit-bias formalization in Eq. 7, and is adopted for efficient simulation. These two formulations are close when the odds that two errors take effect simultaneously are small ($p_m / Q \ll 1$). This fault model is referred to as the MAC-i.i.d Bit-Bias model (abbreviated as MiBB). An example of injecting feature faults is illustrated in Fig. 2.
Fig. 2: An example of injecting feature faults.
Intuitively, convolution computation that needs fewer MACs
might be more immune to the faults, as the equivalent error
rate at each OFM location is lower.
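The MiBB injection of Eq. 8 can be sketched in a few lines of Python. This is only an illustrative sketch, not the fault injection framework used in our experiments: the function name `inject_mibb`, the flat-list representation of the pre-activation OFM, and the parameter names are assumptions made for the example. It also makes the dependence on the MAC count explicit, since the per-position probability is p = p_m · M_l.

```python
import random

def inject_mibb(ofm, p_m, macs_per_output, q_bits=8, frac_len=0, rng=random):
    """Add a signed power-of-two bias to some entries of a flat OFM list (Eq. 8).

    ofm: pre-activation values (hypothetical flat-list layout).
    p_m: per-MAC error rate; macs_per_output: M_l = c * k^2.
    """
    p = min(1.0, p_m * macs_per_output)       # per-position probability p = p_m * M_l
    faulty = []
    for value in ofm:
        if rng.random() < p:                  # theta ~ Bernoulli(p)
            alpha = rng.randrange(q_bits)     # alpha ~ U{0, ..., Q-1}
            beta = rng.randrange(2)           # beta ~ U{0, 1}
            value = value + 2.0 ** (alpha - frac_len) * (-1) ** beta
        faulty.append(value)
    return faulty
```

For instance, `inject_mibb(ofm, 1e-4, 20 * 3 * 3)` would perturb each position with probability 0.018, whereas the same call with 9 MACs per output would do so with probability 9e-4.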
E. Weight Fault Model
As RRAM-based accelerators suffer from a much higher
weight error rate than CMOS-based ones, the Stuck-at-Faults
in RRAM crossbars are mainly considered for the setup
of the weight fault model. We assume the underlying platform
is RRAM with multi-bit cells, and adopt the commonly-used
mapping scheme, in which separate crossbars are used for
storing positive and negative weights [6]. That is to say,
when an SAF0 fault causes a cell to be stuck at HRS, the
corresponding logical weight would be stuck at 0. When an
SAF1 fault causes a cell to be stuck at LRS, the weight would
be stuck at −Rw if it is negative, or Rw otherwise.
The computation of a convolution layer under the SAF
weight fault model could be written as
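$$
\begin{aligned}
y &= g\left(W' \circledast x + b\right) \\
\text{s.t.}\quad W' &= (1 - \theta) \cdot W + \theta \cdot e \\
\theta &\sim \mathrm{Bernoulli}(p_0 + p_1)^{C_o \times c \times k \times k} \\
m &\sim \mathrm{Bernoulli}\left(\frac{p_1}{p_0 + p_1}\right)^{C_o \times c \times k \times k} \\
e &= R_w \,\mathrm{sgn}(W) \cdot m
\end{aligned}
\tag{9}
$$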
where Rw refers to the representation bound in Eq. 3, θ is the
mask indicating whether a fault occurs at each weight position,
m is the mask representing the SAF types (SAF0 or SAF1) at
faulty weight positions, and e is the mask representing the faulty
target values (0 or ±Rw). Every single weight has an i.i.d
probability of p0 to be stuck at 0, and p1 to be stuck at the
positive or negative bound of the representation range, for
positive and negative weights, respectively. An example of
injecting weight faults is illustrated in Fig. 3.
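As an illustration of Eq. 9, the following Python sketch applies adSAF faults to a flat list of weights. The function name `inject_adsaf` and the flat-list layout are assumptions for the example. Instead of drawing θ and m separately, the sketch draws a single uniform number per weight, which gives the same marginal probabilities (p0 for SAF0, p1 for SAF1). Following the text of the mapping scheme, a non-negative weight hit by SAF1 is set to +Rw.

```python
import random

def inject_adsaf(weights, p0, p1, r_w, rng=random):
    """Apply i.i.d. stuck-at faults to a flat list of weights (sketch of Eq. 9)."""
    faulty = []
    for w in weights:
        u = rng.random()
        if u < p0:                       # SAF0: cell stuck at HRS, weight stuck at 0
            w = 0.0
        elif u < p0 + p1:                # SAF1: cell stuck at LRS, weight stuck at +/- R_w
            w = -r_w if w < 0 else r_w
        faulty.append(w)
    return faulty
```

Calling `inject_adsaf(weights, 0.067, 0.013, r_w)` would correspond to the 8% overall SAF ratio used in the experiments of Sec. V-C.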
Note that the weight fault model, referred to as the arbitrary-distributed
Stuck-at-Fault model (adSAF), is much harder to
defend against than SAF faults with a specific known defect
map. A neural network model that behaves well under the
adSAF model is expected to achieve high reliability across
different specific SAF defect maps.
Fig. 3: An example of injecting weight faults.
The above adSAF fault model assumes that the underlying
hardware consists of multi-bit RRAM devices. adSAFs in single-bit
RRAM devices are also of interest. In single-bit RRAM
devices, multiple bits of one weight value are mapped onto
different crossbars, of which the results would be shifted and
added together [46]. In this case, an SAF fault that occurs in
a cell would cause the corresponding bit of the corresponding
weight to be stuck at 0 or 1. The effects of adSAF faults on a
weight value in single-bit RRAM devices can be formulated
as
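$$
\begin{aligned}
w' &= \mathrm{sgn}(w)\, 2^{-l} \left( \left( (\neg\theta) \wedge 2^{l}|w| \right) \vee \left( \theta \wedge e \right) \right) \\
\theta &= \sum_{q=1}^{Q} \theta_q 2^{q-1}, \quad e = \sum_{q=1}^{Q} m_q 2^{q-1} \\
\theta_q &\overset{iid}{\sim} \mathrm{Bernoulli}(p_0 + p_1), \quad q = 1, \cdots, Q \\
m_q &\overset{iid}{\sim} \mathrm{Bernoulli}\left(\frac{p_1}{p_0 + p_1}\right), \quad q = 1, \cdots, Q
\end{aligned}
\tag{10}
$$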
where the binary representation of θ indicates whether a fault
occurs at each bit position, and the binary representation of e
represents the target faulty values (0 or 1) at each bit position if a fault
occurs. We will demonstrate that the architecture discovered
under the multi-bit adSAF fault model can also defend against
single-bit adSAF faults and iBF weight faults caused by errors
in the weight buffers of CMOS-based accelerators.
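The bit-level operations of Eq. 10 can be made concrete with integer masks. The sketch below is illustrative only: the function name `inject_1bit_adsaf`, the default bit-width, and the rounding of 2^l|w| to an integer code are assumptions for the example. One uniform draw per bit decides whether the bit is healthy, stuck at 0 (probability p0), or stuck at 1 (probability p1), which matches the marginals of θ_q and m_q.

```python
import random

def inject_1bit_adsaf(w, p0, p1, q_bits=8, frac_len=0, rng=random):
    """Apply per-bit stuck-at faults to one fixed-point weight (sketch of Eq. 10)."""
    magnitude = int(round(abs(w) * 2 ** frac_len))   # integer code 2^l * |w|
    theta = 0                                        # bit q set: fault at bit q
    e = 0                                            # bit q set: faulty target value is 1
    for q in range(q_bits):
        u = rng.random()
        if u < p0 + p1:
            theta |= 1 << q
            if u >= p0:                              # SAF1: bit stuck at 1
                e |= 1 << q
    all_ones = (1 << q_bits) - 1
    faulty = (magnitude & (all_ones ^ theta)) | (theta & e)
    sign = (w > 0) - (w < 0)                         # sgn(w), with sgn(0) = 0 as in Eq. 10
    return sign * faulty / 2 ** frac_len
```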
IV. FAULT-TOLERANT NAS
In this section, we present the FTT-NAS framework. We first
give the problem formalization and framework overview in
Sec. IV-A. Then, the search space and the sampling and assembling
process are described in Sec. IV-B and Sec. IV-C, respectively.
Finally, the search process is elaborated in Sec. IV-D.
A. Framework Overview
Denoting the fault distribution characterized by the fault
models as F , the neural network search for fault tolerance
can be formalized as
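$$
\begin{aligned}
\max_{\alpha \in \mathcal{A}} \quad & \mathbb{E}_{x_v \sim D_v}\left[ \mathbb{E}_{f \sim F}\left[ R(x_v, \mathrm{Net}(\alpha, w^*(\alpha)), f) \right] \right] \\
\text{s.t.} \quad & w^*(\alpha) = \operatorname*{argmin}_{w} \mathbb{E}_{x_t \sim D_t}\left[ \mathbb{E}_{f \sim F}\left[ L(x_t, \mathrm{Net}(\alpha, w), f) \right] \right]
\end{aligned}
\tag{11}
$$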
As the cost of finding the best weights w∗ for each architecture
α is almost unbearable, we use a shared-weights based
evaluator, in which shared weights are directly used to evaluate
sampled architectures. The resulting method, FTT-NAS, solves
this NAS problem approximately. FT-NAS can be viewed as a
degraded special case of FTT-NAS,
in which no fault is injected in the inner optimization of finding
w∗(α).
The overall neural architecture search (NAS) framework is
illustrated in Fig. 4 (b). There are multiple components in the
framework: a controller that samples different architecture
rollouts from the search space; a candidate network that is
assembled by taking out the corresponding subset of weights
from the super-net; and a shared-weights based evaluator that
evaluates the performance of different rollouts on the CIFAR-10
dataset, using fault-tolerant objectives.
B. Search Space
The design of the search space is as follows: We use
a cell-based macro architecture, similar to the one used in
[38], [39]. There are two types of cells: normal cell, and
reduction cell with stride 2. All normal cells share the same
connection topology, while all reduction cells share another
connection topology. The layout and connections between cells
are illustrated in Fig. 5.
In every cell, there are B nodes. Node 1 and node 2 are
treated as the cell’s inputs, which are the outputs of the two
previous cells. For each of the other B − 2 nodes, two incoming
connections will be selected and element-wise added.
For each connection, the 11 possible operations are: none; skip
connect; 3x3 average (avg.) pool; 3x3 max pool; 1x1 Conv;
3x3 ReLU-Conv-BN block; 5x5 ReLU-Conv-BN block; 3x3
SepConv block; 5x5 SepConv block; 3x3 DilConv block; 5x5
DilConv block.
The complexity of the search space can be estimated. For
each cell type, there are (11^{B−2} × (B − 1)!)^2 possible choices.
As there are two independent cell types, there are (11^{B−2} ×
(B − 1)!)^4 possible architectures in the search space, which is
roughly 9.5 × 10^{24} with B = 6 in our experiments.
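The estimate above can be checked numerically. The short Python sketch below (function names are chosen for illustration) evaluates the closed form and cross-checks it against a direct count, in which node i picks two of the i − 1 preceding nodes (with repetition allowed) and one of 11 operations per connection.

```python
from math import factorial

def num_cell_choices(num_nodes, num_ops=11):
    """Closed form (11^(B-2) * (B-1)!)^2 for one cell type."""
    return (num_ops ** (num_nodes - 2) * factorial(num_nodes - 1)) ** 2

def num_cell_choices_direct(num_nodes, num_ops=11):
    """Direct count: node i has ((i - 1) * num_ops)^2 choices for its two connections."""
    total = 1
    for node in range(3, num_nodes + 1):
        total *= ((node - 1) * num_ops) ** 2
    return total

def search_space_size(num_nodes, num_ops=11):
    """Two independent cell types (normal and reduction)."""
    return num_cell_choices(num_nodes, num_ops) ** 2

print(f"{search_space_size(6):.2e}")
```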
C. Sampling and Assembling Architectures
In our experiments, the controller is a recurrent neural
network (RNN), and the performance evaluation is based on
a super network with shared weights, as used by [38].
An example of the sampled cell architecture is illustrated in
Fig. 6. Specifically, to sample a cell architecture, the controller
RNN samples B − 2 blocks of decisions, one for each node
3, · · · , B. In the decision block for node i, M = 2 input nodes
are sampled from 1, · · · , i − 1, to be connected with node i.
Then M operations are sampled from the 11 basic operation
primitives, one for each of the M connections. Note that the
two sampled input nodes can be the same node j, which will
result in two independent connections from node j to node i.
During the search process, the architecture assembling pro-
cess using the shared-weights super network is straightfor-
ward [38]: Just take out the weights from the super network
corresponding to the connections and operation types of the
sampled architecture.
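The structure of one sampled cell can be sketched as follows. Note that this sketch replaces the RNN controller with a uniform random sampler, purely to show the shape of a rollout; the function name `sample_cell`, the tuple layout, and the operation identifier `none` are assumptions for the example (the other operation names follow the labels used in Fig. 7 and Fig. 8).

```python
import random

OPS = ["none", "skip_connect", "avg_pool_3x3", "max_pool_3x3", "conv_1x1",
       "relu_conv_bn_3x3", "relu_conv_bn_5x5", "sep_conv_3x3", "sep_conv_5x5",
       "dil_conv_3x3", "dil_conv_5x5"]

def sample_cell(num_nodes=6, num_inputs=2, rng=random):
    """Return a list of (input_node, output_node, operation) connections.

    Uniform sampling stands in for the learned controller distribution.
    """
    cell = []
    for node in range(3, num_nodes + 1):          # nodes 1 and 2 are the cell inputs
        for _ in range(num_inputs):               # M = 2 connections per node
            src = rng.randint(1, node - 1)        # the same source may be drawn twice
            op = rng.choice(OPS)
            cell.append((src, node, op))
    return cell
```

With B = 6 and M = 2, a rollout consists of 8 connections; assembling the candidate network then amounts to looking up the shared weights indexed by each `(src, node, op)` triple.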
Fig. 4: Illustration of the overall workflow. (a) The setup of the application-level statistical fault models. (b) The FTT-NAS
framework. (c) The final fault-tolerant training stage.
D. Searching for Fault-Tolerant Architecture
The FTT-NAS algorithm is illustrated in Alg. 1. To search
for a fault-tolerant architecture, we use a weighted sum of the
clean accuracy and the accuracy with fault injection as the
reward to instruct the training of the controller:
$$R = (1 - \alpha_r) \cdot \mathrm{acc}_c + \alpha_r \cdot \mathrm{acc}_f \tag{12}$$
where acc_f is calculated by injecting faults following the fault
distribution described in Sec. III. For the optimization of the
controller, we employ the Adam optimizer [47] to optimize
the REINFORCE [48] objective, together with an entropy
encouraging regularization.
In every epoch of the search process, we alternately train
the shared weights and the controller on separate data splits Dt
and Dv, respectively. For the training of the shared weights,
we carried out experiments under two different settings: without
and with FTT. When training with FTT, a weighted sum of the
clean cross entropy loss CE_c and the cross entropy loss with
fault injection CE_f is used to train the shared weights. The
FTT loss can be written as
$$L = (1 - \alpha_l) \cdot CE_c + \alpha_l \cdot CE_f \tag{13}$$
As shown in lines 7-12 in Alg. 1, in each step of training
the shared weights, we sample an architecture α using the current
controller, then backpropagate using the FTT loss to update
the parameters of the candidate network. Training without FTT
(in FT-NAS) is a special case with αl = 0.
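Both objectives are plain convex combinations, which the following Python sketch makes explicit. The function names are illustrative, and the inputs are assumed to be already-computed scalar accuracies or losses.

```python
def search_reward(acc_clean, acc_faulty, alpha_r):
    """Controller reward of Eq. 12: weighted sum of clean and faulty accuracy."""
    return (1.0 - alpha_r) * acc_clean + alpha_r * acc_faulty

def ftt_loss(ce_clean, ce_faulty, alpha_l):
    """Shared-weights loss of Eq. 13; alpha_l = 0 recovers training without FTT."""
    return (1.0 - alpha_l) * ce_clean + alpha_l * ce_faulty
```

Setting `alpha_l=0` makes `ftt_loss` return the clean cross entropy, which is the FT-NAS setting; the reward can still weight the faulty accuracy through `alpha_r`.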
As shown in lines 15-20 in Alg. 1, in each step of training
the controller, we sample an architecture from the controller,
assemble this architecture using the shared weights, and then
get the reward R on one data batch in Dv. Finally, the reward
is used to update the controller by applying the REINFORCE
technique [48], with the reward baseline denoted as b.
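The controller update in line 20 of Alg. 1 can be illustrated on a toy problem. The sketch below is a simplification under stated assumptions: the policy is a single categorical distribution over a few actions rather than an RNN, the update is plain gradient ascent rather than Adam, the entropy regularization is omitted, and the names `softmax`, `reinforce_step`, and `reward_fn` are hypothetical. The baseline is tracked with a moving average.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def reinforce_step(logits, reward_fn, baseline, lr=0.1, momentum=0.99, rng=random):
    """One REINFORCE update: theta <- theta + lr * (R - b) * grad log pi(a; theta)."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]   # a ~ pi(a; theta)
    reward = reward_fn(action)
    advantage = reward - baseline                                # R - b
    # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - probs[k].
    new_logits = [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]
    new_baseline = momentum * baseline + (1.0 - momentum) * reward
    return new_logits, new_baseline, action
```

In the actual search, `reward_fn` would correspond to evaluating Eq. 12 on one batch of Dv for the sampled architecture.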
V. EXPERIMENTS
In this section, we demonstrate the effectiveness of the FTT-NAS
framework and analyze the discovered architectures under
different fault models. First, we introduce the experiment
setup in Sec. V-A. Then, the effectiveness under the feature
and weight fault models is shown in Sec. V-B and Sec. V-C,
respectively. The effectiveness of the learned controller is
illustrated in Sec. V-D. Finally, the analyses and illustrative
experiments are presented in Sec. V-E.
Fig. 5: Illustration of the search space design. Left: The
layout and connections between cells. Right: The possible
connections in each cell, and the possible operation types on
every connection.
Fig. 6: An example of the sampled cell architecture.
A. Setup
Our experiments are carried out on the CIFAR-10 [49]
dataset. CIFAR-10 is one of the most commonly used com-
puter vision datasets and contains 60000 32×32 RGB images.
Three manually designed architectures VGG-16, ResNet-20,
and MobileNet-V2 are chosen as the baselines. 8-bit dynamic
fixed-point quantization is used throughout the search and
training process, and the fraction length is found following
the minimal-overflow principle.
Algorithm 1 FTT-NAS
 1: EPOCH: the total search epochs
 2: w: shared weights in the super network
 3: θ: the parameters of the controller
 4: epoch = 0
 5: while epoch < EPOCH do
 6:   for all x_t, y_t ∼ D_t do
 7:     a ∼ π(a; θ)
 8:     f ∼ F(f)
 9:     L_c = CE(Net(a; w)(x_t), y_t)      # clean cross entropy
10:     L_f = CE(Net(a; w)(x_t, f), y_t)   # faulty cross entropy
11:     L(x_t, y_t, Net(a; w), f) = (1 − α_l) L_c + α_l L_f
12:     w = w − η_w ∇_w L
13:   end for
14:   for all x_v, y_v ∼ D_v do
15:     a ∼ π(a; θ)
16:     f ∼ F(f)
17:     R_c = Acc(Net(a; w)(x_v), y_v)     # clean accuracy
18:     R_f = Acc(Net(a; w)(x_v, f), y_v)  # faulty accuracy
19:     R(x_v, y_v, Net(a; w), f) = (1 − α_r) ∗ R_c + α_r ∗ R_f
20:     θ = θ + η_θ (R − b) ∇_θ log π(a; θ)
21:   end for
22:   epoch = epoch + 1
23:   schedule η_w, η_θ
24: end while
25: return a ∼ π(a; θ)

In the neural architecture search process, we split the training
dataset into two subsets. 80% of the training data is used
to train the shared weights, and the remaining 20% is used to
train the controller. The super network is an 8-cell network,
with all the possible connections and operations. The channel
number of the first cell is set to 20 during the search process,
and the channel number increases by 2 upon every reduction
cell. The controller network is an RNN with one hidden layer
of size 100. The learning rate for training the controller is 1e-3.
The reward baseline b is updated using a moving average with
momentum 0.99. To encourage exploration, we add an entropy
encouraging regularization to the controller’s REINFORCE
objective, with a coefficient of 0.01. For training the shared
weights, we use an SGD optimizer with momentum 0.9 and
weight decay 1e-4. The learning rate is scheduled by a cosine
annealing scheduler [50], starting from 0.05. Each architecture
search process is run for 100 epochs. Note that all these are
typical settings that are similar to [38]. We build the neural
architecture search framework and fault injection framework
upon the PyTorch framework.
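For reference, the learning-rate schedule for the shared weights can be written in closed form. The sketch below assumes a single cosine cycle over the 100 search epochs with no warm restarts and a final learning rate of 0; both are assumptions, since only the starting value 0.05 is stated above. The function name `cosine_lr` is illustrative.

```python
import math

def cosine_lr(epoch, total_epochs=100, base_lr=0.05, min_lr=0.0):
    """Cosine annealing from base_lr (epoch 0) down to min_lr (epoch = total_epochs)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)
```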
B. Defend Against MiBB Feature Faults
As described in Sec. IV, we conduct neural architecture
search without and with fault-tolerant training (i.e., FT-NAS
and FTT-NAS, respectively). The per-MAC injection probability
pm used in the search process is 1e-4. The reward
coefficient αr in Eq. 12 is set to 0.5. In FTT-NAS, the
loss coefficient αl in Eq. 13 is also set to 0.5. As the
baselines for FT-NAS and FTT-NAS, we train ResNet-20,
VGG-16, and MobileNet-V2 with both normal training and FTT.
For each model trained with FTT, we successively try the per-MAC
fault injection probabilities pm in {3e-4, 1e-4, 3e-5},
and use the largest injection probability with which the model
could achieve a clean accuracy above 50%. Consequently,
ResNet-20 and VGG-16 are trained with a per-MAC fault
injection probability of 1e-4 and 3e-5, respectively.

TABLE IV: Comparison of different architectures under the MiBB feature fault model
                                   Clean         Accuracy with feature faults (%)
Arch           Training†   accuracy (%)   3e-6   1e-5   3e-5   1e-4   3e-4   #FLOPs   #Params
ResNet-20      clean               94.7   89.1   63.4   11.5   10.0   10.0    1110M    11.16M
VGG-16         clean               93.1   78.2   21.4   10.0   10.0   10.0     626M    14.65M
MobileNet-V2   clean               92.3   10.0   10.0   10.0   10.0   10.0     182M     2.30M
F-FT-Net       clean               91.0   71.3   22.8   10.0   10.0   10.0     234M     0.61M
ResNet-20      pm=1e-4             79.2   79.1   79.6   78.9   60.6   11.3    1110M    11.16M
VGG-16         pm=3e-5             83.5   82.4   77.9   50.7   11.1   10.0     626M    14.65M
MobileNet-V2   pm=3e-4             71.2   70.3   69.0   68.7   68.1   47.8     182M     2.30M
F-FTT-Net      pm=3e-4             88.6   88.7   88.5   88.0   86.2   51.0     245M     0.65M
†: As also noted in the main text, for all the FTT trained models, we successively try per-MAC fault injection
probability pm in {3e-4, 1e-4, 3e-5}, and use the largest injection probability with which the model could achieve
a clean accuracy above 50%.

Fig. 7: The discovered cell architectures under the MiBB feature fault model. (a) Normal cell. (b) Reduction cell.
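The rule used to choose the FTT injection probability for each baseline can be sketched as a short loop. Here `train_and_eval` is a hypothetical callback that trains a model with the given pm and returns its clean accuracy in percent; it stands in for a full FTT run and is not part of any released code.

```python
def pick_injection_probability(train_and_eval,
                               candidates=(3e-4, 1e-4, 3e-5),
                               min_clean_acc=50.0):
    """Return the largest candidate p_m whose FTT-trained model keeps
    a clean accuracy above min_clean_acc, or None if no candidate does."""
    for p_m in sorted(candidates, reverse=True):   # try the largest probability first
        if train_and_eval(p_m) > min_clean_acc:
            return p_m
    return None
```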
The discovered cell architectures are shown in Fig. 7, and
the evaluation results are shown in Table IV. The discovered
architecture F-FTT-Net outperforms the baselines significantly
at various fault ratios. In the meantime, compared with the
most efficient baseline MobileNet-V2, the FLOPs number of
F-FTT-Net is comparable, and the parameter number is only
28.3% (0.65M versus 2.30M). If we require that the accuracy
should be kept above 70%, MobileNet-V2 could function with
a per-MAC error rate of 3e-6, and F-FTT-Net could function
with a per-MAC error rate larger than 1e-4. That is to say,
while meeting the same accuracy requirements, F-FTT-Net
could function in an environment with a much higher SER.
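The tolerable error rates quoted above can be read off Table IV mechanically. The following Python sketch uses the accuracies of the FTT-trained models from Table IV; the names `RATES`, `FTT_ACCURACY`, and `max_tolerable_rate` are chosen for illustration. It scans the tested error rates in increasing order and stops at the first one whose accuracy falls below the requirement.

```python
RATES = [3e-6, 1e-5, 3e-5, 1e-4, 3e-4]           # tested per-MAC error rates

FTT_ACCURACY = {                                  # accuracy (%) of FTT-trained models, Table IV
    "ResNet-20":    [79.1, 79.6, 78.9, 60.6, 11.3],
    "VGG-16":       [82.4, 77.9, 50.7, 11.1, 10.0],
    "MobileNet-V2": [70.3, 69.0, 68.7, 68.1, 47.8],
    "F-FTT-Net":    [88.7, 88.5, 88.0, 86.2, 51.0],
}

def max_tolerable_rate(accuracies, threshold=70.0):
    """Largest tested rate up to which accuracy stays at or above the threshold."""
    best = None
    for rate, acc in zip(RATES, accuracies):
        if acc < threshold:
            break
        best = rate
    return best

for name, accs in FTT_ACCURACY.items():
    print(name, max_tolerable_rate(accs))
```

Only the tested rates are considered, so the returned value is a lower bound on the true tolerable error rate of each model.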
We can see that FTT-NAS is much more effective than
its degraded variant, FT-NAS. We conclude that, generally,
NAS should be used in conjunction with FTT, as suggested by
Eq. 11. Another interesting fact is that, under the MiBB fault
model, the relative rankings of the resilience capabilities of dif-
ferent architectures change after FTT: After FTT, MobileNet-
V2 suffers from the smallest accuracy degradation among 3
baselines, whereas it is the most vulnerable one without FTT.
C. Defend Against adSAF Weight Faults
We conduct FT-NAS and FTT-NAS under the adSAF model.
The overall SAF ratio p = p0 + p1 is set to 8%, in which
the proportion of SAF0 and SAF1 is 83.7% and 16.3%,
respectively (p0=6.7%, p1=1.3%). The reward coefficient αr
is set to 0.2. The loss coefficient αl in FTT-NAS is set to 0.7.
The discovered cell architectures are shown in Fig. 8. As
shown in Table V, the discovered W-FTT-Net outperforms
the baselines significantly at various test SAF ratios, with
comparable FLOPs and fewer parameters. We then apply
channel augmentation to the discovered architecture to explore
the performance of the model at different scales. We can see
that models with larger capacity have better reliability under
the adSAF weight fault model, e.g., 54.2% (W-FTT-Net-40)
VS. 38.4% (W-FTT-Net-20) with 12% adSAF faults.
To investigate whether the model FTT-trained under the
adSAF fault model can tolerate other types of weight faults, we
evaluate the reliability of W-FTT-Net under 1bit-adSAF model
and the iBF model. As shown in Fig. 9 (b)(c), under the 1bit-
adSAF and iBF weight fault model, W-FTT-Net outperforms
all the baselines consistently at different noise levels.
D. The Effectiveness of The Learned Controller
To demonstrate the effectiveness of the learned controller,
we compare the performance of the architectures sampled
by the controller with the performance of architectures
randomly sampled from the search space. For both the MiBB
feature fault model and the adSAF weight fault model, we
randomly sample 5 architectures from the search space, and train
them with FTT for 100 epochs. A per-MAC fault injection
probability of 3e-4 is used for feature faults, and an SAF ratio
of 8% (p0=6.7%, p1=1.3%) is used for weight faults.
TABLE V: Comparison of different architectures under the adSAF weight fault model
                                   Clean          Accuracy with weight faults (%)
Arch            Training   accuracy (%)   0.04   0.06   0.08   0.10   0.12   #FLOPs   #Params
ResNet-20       clean              94.7   64.8   34.9   17.8   12.4   11.0    1110M    11.16M
VGG-16          clean              93.1   45.7   21.7   14.3   12.6   10.6     626M    14.65M
MobileNet-V2    clean              92.3   26.2   14.3   11.7   10.3   10.5     182M     2.30M
W-FT-Net-20     clean              91.7   54.2   30.7   19.6   15.5   11.9    1020M     3.05M
ResNet-20       p=0.08             92.0   86.4   77.9   60.8   41.6   25.6    1110M    11.16M
VGG-16          p=0.08             91.1   82.6   73.3   58.5   41.7   28.1     626M    14.65M
MobileNet-V2    p=0.08             86.3   76.6   55.9   35.7   18.7   15.1     182M     2.30M
W-FTT-Net-20†   p=0.08             90.8   86.2   79.5   69.6   53.5   38.4     919M     2.71M
W-FTT-Net-40    p=0.08             92.1   88.8   85.5   79.3   69.2   54.2    3655M    10.78M
†: The “-N” suffix means that the base of the channel number is N.
Fig. 8: The discovered cell architectures under the adSAF weight fault model. (a) Normal cell. (b) Reduction cell.
Fig. 9: Accuracy curves under different weight fault models. (a) W-FTT-Net under 8bit-adSAF model. (b) W-FTT-Net under
1bit-adSAF model. (c) W-FTT-Net under iBF model. Each panel plots accuracy against the injection ratio (of SAF in (a) and (b), of bit-flips in (c)) for W-FTT-Net, VGG-16, ResNet-20, and MobileNet-V2.
As shown in Table VI and Table VII, the performance of
different architectures in the search space varies a lot, and the
architectures sampled by the learned controllers, F-FTT-Net
and W-FTT-Net, outperform all the randomly sampled
architectures. Note that, as we use different preprocess operations
for feature faults and weight faults (ReLU-Conv-BN 3x3 and
SepConv 3x3, respectively), there exist differences in FLOPs
and parameter number even with the same cell architectures.
TABLE VI: RNN controller VS. random samples under the
MiBB feature fault model
Model        clean acc   pm=3e-4   #FLOPs   #Params
sample1           60.2      19.5     281M     0.81M
sample2           79.7      29.7     206M     0.58M
sample3           25.0      32.2     340M     1.09M
sample4           32.9      25.8     387M     1.23M
sample5           17.4      10.8     253M     0.77M
F-FTT-Net         88.6      51.0     245M     0.65M

TABLE VII: RNN controller VS. random samples under the
adSAF weight fault model
Model        clean acc   p=8%   #FLOPs   #Params
sample1           90.7   63.6     705M     1.89M
sample2           84.7   36.7     591M     1.54M
sample3           90.3   60.3     799M     2.33M
sample4           90.5   64.0     874M     2.55M
sample5           85.2   45.6     665M     1.83M
W-FTT-Net         90.7   68.5     919M     2.71M

E. Inspection of the Discovered Architectures
Feature faults: From the discovered cell architectures
shown in Fig. 7, we can observe that the controller clearly
prefers SepConv and DilConv blocks over ReLU-Conv-BN
blocks. This observation is consistent with our anticipation, as
under the MiBB feature fault model, operations with fewer
FLOPs will result in a lower equivalent fault rate in the OFM.
Under the MiBB feature fault model, there is a tradeoff
between the capacity of the model and the feature error rate.
As the number of channels increases, the operations become
more expressive, but the equivalent error rates in the OFMs
also get higher. Thus there exists a tradeoff point c∗ for the
number of channels. Intuitively, c∗ depends on the per-MAC
error rate pm: the larger pm is, the smaller c∗ is.
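To make the dependence on the MAC count concrete, the sketch below evaluates p = p_m · M_l for a standard convolution (M_l = c·k², as in Sec. III-D) and for a depthwise convolution, which needs only k² MACs per output position. Applying the same formula to a depthwise layer is our illustrative assumption here, as is the function name; the channel number 20 is the first-cell setting used during the search.

```python
def per_position_error_rate(p_m, in_channels, kernel_size, depthwise=False):
    """Equivalent error rate at one OFM position under MiBB: p = p_m * M_l."""
    if depthwise:
        macs = kernel_size ** 2                   # one input channel per output channel
    else:
        macs = in_channels * kernel_size ** 2     # M_l = c * k^2
    return p_m * macs

standard = per_position_error_rate(1e-4, in_channels=20, kernel_size=3)
depthwise = per_position_error_rate(1e-4, in_channels=20, kernel_size=3, depthwise=True)
pointwise = per_position_error_rate(1e-4, in_channels=20, kernel_size=1)
print(standard, depthwise, pointwise)
```

Doubling `in_channels` doubles the rate for the standard and pointwise convolutions, which is the capacity versus error-rate tradeoff described above.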
Besides the choices of primitives, the connection pattern
and combination of different primitives also play a role in
making the architecture fault-tolerant. To verify this, we first
conduct a simple experiment to confirm the preference
of primitives: for each of the 4 different primitives (SepConv
3x3, SepConv 5x5, DilConv 3x3, DilConv 5x5), we stack 5
layers of the primitive, and measure the performance of the stacked
NN after training it with FTT at pm=3e-4. The stacked NNs
achieve accuracies of 60.0%, 65.1%, 50.0% and 56.3% with
pm=1e-4, respectively. The stacked NN of SepConv 5x5
blocks achieves the best performance, which is of no surprise
since the most frequent block in F-FTT-Net is SepConv 5x5.
Then, we construct six architectures: five are obtained by randomly
sampling architectures with only SepConv 5x5 connections, and one
by replacing all the primitives in F-FTT-Net with SepConv 5x5 blocks.
The best result achieved by these six architectures is 77.5%
with pm=1e-4 (versus 86.2% achieved by F-FTT-Net). These
illustrative experiments indicate that the connection pattern
and the combination of different primitives both play a role in the
fault resilience capability of a neural network architecture.
Weight faults: Under the adSAF fault model, the controller
prefers ReLU-Conv-BN blocks over SepConv and DilConv
blocks. This preference is not so easy to anticipate. We
hypothesise that the weight distribution of different primitives
might lead to different behaviors when encountering SAF
faults. For example, if the quantization range of a weight value
is larger, the value deviation caused by an SAF1 fault would
be larger, and we know that a large increase in the magnitude
of weights would damage the performance severely [21]. We
conduct a simple experiment to verify this hypothesis: We
stack several blocks to construct a network, and in each block,
one of the three operations (a SepConv3x3 block, a ReLU-
Conv-BN 3x3 block, and a ReLU-Conv-BN 1x1 block) is
randomly picked in every training step. The SepConv 3x3
block is constructed with a DepthwiseConv 3x3 and two
Conv 1x1, and the ReLU-Conv-BN 3x3 and ReLU-Conv-BN
1x1 contain a Conv 3x3 and a Conv 1x1, respectively. After
training, the weight magnitude ranges of Conv 3x3, Conv
1x1, and DepthwiseConv 3x3 are 0.036±0.043, 0.112±0.121,
0.140±0.094, respectively. Since the magnitude of the weights
in 3x3 convolutions is smaller than that of the 1x1 convolutions
and the depthwise convolutions, SAF weight faults would
cause larger weight deviations in a SepConv or DilConv block
than in a ReLU-Conv-BN 3x3 block.
VI. DISCUSSION
Orthogonality: Most of the previous methods exploit
the inherent fault resilience capability of existing NN architectures
to tolerate different types of hardware faults. In contrast,
our methods improve the inherent fault resilience capability
of NN models, thus effectively increasing the algorithmic fault
resilience “budget” to be utilized by hardware-specific methods.
Our methods are orthogonal to existing fault-tolerance
methods, and can be easily integrated with them, e.g., helping
hardware-based methods to largely reduce their overhead.
Limitation of application-level fault model: There are
faults that are hard to model and unlikely to be mitigated by
our methods, e.g., timing errors and routing/DSP errors in FPGAs.
A hardware-in-the-loop framework could be established
for a thorough evaluation of the system-level fault hazards.
Nevertheless, since the correspondence between these faults and
the application-level elements is subtle, it is more suitable to
mitigate these faults in the lower abstraction layer.
Hardware: In the MiBB feature fault model, we assume
that the add operations are spatially expanded onto independent
hardware adders, which applies to the template-based
designs [51]. For ISA (Instruction Set Architecture) based
accelerators [5], the NN computations are orchestrated using
instructions and time-multiplexed onto hardware units. In this
case, the accumulation of the faults follows a different model
and might show different preferences among architectures.
The FTT-NAS framework itself could still be used with different
fault models. We leave the exploration and experiments of this
model for future work.
Data representation: In our work, an 8-bit dynamic fixed-
point representation is used for the weights and features. As
pointed out in Sec. V-E, the dynamic range has impacts on
the resilience characteristics against weight faults. The data
format itself obviously decides or affects the data range. [52]
found out that the errors in exponent bits of the 32bit floating-
point weights have large impacts on the performance. [23]
investigated the resilience characteristics of several floating-
point and non-dynamic fixed-point representations.
VII. CONCLUSION
In this paper, we analyze the possible faults in various
types of NN accelerators and formalize the statistical fault
models from the algorithmic perspective. After the analysis,
the MAC-i.i.d Bit-Bias (MiBB) model and the arbitrary-
distributed Stuck-at-Fault (adSAF) model are adopted in the
neural architecture search for tolerating feature faults and
weight faults, respectively. To search for the fault-tolerant
neural network architectures, we propose the multi-objective
Fault-Tolerant NAS (FT-NAS) and Fault-Tolerant Training
NAS (FTT-NAS) method. In FTT-NAS, the NAS technique
is employed in conjunction with the Fault-Tolerant Training
(FTT). The fault resilience capabilities of the discovered
architectures, F-FTT-Net and W-FTT-Net, outperform multiple
manually designed architecture baselines, with comparable or
fewer FLOPs and parameters. Moreover, W-FTT-Net trained under
the 8bit-adSAF model can defend against several other types
of weight faults. Generally, FTT-NAS is more effective than
FT-NAS and should be preferred. Since operation primitives differ in their
MACs, expressiveness, and weight distributions, they exhibit
different resilience capabilities under different fault models. The
connection pattern is also shown to influence the
fault resilience capability of NN models.
REFERENCES
[1] K. He et al., “Deep residual learning for image recognition,” in CVPR,
2016.
[2] W. Liu et al., “Ssd: Single shot multibox detector,” in ECCV. Springer,
2016, pp. 21–37.
[3] J. Long et al., “Fully convolutional networks for semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 3431–3440.
[4] T. Chen et al., “Diannao: A small-footprint high-throughput accelerator
for ubiquitous machine-learning,” in ACM Sigplan Notices, 2014.
[5] J. Qiu et al., “Going deeper with embedded fpga platform for convolu-
tional neural network,” in FPGA. ACM, 2016, pp. 26–35.
[6] P. Chi et al., “Prime: A novel processing-in-memory architecture for
neural network computation in reram-based main memory,” in ISCA,
ser. ISCA ’16. IEEE Press, 2016, pp. 27–39.
[7] C. Szegedy et al., “Intriguing properties of neural networks,” arXiv
preprint arXiv:1312.6199, 2013.
[8] A. Shafahi et al., “Poison frogs! targeted clean-label poisoning attacks
on neural networks,” in NIPS, 2018, pp. 6103–6113.
[9] Q. Zhang et al., “Interpreting cnn knowledge via an explanatory graph,”
in AAAI, 2018.
[10] J. Henkel et al., “Reliable on-chip systems in the nano-era: Lessons
learnt and future trends,” in DAC. ACM, 2013, p. 99.
[11] C. Chen et al., “Rram defect modeling and failure analysis based on
march test and a novel squeeze-search scheme,” IEEE Transactions on
Computers, vol. 64, no. 1, pp. 180–190, Jan 2015.
[12] L. Xia et al., “Stuck-at fault tolerance in rram computing systems,” IEEE
Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8,
pp. 102–115, 2018.
[13] Y. Zhao et al., “Memory trojan attack on neural network accelerators,”
in DATE. IEEE, 2019, pp. 1415–1420.
[14] C. Liu et al., “Rescuing memristor-based neuromorphic design with high
defects,” in DAC, June 2017, pp. 1–6.
[15] J.-C. Vialatte et al., “A study of deep learning robustness against
computation failures,” arXiv:1704.05396, 2017.
[16] A. G. Christoph Schorn et al., “Accurate neuron resilience prediction
for a flexible reliability management in neural network accelerators,” in
DATE, 2018.
[17] C. Bolchini et al., “Tmr and partial dynamic reconfiguration to mitigate
seu faults in fpgas,” 10 2007, pp. 87–95.
[18] X. She et al., “Reducing critical configuration bits via partial tmr for
seu mitigation in fpgas,” IEEE Transactions on Nuclear Science, 2017.
[19] Z. Zhao et al., “Fine-grained module-based error recovery in fpga-based
tmr systems,” TRETS, vol. 11, no. 1, p. 4, 2018.
[20] Z. He et al., “Noise injection adaption: End-to-end reram crossbar non-
ideal effect adaption for neural network mapping,” in DAC. ACM,
2019.
[21] G. B. Hacene et al., “Training modern deep neural networks for memory-
fault robustness,” in ISCAS. IEEE, 2019.
[22] A. P. Arechiga et al., “The robustness of modern deep learning archi-
tectures against single event upset errors,” in HPEC, 2018.
[23] G. Li et al., “Understanding error propagation in deep learning neural
network (DNN) accelerators and applications,” in ACM/IEEE
Supercomputing Conference. ACM, 2017, p. 8.
[24] K. Guo et al., “[DL] A survey of FPGA-based neural network inference
accelerators,” ACM TRETS, vol. 12, no. 1, pp. 2:1–2:26, Mar. 2019.
[25] I. Hubara et al., “Quantized neural networks: Training neural networks
with low precision weights and activations,” JMLR, vol. 18, 2017.
[26] S. Borkar, “Designing reliable systems from unreliable components: the
challenges of transistor variability and degradation,” IEEE Micro, vol. 25,
no. 6, pp. 10–16, 2005.
[27] C. Slayman, “Soft error trends and mitigation techniques in memory
devices,” in Proceedings - Annual Reliability and Maintainability Sym-
posium, Jan 2011, pp. 1–5.
[28] F. Libano et al., “Selective hardening for neural networks in FPGAs,”
IEEE Transactions on Nuclear Science, 2018.
[29] C. Carmichael et al., “Correcting single-event upsets through Virtex
partial configuration,” Jun. 2000.
[30] B. Q. Le et al., “Resistive RAM with multiple bits per cell: Array-
level demonstration of 3 bits per cell,” IEEE Transactions on Electron
Devices, vol. 66, no. 1, pp. 641–646, Jan 2019.
[31] B. Liu et al., “Vortex: Variation-aware training for memristor x-bar,” in
DAC, 2015, pp. 1–6.
[32] S. Kannan et al., “Modeling, detection, and diagnosis of faults in
multilevel memristor memories,” IEEE TCAD, 2015.
[33] S. Kannan, J. Rajendran et al., “Sneak-path testing of memristor-based
memories,” in 26th International Conference on VLSI Design, 2013.
[34] L. Xia et al., “Fault-tolerant training with on-line fault detection for
RRAM-based neural computing systems,” in DAC, June 2017, pp. 1–6.
[35] L. Chen et al., “Accelerator-friendly neural-network training: Learning
variations and defects in RRAM crossbar,” in DATE, 2017, pp. 19–24.
[36] T. Liu et al., “A fault-tolerant neural network architecture,” in DAC,
2019, pp. 55:1–55:6.
[37] B. Zoph et al., “Neural architecture search with reinforcement learning,”
ICLR, 2017.
[38] H. Pham et al., “Efficient neural architecture search via parameter
sharing,” in ICML, 2018.
[39] H. Liu et al., “DARTS: Differentiable architecture search,” arXiv preprint
arXiv:1806.09055, 2018.
[40] E. Real, A. Aggarwal et al., “Aging evolution for image classifier
architecture search,” in AAAI, 2019.
[41] B. Baker et al., “Accelerating neural architecture search using perfor-
mance prediction,” arXiv preprint arXiv:1705.10823, 2017.
[42] M. Hu et al., “BSB training scheme implementation on memristor-based
circuit,” in CISDA. IEEE, 2013, pp. 80–87.
[43] M. Haghi et al., “The 90 nm double-DICE storage element to reduce
single-event upsets,” in MWSCAS. IEEE, 2009, pp. 463–466.
[44] B. Reagen et al., “Ares: A framework for quantifying the resilience of
deep neural networks,” in DAC, ser. DAC ’18. ACM, 2018, pp. 17:1–
17:6.
[45] H. Asadi et al., “Analytical techniques for soft error rate modeling and
mitigation of fpga-based designs,” IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 15, no. 12, pp. 1320–1331, Dec 2007.
[46] Z. Zhu et al., “A configurable multi-precision CNN computing framework
based on single bit RRAM,” in DAC, June 2019, pp. 1–6.
[47] D. P. Kingma et al., “Adam: A method for stochastic optimization,”
ICLR, 2015.
[48] R. J. Williams, “Simple statistical gradient-following algorithms for
connectionist reinforcement learning,” Machine Learning, vol. 8, no.
3-4, pp. 229–256, 1992.
[49] A. Krizhevsky et al., “Learning multiple layers of features from tiny
images,” 2009.
[50] I. Loshchilov et al., “SGDR: Stochastic gradient descent with warm
restarts,” ICLR, 2017.
[51] S. I. Venieris et al., “fpgaConvNet: Automated mapping of convolutional
neural networks on FPGAs (abstract only),” in FPGA. ACM, 2017, pp.
291–292.
[52] Z. Yan et al., “When single event upset meets deep neural networks:
Observations, explorations, and remedies,” in ASP-DAC, 2020.
