Abstract-Because deep neural networks (DNNs) rely on a large number of parameters and computations, their implementation in energy-constrained systems is challenging. In this paper, we investigate the solution of reducing the supply voltage of the memories used in the system, which results in bit-cell faults. We explore the robustness of state-of-the-art DNN architectures towards such defects and propose a regularizer meant to mitigate their effects on accuracy. Our experiments clearly demonstrate the interest of operating the system in a faulty regime to save energy without reducing accuracy.
I. INTRODUCTION
Deep Neural Networks [1] (DNNs) are the gold standard for many challenges in machine learning. Thanks to the large number of trainable parameters that they provide, DNNs can capture the complexity of large training datasets and generalize to previously unseen examples.
Many applications of DNNs are in the field of embedded systems. Examples include monitoring of health signals, human-machine interfaces, autonomous drones, and smartphone applications. Many such embedded applications cannot rely on cloud-based processing because of stringent latency constraints or privacy issues. Even when cloud processing is an option, processing in-device or at the network edge can be useful to save network bandwidth. The energy consumption of the inference task is thus a major concern. Unfortunately, because state-of-the-art DNN architectures are composed of a large number of trained parameters, the inference step typically requires significant energy to achieve accurate results on challenging tasks, with a large part of the energy complexity being associated with the memory accesses required to retrieve the parameters and save temporary results.
Since off-chip memory accesses consume significant energy, a first step for reducing energy consumption consists in storing all parameters and temporary results on the same chip as the hardware accelerator, using static random access memories (SRAMs), or potentially embedded dynamic RAMs [2]. However, even in this case, the energy consumed by memory accesses still represents 30-60% of the total energy [3]. An effective way of lowering energy consumption of both memory and logic circuits is to reduce the supply voltage, but this has the effect of increasing the sensitivity of the circuits to fabrication variations, causing bit-cell failures in SRAMs. When approaching the minimum energy operating point of SRAMs, the failure rates increase by several orders of magnitude compared to operating at the nominal supply [4]. However, even such large bit-cell failure rates are not necessarily catastrophic if appropriate mechanisms are in place to safeguard the operation of the system. DNNs naturally exhibit a limited amount of fault tolerance, as noted for instance in [5], [6], and there is a growing body of work that studies the operation of DNN inference hardware built using faulty memories. We review several contributions in Section II.
The aim of this paper is to investigate the ability to decrease the energy consumption of DNN accelerators by allowing the memories used for storing weights and activations to operate in a faulty regime, thus introducing deviations on the stored values. We rely on simple but realistic energy-deviation models to explore the impact of memory failures on classification accuracy, and ultimately on energy consumption.
We quantify the impact on robustness of several design aspects of state-of-the-art deep architectures in order to identify whether these aspects should be targeted when designing robust architectures. Specifically, we consider the choice of general architecture, how the depth of a layer impacts its robustness, and the impact of faults occurring in the storage of weights or of neuron activations. Interestingly, we find that different architectures provide varying degrees of robustness.
We then consider whether faulty operation can lead to a reduction in power consumption. Importantly, we compare the energy consumption with a reliable reference implementation that achieves the same application performance. We show that using a faulty implementation to reduce energy consumption at the cost of a reduction in accuracy is not necessarily beneficial, even when the loss in accuracy appears small. Indeed, for state-of-the-art architectures, accepting even a 1% reduction in accuracy can significantly reduce the number of parameters required by a reliable implementation. It is thus essential to evaluate the improvement provided by a faulty architecture at the same accuracy. Nonetheless, we show that faulty operation can reduce energy consumption when the fault statistics are taken into account during training.
The outline of the paper is as follows. Section II briefly reviews related work. Section III introduces the deviation models, which represent the impact of circuit faults on the algorithm. Section IV presents an exploration of the design space for faulty-memory implementations of modern DNNs. Section V proposes a regularizer to increase the robustness of DNNs to deviations. Section VI provides some conclusions.
II. RELATED WORK
The idea of exploiting fault tolerance to improve the energy efficiency of neural networks has attracted a significant number of contributions. An early investigation of the effect of transistor-level defects on neural networks was performed in [7]. More recently, circuit-level methods for improving the application performance of faulty implementations have been proposed. One approach consists in using razor flip-flops to detect faults and selectively apply a compensation mechanism. When memory faults can be detected at the bit level, a bit masking technique can be applied to ensure that errors always reduce the magnitude of weights, helping to decrease the impact of the errors on performance [8], [9]. Similarly, razor flip-flops can be used to compensate timing violations occurring in the datapath by dropping the next operation, which effectively sets its weight parameter to zero [10]. Finally, a low-precision replica can be added to computation units to bound the maximum error that can be introduced by a faulty processing unit [11].
To the best of our knowledge, few papers investigate the effect of training deep architectures to increase fault robustness. One notable exception is [3], which proposes modifying the training procedure to take into account bit flips occurring in SRAMs, and presents results on the MNIST benchmark [12]. The effect of faults occurring in the storage of the input is also considered in [13], and [14] proposes on-chip learning for support-vector machines, while decreasing the learning effort using active learning. Finally, a slightly different problem is considered in [15], [16], where the network is trained to compensate for known defect locations.
Another line of work consists in compressing models to reduce memory usage and number of computations. There are mainly three ways to achieve this. A first one is to quantize weights, using in the extreme case only one bit per weight and per activation [17]-[19]. While the process has proven very efficient on old and somewhat redundant architectures, it can drastically affect accuracy when performed on already compressed architectures. A second way to compress DNNs is to prune the weights, significantly reducing the number of parameters to be stored [20]. A last line of work consists in factorizing weights, so that they can be used to perform multiple computations throughout the processing of an input [21]-[23]. However, in modern architectures the weights account for only a small portion of the memory, since the neuron activations can be as numerous as the weights, or even more numerous when the batch size is large, that is, when several inputs are processed in parallel.
III. ENERGY-DEVIATION MODEL
We focus on the energy consumed by memory accesses, and assume that the amount of energy required to perform an inference task is proportional to the number of accesses. We thus define a base energy metric E_o that is the sum of the number of parameters and of the number of activation values generated during the inference.
To decrease the energy consumption of on-chip memories, we consider reducing the supply voltage, which in turn causes some bit cells to fail. In order to investigate the general behavior of DNNs implemented with faulty memory, we need a model linking the bit-cell fault probability p and the energy consumed by memory accesses. We denote by 0 ≤ η ≤ 1 the normalized energy consumption of the memory, where the normalization is with respect to the energy consumption of the reliable memory (such that the energy is given by ηE_o). Note that we can obtain a simple upper bound for p from the fact that instead of using a faulty memory, we could store only a fraction η of the data while declaring the missing bit-cells as faulty, which yields a linearly decreasing p(η).
Based on reliability data published in [4, Fig. 7], we will assume that the energy-reliability function takes the exponential form
p = exp(-aη). (1)
In order to obtain a specific value of parameter a for illustrative purposes, we select a to fit the energy data reported in [24, Fig. 1] and the reliability from [4, Fig. 7] for 65 nm CMOS SRAM cells at V_DD ∈ {0.5, 1.1}. Performing the fit by minimizing the sum of the relative squared error yields a = 12.8. Specific energy gains will vary based on the value of a, but in this paper we are only interested in identifying general trends.
The manner in which memory faults introduce deviations during inference depends on the strategy being used to cope with faults. We consider the case where bit-cell faults can be detected, and use the bit masking (BM) approach proposed in [8]. When a fault is detected on the sign bit of a value, this value is replaced with zero. In the case of failures on any other bits, the affected bit values are replaced with the sign bit, causing the value to deviate towards zero. We consider that all bit cells have an equal fault probability p. When using the deviation model in simulations, we assume that values are quantized on 8 bits. However, for a fair comparison with the reliable implementations that use a floating-point representation, we compute the deviation in the quantized domain, but apply it on the floating-point representation. Unless otherwise mentioned, we consider that faults affect both the weights and the neuron activations. Note that activations are known to be positive since they are generated by a ReLU function. Therefore, we assume that their sign bit cannot be affected.
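The bit-masking rule described above can be sketched as follows. This is a minimal illustration rather than the simulation code used for the experiments: the function name `bit_mask_deviation`, the symmetric per-tensor quantization scale, and the two's-complement encoding are all assumptions made for the example. Under two's complement, replacing a faulty bit with the sign bit clears it for non-negative values and sets it for negative values, so the magnitude never grows.

```python
import numpy as np

def bit_mask_deviation(x, p, rng, n_bits=8, sign_bit_can_fail=True):
    """Apply a bit-masking (BM) deviation to a float array.

    Assumptions of this sketch: symmetric per-tensor quantization and a
    two's-complement encoding on n_bits. The deviation is computed in the
    quantized domain and then added to the floating-point values.
    """
    x = np.asarray(x, dtype=np.float64)
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / (2 ** (n_bits - 1) - 1)      # 127 levels per sign for 8 bits
    q = np.round(x / scale).astype(np.int64)       # quantized values

    # One independent fault indicator per bit cell; last index is the sign bit.
    faults = rng.random(x.shape + (n_bits,)) < p
    if not sign_bit_can_fail:                      # e.g. ReLU activations
        faults[..., n_bits - 1] = False

    # Integer mask of the faulty magnitude bits of each value.
    bit_weights = 1 << np.arange(n_bits - 1)
    fault_mask = (faults[..., : n_bits - 1] * bit_weights).sum(axis=-1)

    # Faulty bits take the value of the sign bit: cleared if q >= 0, set if q < 0.
    q_faulty = np.where(q < 0, q | fault_mask, q & ~fault_mask)
    # A detected sign-bit fault replaces the whole value with zero.
    q_faulty = np.where(faults[..., n_bits - 1], 0, q_faulty)

    deviation = (q_faulty - q) * scale             # computed in the quantized domain
    return x + deviation                           # applied to the float values
```

Calling `bit_mask_deviation(weights, p, np.random.default_rng(0))` returns a perturbed copy of `weights`; passing `sign_bit_can_fail=False` mimics the treatment of ReLU activations, whose sign bit is assumed fault-free.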
In this work, we use our deviation model during the training phase to increase the robustness of networks and thus their energy efficiency. Because training is computationally intensive, we propose to simplify the BM deviation model used during training to speed up the process. Since the BM approach always causes values to deviate towards zero, we propose to approximate it using a deviation model that will be referred to as the erasure model, for which each value has a probability p_e of being set to zero. We then need to choose p_e to best approximate the effect of the BM model. We can first note that in the case of weight parameters, the BM model sets the faulty value to zero in case of a sign-bit fault, which occurs with probability p. Therefore, we clearly need p_e > p. During training, this process is similar to dropout [25], but it is used to increase the robustness of networks, and not to prevent overfitting. To find the best choice of p_e to approximate the BM model, we evaluate the performance of both models on the test set and choose the value of p_e that best predicts the accuracy of the network under the BM model.
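As an illustration of the erasure model, the sketch below sets each value to zero independently with probability p_e. The function name `erase` is hypothetical. Unlike the usual inverted-dropout formulation, the surviving values are not rescaled here; that choice is an assumption of the sketch, made because the memory faults being approximated would not be compensated at inference time.

```python
import numpy as np

def erase(x, p_e, rng):
    """Erasure model: each value is set to zero with probability p_e.

    No rescaling of the surviving values is applied (assumption of this
    sketch, see the accompanying text).
    """
    x = np.asarray(x, dtype=np.float64)
    keep = rng.random(x.shape) >= p_e   # True where the value survives
    return x * keep
```

For example, `erase(weights, 0.02, np.random.default_rng(0))` would zero out roughly 2% of the entries of `weights` on average.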
IV. DESIGN-SPACE EXPLORATION FOR FAULTY IMPLEMENTATIONS

A. Choice of architecture and dataset
We perform experiments using the CIFAR10 dataset [26], made of tiny color images of 32×32 pixels. We first compare four architectures, namely PreActResNet18 [27], MobileNetV2 [28], SENet18 [29] and ResNet18 [30], which are all modern architectures achieving good accuracy on CIFAR10. Table I shows for each architecture the number of weights (parameters) and activation values of neurons that must be retrieved from memory for processing one input, and the accuracy achieved by that architecture.
In Fig. 1, we compare the robustness of the above-mentioned architectures when the parameters and activations are affected by the BM deviation model. We observe that some architectures are inherently more robust than others, and that this does not depend solely on the global number of parameters. In Fig. 2, we plot the accuracy in terms of the energy ηE_o per inference, where the base energy E_o corresponds to the sum of the parameter and activation columns of Table I, and the fault probability is obtained from the normalized energy η using (1). We observe that PreActResNet18 provides a very interesting trade-off between accuracy, memory accesses and robustness to BM. Therefore we choose to focus on this architecture for the remaining experiments.
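The mapping between normalized energy, fault probability and energy per inference used for this kind of plot can be sketched as below. The sketch assumes that the exponential energy-reliability function of Section III has the form p = exp(-aη), with the fitted value a = 12.8; the function names are hypothetical, and the parameter and activation counts are left as arguments because they depend on the architecture.

```python
import math

A_FIT = 12.8  # fitted value of parameter a reported in Section III

def fault_probability(eta, a=A_FIT):
    """Bit-cell fault probability p for a normalized energy 0 <= eta <= 1,
    assuming the exponential form p = exp(-a * eta)."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return math.exp(-a * eta)

def normalized_energy(p, a=A_FIT):
    """Inverse mapping: normalized energy eta that yields fault probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p) / a

def inference_energy(n_params, n_activations, eta):
    """Energy metric eta * E_o, where E_o is the number of parameters plus
    the number of activation values accessed for one inference."""
    return eta * (n_params + n_activations)
```

With these helpers, sweeping `eta` over [0, 1] gives, for each operating point, the fault probability to inject and the corresponding position on the energy axis.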
B. Comparison of the BM and erasure models
As motivated in Section III, we are interested in comparing the effects of BM and erasures on the chosen architecture. Results are depicted in Fig. 3. Since the BM model affects weights and activations differently and since PreActResNet18 has about 20× more weights than activation values, we focus on matching the accuracy of the two models when only weights are affected by deviations. We observe for this case that the BM and erasure models have a similar effect, provided that p_e = 2p, suggesting that using erasures as a proxy to model the deviations induced by BM is a reasonable option. This relation will be used in Section V to train networks to be more resilient to BM deviations.
C. Relative importance of layer depth
In a new series of experiments, we aim at identifying the relative robustness of various parts of the architecture under BM deviations. To this end, we introduce deviations on only a portion of the network. Since PreActResNet18 is composed of 4 sequential blocks (made of convolutional layers and shortcuts), we apply BM deviations to the weights and activations of only one block at a time. Results are depicted in Fig. 4. We observe that all parts of the network are sensitive to deviations. Interestingly, in the region of small accuracy degradation shown in Fig. 4, robustness increases monotonically with the depth of the block. We thus consider exploiting the varying robustness of the layers to improve energy consumption by assigning different operating points to each block. Denoting by p_Bi the fault probability assigned to block i, we note from
D. Impact of the number of parameters in the architecture
The number of parameters can be easily adapted by modifying the number of feature maps in the convolutional layers. If the number of feature maps of each convolutional layer is multiplied by k, then the total number of parameters will be roughly multiplied by k^2, as the number of parameters in a convolutional layer increases linearly with both the number of input feature maps and the number of output feature maps.
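The k^2 scaling can be checked with a short worked example. The layer sizes below (64 and 128 feature maps, 3×3 kernels) are illustrative values chosen for the sketch, not figures taken from the experiments, and bias terms are ignored. The second case shows why the scaling is only approximate: the first layer reads the 3 color channels of the image, which do not scale with k.

```python
def conv_weight_count(c_in, c_out, kernel_size=3):
    """Number of weights in a 2-D convolution (bias terms ignored)."""
    return c_in * c_out * kernel_size * kernel_size

def scaling_ratio(c_in, c_out, k, kernel_size=3, scale_input=True):
    """Ratio of weight counts after multiplying the feature maps by k."""
    new_in = round(k * c_in) if scale_input else c_in
    new_out = round(k * c_out)
    return (conv_weight_count(new_in, new_out, kernel_size)
            / conv_weight_count(c_in, c_out, kernel_size))

# Inner layer: both input and output feature maps are scaled, ratio is k**2.
inner = scaling_ratio(64, 128, k=0.5)                    # 0.25
# First layer: the 3 input color channels are fixed, ratio is only k.
first = scaling_ratio(3, 64, k=0.5, scale_input=False)   # 0.5
```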
We train two variants of the PreActResNet architecture in which the original number F of feature maps is multiplied by 1/2 and 1/√2. These networks are used to provide a reference for the performance achieved with faulty implementations. The F/2 and F/√2 networks achieve respectively an accuracy of 93.45% and 94.41% under reliable implementations, illustrating the fact that significant energy reductions can be obtained easily if a reduced accuracy is acceptable.
V. PROPOSED REGULARIZER
All previous experiments confirm that modern DNN architectures can tolerate some amount of deviations. However, in all the scenarios considered, we observe a sharp drop in performance as soon as the probability p of defect becomes too large or the energy too small. To improve the robustness to deviations, we consider training the networks in the same conditions they are used in, which means that we apply erasures during the forward pass of the training phase. We call this method the erasure regularizer. We use erasures rather than BM deviations in order to speed up the training process.
In Fig. 5, we plot the accuracy of the networks as a function of the energy they use. We compare reliable implementations of networks with varying number of parameters with the performance obtained when reducing the supply voltage of memories. For the specific energy model discussed in Section III, the best energy reduction obtained by the faulty implementations with F feature maps is 1.5× for the network with standard training, achieved at an accuracy of 94.76% and a fault rate of p = 0.001, while the best energy reduction obtained using the erasure regularizer is 2.3× at an accuracy of 94.8% and p = 0.01. Furthermore, additional gains can be obtained by combining the erasure regularizer with the blockwise reliability assignment of Sect. IV-C. We thus see that training the network for robustness using the erasure regularizer can significantly improve the energy reduction obtained from faulty operation under the bit-masking model, at equal accuracy. As discussed in Sect. IV-B, it is important to perform the training with the appropriate p_e parameter: using an erasure regularizer with p_e = p did not yield an improvement in robustness.
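The idea of applying erasures in the forward pass during training can be illustrated on a toy problem. The sketch below trains a logistic-regression classifier rather than a DNN, and the function name, learning rate, and step count are assumptions made for the example. The way gradients are handled (computed through the erased weights and applied to the stored weights, so that erased entries receive no update in that step) is also an assumption of the sketch, not a detail confirmed by the text.

```python
import numpy as np

def train_with_erasures(x, y, p_e, steps=200, lr=0.5, seed=0):
    """Gradient descent on a logistic-regression model whose weights are
    randomly erased (set to zero with probability p_e) in every forward pass."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])                     # stored (fault-free) weights
    for _ in range(steps):
        keep = rng.random(w.shape) >= p_e        # fresh erasure pattern
        w_faulty = w * keep                      # weights seen by the forward pass
        probs = 1.0 / (1.0 + np.exp(-(x @ w_faulty)))
        grad = x.T @ (probs - y) / len(y)        # gradient w.r.t. w_faulty
        w -= lr * grad * keep                    # erased weights get no update
    return w
```

Following the relation observed in Sect. IV-B, a network intended to operate at bit-cell fault probability p would be trained with `p_e = 2 * p`.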
VI. CONCLUSION
In this work, we explored the possibility of exploiting the fault tolerance of deep neural networks to reduce the energy consumption of on-chip memories. We showed that in some conditions, reducing the supply voltage can result in better accuracy for the same energy consumption compared to reducing the number of parameters. We showed that a deviation model corresponding to detectable bit-cell faults combined with a bit masking technique can be replaced by a simpler erasure model to speed up the training, and that the use of this regularizer during the training phase makes it possible to further reduce the energy with no impact on accuracy.
Finding the architecture that achieves the best accuracy for a given energy budget remains a largely open question, considering the very large number of possible solutions. As such, a more systematic study of the combined impact of pruning, quantizing, factorizing, reducing the number of parameters, tweaking hyperparameters and reducing supply voltage is a very promising direction for future work.
