

# LJMU Research Online

Joksas, D, Freitas, P, Chai, Z, Ng, WH, Buckwell, M, Li, C, Zhang, WD, Xia, QF, Kenyon, AJ and Mehonic, A

Committee Machines—A Universal Method to Deal with Non-Idealities in Memristor-Based Neural Networks

http://researchonline.ljmu.ac.uk/id/eprint/13408/

Article

**Citation** (please note it is advisable to refer to the publisher's version if you intend to cite from this work)

Joksas, D, Freitas, P, Chai, Z, Ng, WH, Buckwell, M, Li, C, Zhang, WD, Xia, QF, Kenyon, AJ and Mehonic, A Committee Machines—A Universal Method to Deal with Non-Idealities in Memristor-Based Neural Networks. Nature Communications. ISSN 2041-1723 (Accepted)


## Committee Machines—A Universal Method to Deal with Non-Idealities in Memristor-Based Neural Networks

D. Joksas<sup>1</sup>, P. Freitas<sup>2</sup>, Z. Chai<sup>2</sup>, W. H. Ng<sup>1</sup>, M. Buckwell<sup>1</sup>, C. Li<sup>3</sup>, W. D. Zhang<sup>2</sup>, Q. Xia<sup>3</sup>, A. J. Kenyon<sup>1</sup>, and A. Mehonic<sup>1</sup>

<sup>1</sup>Department of Electronic and Electrical Engineering, University College London, London (United Kingdom)

<sup>2</sup>Department of Electronics and Electrical Engineering, Liverpool John Moores University, Liverpool (United Kingdom)

<sup>3</sup>Department of Electrical and Computer Engineering, University of Massachusetts Amherst (United States of America)

#### Abstract

Artificial neural networks are notoriously power- and time-consuming when implemented on conventional von Neumann computing systems. Consequently, recent years have seen an emergence of research in machine learning hardware that strives to bring memory and computing closer together. A popular approach is to realise artificial neural networks in hardware by implementing their synaptic weights using memristive devices. However, various device- and system-level non-idealities usually prevent these physical implementations from achieving high inference accuracy. We suggest applying a well-known concept in computer science, committee machines, in the context of memristor-based neural networks. Using simulations and experimental data from three different types of memristive devices, we show that committee machines employing ensemble averaging can successfully increase inference accuracy in physically implemented neural networks that suffer from faulty devices, device-to-device variability, random telegraph noise and line resistance. Importantly, we demonstrate that the accuracy can be improved even without increasing the total number of memristors.

#### I. INTRODUCTION

Artificial neural networks (ANNs), with all of their variants, are now the main tools in machine learning tasks, such as classification. The vast amounts of data being constantly produced have enabled successful training and operation of ANNs. However, to achieve high inference accuracy, it is usually necessary for neural networks to have a large number of parameters. This results in both training [1] and inference [2] stages being time- and power-consuming. This is largely caused by the need to transfer data from memory to computing units; physical separation of memory and computing is the essence of any von Neumann system.

One of the most promising solutions to these problems is the paradigm of non-von Neumann computing and, specifically, analogue implementations of synapses (weights) in physical ANNs. Because there are many more synapses than there are neurons in ANNs, the matrix-vector multiplications, in which the synaptic weight values are used, are the costliest operations in these networks, both in terms of power and time. Computing directly in memory would minimise data transfers from off-chip memory, thus the most popular approach is using analogue memory devices as proxies for synaptic weights of ANNs (both fully connected and their variants [3, 4]). A common technique is to arrange such devices in a structure called a crossbar array, in which every device (or a pair of devices) is used to represent a single synaptic weight or, more generally, an entry in a matrix [5]. Memristive devices, such as phase-change memories (PCMs) [6, 7] or resistive random-access memories (RRAMs) [8, 9], have been considered as candidates for such tasks. Although here we focus on ex-situ training, such systems have been successfully utilised for in-situ training too [10, 11].

In memristive implementations of ANNs, the main concern is that various non-idealities associated with these devices can prevent these systems from achieving high accuracy [12, 13]. Examples of non-idealities affecting inference accuracy include, but are not limited to, devices not being able to electroform, devices stuck in one of the resistance states after electroforming, device-to-device (D2D) variability and random telegraph noise (RTN). When training analogue systems in-situ, limited endurance and non-linear resistance modulation too have to be taken into account. To mitigate the effects of these device non-idealities, it is often necessary to modify device structure [9], to use more advanced programming schemes [14] or to use additional circuitry [15] or high-precision processing units [16] in conjunction with memristive elements. On the system level, there is an issue of line resistance which affects the distribution of currents and thus decreases the accuracy. These line resistance effects can be partially compensated for algorithmically [17] or partially mitigated by using multiple smaller crossbar arrays [18]. Examples of past efforts at dealing with these and other non-idealities of memristive devices and systems are listed in Table I; most of these non-idealities are still the main focus of the research in the neuromorphic community.

We propose a simple way to mitigate the effects of all types of non-idealities during inference. We suggest combining several non-ideal memristor-based neural networks into committees to achieve better accuracy. The committee machine (CM) method we propose significantly increases the inference accuracy and does not increase the computation time because memristive ANNs in such committees work in parallel.

In this work, we first explain the simulation setup: what networks were trained, how they were simulated and how they were combined into CMs. The experimental part follows. We investigate three different types of memristor technology: tantalum/hafnium oxide-based ($Ta/HfO_2$), tantalum oxide-based ($Ta_2O_5$), and amorphous vacancy modulated conductive oxide-based (aVMCO) devices. By exploring their non-idealities relevant to inference (faulty devices, D2D variability, RTN, and line resistance), we use the experimental data to simulate memristive ANNs working individually and in committees.

#### II. RESULTS

#### A. Simulation setup

Fully connected ANNs were trained in software to recognise handwritten digits (using the MNIST database [19]). Architectures with one hidden layer were explored. Unless stated otherwise, the simulations used networks with 25 hidden neurons. However, networks with 50, 100 and 200 hidden neurons were additionally employed to evaluate the effectiveness of the proposed method while controlling for the total number of memristors required. Following training, the weights of the ANNs were mapped onto pairs of conductances using a proportional mapping scheme (see [20]) to simulate memristor-based ANNs. Finally, these memristive networks were disturbed using experimental data to reflect the effect of device- and system-level non-idealities.
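To make the idea of mapping weights onto pairs of conductances concrete, the sketch below shows one common differential-pair mapping. It is an illustration under stated assumptions, not necessarily the exact scheme of [20]: the function name, the use of `g_min` as the "zero" level of each pair, and the example values are all hypothetical.

```python
import numpy as np

def map_weights_to_conductance_pairs(weights, g_min, g_max):
    """Map each weight onto a (G+, G-) pair so that G+ - G- is proportional to the weight.

    Assumption: the largest |weight| uses the full range [g_min, g_max], and the
    unused device of each pair is held at g_min. Requires at least one non-zero weight.
    """
    weights = np.asarray(weights, dtype=float)
    scale = (g_max - g_min) / np.abs(weights).max()
    g_pos = g_min + scale * np.maximum(weights, 0.0)   # carries positive weights
    g_neg = g_min + scale * np.maximum(-weights, 0.0)  # carries negative weights
    return g_pos, g_neg, scale

# Hypothetical 2x2 weight matrix and a 0.1 mS to 1.0 mS operating range
weights = np.array([[0.5, -1.0], [0.0, 0.25]])
g_pos, g_neg, scale = map_weights_to_conductance_pairs(weights, 0.1, 1.0)

# The original weights are recovered from the conductance difference
recovered = (g_pos - g_neg) / scale
```

In a mapping of this kind, non-idealities act on `g_pos` and `g_neg` individually, which is why they can usually be represented as modifications of individual weights.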

After simulating physical non-idealities, the networks were combined into CMs that employed ensemble averaging (EA) [21]. The principle of EA is shown in Figure 1A—several networks are combined in parallel and then their outputs are averaged. After that, the prediction is made using the averaged vector—the prediction is the label corresponding to the largest entry in the vector.
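The averaging step described above can be sketched in a few lines. This is a minimal illustration, assuming each network returns one output vector per input sample; the function name and the example numbers are hypothetical.

```python
import numpy as np

def ensemble_average_predict(outputs):
    """Predict labels by ensemble averaging.

    outputs: array of shape (n_networks, n_samples, n_classes) holding the
    output vectors of every network in the committee.
    """
    outputs = np.asarray(outputs, dtype=float)
    averaged = outputs.mean(axis=0)   # average the networks' output vectors
    return averaged.argmax(axis=1)    # label of the largest averaged entry

# Three hypothetical networks, two input samples, three classes
outputs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
])
predictions = ensemble_average_predict(outputs)
```

For the first sample, two of the three networks individually prefer class 0, yet the averaged vector favours class 1. Averaging is therefore not the same as majority voting: it also takes into account how confident each network is.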

CM methods are frequently used even with conventional ANNs. Methods such as EA often produce better accuracy than that of the best individual network in a committee [22]. Although there are other types of CMs besides EA, they often rely on training additional gating networks or boosting networks during the training stage. Using a gating network in this scenario would produce additional problems: to avoid it acting as a performance bottleneck, it too would have to be implemented on crossbar arrays. Various non-idealities would decrease the effectiveness of this gating network, which is responsible for making the decisions about the whole committee of ANNs. Likewise, we speculate that boosting of networks would not be feasible in ex-situ training because it requires information about where individual ANNs perform poorly; this cannot be known precisely until they are implemented physically on crossbar arrays and the non-idealities manifest themselves. To the authors' best knowledge, the application of boosting in the context of memristive neural networks seems to have been explored only once before [23]; as expected, it requires training each memristive implementation differently because non-idealities manifest themselves differently in different crossbar arrays.

There exist modifications of the EA algorithm that could potentially perform better. One example of this is the generalized ensemble method (GEM) which, instead of using equal weightings for each network during averaging (as in EA), uses a different one for each network [21]. These weightings are analytically determined by considering the correlation of errors between different networks. But because [21] only considered networks with a mean square error loss function (while our networks used a cross-entropy loss function), this work does not explore GEM. Instead, we investigated whether it is possible to achieve a better performance by optimising the weightings numerically. This method, like GEM and others previously mentioned, might be impractical because, firstly, these weightings could be determined only after the ANNs are physically implemented on crossbars, and, secondly, the devices could change throughout their lifetimes, thus affecting the optimal weightings.

Even with the assumption that the devices would have perfect retention, we found that optimisation of weightings achieves effectively the same performance as EA. For these reasons, we focus only on EA in the main text, but present our results of optimising weightings in Supplementary Figure S5. We stress that we are open to the idea that other CM methods besides EA could be utilised successfully for ex-situ training in the context of memristive ANNs. However, in this work we focus on demonstrating that CMs can be used to improve the accuracy of memristor-based ANNs in general.

With EA, we find that even when the memristive ANNs that go into a committee all use the same digital weights that are mapped onto crossbar arrays (see Figure 1B), a committee of memristor-based networks can still achieve higher accuracy than just a single non-ideal network. Although all networks have the same *digital* weights before mapping, their physical implementations (which we call "disturbances" in Figures 1B, C because they can usually be represented by the modification of individual weights) will be different. For example, in one crossbar array, a certain set of devices will be faulty, while in the other crossbar array, it will be a different set. This will result in different physical implementations having slightly different learned representations of the data set, or, to paraphrase, different networks will be "damaged" differently by the non-idealities. This means that these committees will be able to combine different representations, and thus achieve higher accuracy. However, by definition, such an approach would almost certainly not yield a committee accuracy that is higher than the accuracy of a single digitally implemented network.

A better approach is to use different digital networks for different physical implementations that go into a committee (see Figure 1C). This approach much more closely resembles the conventional application of EA in computer science. In the context of memristive crossbar arrays, it would not only help to mitigate the effects of the non-idealities (as in the case of Figure 1B), but would also allow the representations of digital networks that were different even before the mapping stage to be combined. Most importantly, this method allows a committee to achieve higher accuracy, which is sometimes even higher than that of individual networks with digitally implemented weights. We thus used this method in this analysis. An example comparison of these two approaches is presented in Supplementary Figure S8.

In this work, any given committee used only one network architecture but each network was initialised differently before training, thus trained networks had different sets of weights. Although it was not explored in this work, combining different network architectures in a committee of memristor-based networks might be advantageous. Furthermore, in this work we focus on fully connected ANNs but CMs could be applied to other variants of neural networks as well. Due to the simplicity of EA, it could, for example, be employed in convolutional neural networks (CNNs) [24], which are often used for image classification. This might be of interest as CNNs have been successfully implemented using crossbar arrays recently [25]. However, crossbar implementations are naturally more suited to fully connected networks, therefore we limit ourselves to this architecture but are open to exploring the effectiveness of EA with memristive CNNs in the future.

#### B. Ta/HfO<sub>2</sub> RRAM

With array-level data available, Ta/HfO<sub>2</sub> experiments provide the most complete picture of device- and system-level non-idealities. In this subsection, we present not only the analysis of faulty devices and D2D variability, but also careful consideration of the line resistance effects. Ta/HfO<sub>2</sub> memristors do not exhibit apparent RTN and overall have excellent retention properties [26], and thus are perfect candidates for inference application.

#### 1. Faulty devices and device-to-device variability

The most energy-efficient procedure to modulate the conductance of memristors is by the application of voltage pulses. In an ideal scenario, one would apply identical pulses and observe constant increases in conductance with each pulse. This is rarely the case in practice, but, fortunately, this type of behaviour is more relevant for in-situ training where it is necessary to ensure linear adjustment of an ANN's weights [27]. In ex-situ training, conductance verification schemes can be used to program the devices precisely. Because the devices would have to be programmed only once, one can spend additional resources to do so accurately by applying SET (potentiation) and RESET (depression) pulses until a desirable conductance state is achieved.

Even with this approach, there remain two obstacles: faulty devices and D2D variability. It is observed in most memristor technologies that at least a small fraction of the devices tends to get stuck in a particular conductance state. Additionally, even if not stuck, different devices might behave differently; for example, they might have different conductance ranges. Figure 2A shows conductance changes in Ta/HfO<sub>2</sub> RRAM devices (in a $128 \times 64$ crossbar array) when voltage pulses are applied to them. We can see from the median values that overall the devices' conductance tends to increase as more SET pulses are applied. However, the wider bottom regions of the violin plots indicate that some devices are stuck around the high resistance state (HRS) and cannot set entirely no matter how many voltage pulses are applied. There also exist devices that are stuck in the low resistance state (LRS), or simply do not span the full conductance range.

Figure 2A combines data from multiple SET cycles for each of the memristors, thus it is important to understand how these devices behave individually. Figures 2B-F show the conductance of 5 (out of 8,192) devices over 11 SET and RESET cycles. In the five diagrams, the radial component represents the conductance (in mS) and the angular component represents the number of applied pulses. Figure 2B shows an example of preferable (and typical) device behaviour: conductance changes in a continuous fashion and spans a wide range of conductance values, from $\sim 0.1\,\mathrm{mS}$ to $\sim 1.0\,\mathrm{mS}$. Although RESET cycles tend to feature abrupt decreases in conductance, one can always repeat a cycle and exploit the more predictable behaviour of SET cycles.

When encoding continuous numbers into crossbar devices' conductances, it is often preferable to choose a large enough conductance range. Using data from Figure 2A, one could, for example, choose the range between the first and the last median points (from $\sim 0.1\,\mathrm{mS}$ to $\sim 1.0\,\mathrm{mS}$). The device whose behaviour is presented in Figure 2B could be easily set to any conductance within that range, as we have seen before. On the other hand, the device whose behaviour is presented in Figure 2C, although operating in a predictable fashion, has a smaller conductance range. We can see that in all cycles, its conductance does not exceed 0.8 mS. This is an example of D2D variability that can make it difficult to choose an optimal operating range and set the conductance of all devices precisely.

The device whose behaviour is presented in Figure 2D shows high cycle-to-cycle variability. Although that could prove to be a problem in some applications, this specific device might perfectly serve its purpose in ex-situ training of ANNs. We can observe that this device spans the same conductance range as the device from Figure 2B, even if in an unpredictable manner. Because all states in the full range are, in theory, achievable, one can cycle the device multiple times until it is set to the required conductance level. Lastly, we have devices whose negative effect is most difficult to mitigate: faulty devices. Figure 2E shows the behaviour of a device stuck at high conductance values, while Figure 2F shows the behaviour of a device stuck at low conductance values. No matter how many pulses are applied to these devices or how many times they are cycled, they exhibit almost no conductance variation and thus, in most cases, cannot be used to encode information.

Knowing that some devices perform like the ones whose behaviour is shown in Figures 2C,E,F, it is important to minimise their negative effect. If the conductance that a device has to be set to is outside that device's range, it is sensible to set it to the closest achievable conductance. Although there is little that can be done about fully stuck memristors, it is possible to optimise the behaviour of devices like the one in Figure 2C that simply have a smaller conductance range. For example, if such a device has to be set to 0.9 mS, one would set it to the highest achievable conductance ($\sim 0.8\,\mathrm{mS}$). In the following simulations involving faulty devices and D2D variability, the operating range between the first and the last median points was used, and the devices were chosen randomly from the $128 \times 64$ crossbar and set to the most desirable states, as described in this paragraph.
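The "closest achievable conductance" rule amounts to clipping each target to the range of the device it is assigned to. The sketch below illustrates this; the per-device ranges are hypothetical values chosen to mimic a device with a reduced range and a device stuck at high conductance, not measured data.

```python
import numpy as np

def program_conductances(targets, g_min, g_max):
    """Set each device to the achievable conductance closest to its target.

    targets, g_min and g_max are per-device arrays in mS; g_min and g_max are
    the lowest and highest conductance each device can reach.
    """
    return np.clip(targets, g_min, g_max)

# Hypothetical devices: reduced range, full range, stuck at high conductance
targets = np.array([0.9, 0.5, 0.3])   # desired conductances (mS)
g_min = np.array([0.1, 0.1, 0.95])    # assumed per-device minima (mS)
g_max = np.array([0.8, 1.0, 1.0])     # assumed per-device maxima (mS)

programmed = program_conductances(targets, g_min, g_max)
```

Here the first device is set to 0.8 mS instead of 0.9 mS, the second reaches its target exactly, and the stuck device ends up far from its target, which is the error a committee then has to absorb.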

#### 2. Line resistance

The effect of line resistance can be extremely detrimental in many crossbar-based implementations of ANNs. That is especially the case if the crossbars used are large and the resistance of the interconnects is high (compared to memristors' resistance). Because in a neural network many of the inputs are non-zero at any given time, a lot of current accumulates in the bit lines which results in significant voltage drops across the interconnects, and thus the current distribution across the crossbar is affected in a major way.

The Ta/HfO<sub>2</sub> crossbar has shape $128 \times 64$ and so this shape was chosen for all the simulations involving line resistance. Even relatively small ANNs of architecture 784(+1):25(+1):10 would need $2 \times (785 \times 25 + 26 \times 10) = 39,770$ memristors to be implemented. Even if not all the inputs were used at any given time, it would not be possible to fit all the memristors onto a single crossbar of shape $128 \times 64$. To overcome this, we decided to simulate multiple crossbars, each of which would implement a subset of the synaptic weights, but, for a given synaptic layer, would all compute in parallel. Because $\lceil 785/128 \rceil = 7$, seven crossbars were used to implement the first synaptic layer; the first crossbar utilized the bottom 113 word lines, while the other six crossbars used the bottom 112 word lines because $113 + 6 \times 112 = 785$. The second synaptic layer was implemented using an eighth crossbar utilizing its bottom 26 word lines.
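The arithmetic in this paragraph can be checked with a short script. The helper names are hypothetical, and the balanced split of word lines is an assumption that happens to reproduce the 113 + 6 × 112 allocation quoted above.

```python
import math

def memristor_count(n_inputs, n_hidden, n_outputs):
    """Two memristors per weight, with one bias input added to each layer."""
    return 2 * ((n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs)

def split_word_lines(n_rows_needed, rows_per_crossbar):
    """Spread the required word lines as evenly as possible over the fewest crossbars."""
    n_crossbars = math.ceil(n_rows_needed / rows_per_crossbar)
    base, extra = divmod(n_rows_needed, n_crossbars)
    return [base + 1] * extra + [base] * (n_crossbars - extra)

total = memristor_count(784, 25, 10)         # 784(+1):25(+1):10 network
first_layer_rows = split_word_lines(785, 128)
first_layer_bit_lines = 2 * 25               # one bit-line pair per hidden neuron

print(total)             # 39770
print(first_layer_rows)  # [113, 112, 112, 112, 112, 112, 112]
```

The 50 bit lines needed by the first layer fit within the 64 available on each crossbar, so only the word-line dimension forces the split.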

Figure 3A shows an example of how the first synaptic layer of a 784(+1):25(+1):10 neural network could be implemented. Specifically, it shows how the first subset of weights would be implemented using one of the crossbars. Because we use a proportional mapping scheme, positive and negative weights would be implemented in different bit lines. In Figure 3A, memristors designated to implement positive weights are coloured in blue, memristors designated to implement negative weights are coloured in orange and unelectroformed memristors are coloured in black. Because simulations were constrained by experimental data, some of the devices were left unused and assumed to be unelectroformed. In practice, the crossbars could be manufactured to fit the geometry of the ANNs.

In each synaptic layer, the corresponding output currents from each of the crossbars would be added together. Additionally, output currents at the bit lines implementing negative weights would be subtracted from the output currents at the neighbouring bit lines (to their left) implementing positive weights. For example, in the example configuration of Figure 3A, output current at the 2<sup>nd</sup> bit line would be subtracted from the output current at the 1<sup>st</sup> bit line, etc.
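The two combination steps, summing over crossbars and subtracting neighbouring bit lines, can be sketched as follows. The layout assumed here (even-indexed bit lines carry positive weights, the bit line to their right carries negative weights) follows the description above; the function name and the current values are hypothetical.

```python
import numpy as np

def combine_currents(crossbar_currents):
    """Combine bit-line currents of the crossbars that share one synaptic layer.

    crossbar_currents: array of shape (n_crossbars, n_bit_lines), where bit
    lines alternate between positive-weight and negative-weight columns.
    """
    total = np.asarray(crossbar_currents, dtype=float).sum(axis=0)  # add crossbars
    return total[0::2] - total[1::2]  # positive column minus its right neighbour

# Two hypothetical crossbars with four bit lines each (arbitrary units)
currents = np.array([
    [1.0, 0.4, 0.2, 0.5],
    [0.5, 0.1, 0.3, 0.1],
])
layer_outputs = combine_currents(currents)  # two outputs, one per bit-line pair
```

With these numbers the first pair yields a net current of 1.0 and the second pair a net current of -0.1, showing how a pair of non-negative conductances can represent a negative weight.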

Unfortunately, even when using multiple smaller crossbars, the interconnects can significantly disturb current distribution in the crossbar. The average decreases in output current due to line resistance in all seven crossbars of Ta/HfO<sub>2</sub> devices are shown in the heatmap in Figure 3B; the device resistance ranges from $\sim 1\,\mathrm{k\Omega}$ to $\sim 11\,\mathrm{k\Omega}$, and the interconnect resistance is $0.35\,\Omega$ and $0.32\,\Omega$ in the word and bit lines, respectively. We can see that the current decreases can range from $\sim 12\%$ at the outputs nearest to the applied voltages to $\sim 16\%$ at the outputs in the rightmost bit lines that are used. In the supplementary information, we provide a possible strategy of mitigating line resistance effects in supervised learning. This scheme was not employed in the simulations described in the main text because we wanted to find out how well the CM method would deal with noticeable line resistance effects.

Figure 4 shows the accuracy of individual networks, as well as of their committees; memristive ANNs were simulated by taking into account the three non-idealities of the Ta/HfO<sub>2</sub> crossbar explored earlier: faulty devices, D2D variability and line resistance. As indicated by the yellow box plot in Figure 4, individual networks implemented digitally achieve $\sim 95.9\%$ median accuracy. Networks disturbed to reflect the effect of non-idealities achieve $\sim 91.0\%$ median accuracy, as indicated by the vermilion box plot. Although that is a substantial drop in accuracy, we see that the accuracy increases as more networks are added to the committee. When 5 networks are used in a committee, median accuracy increases up to $\sim 95.7\%$, as indicated by the rightmost green box plot.

#### C. Ta<sub>2</sub>O<sub>5</sub> RRAM

In order to explore the effectiveness of minimising adverse effects of RTN, we use another memristor technology based on  $Ta_2O_5$ . To investigate RTN, measurements from a single device were considered. To simulate line resistance effects, interconnect resistance from  $Ta/HfO_2$  was used and the same crossbar shape was assumed.

#### 1. Random telegraph noise

Memristors often suffer from RTN resulting in a different accuracy at any given instant in time.  $Ta_2O_5$  device was characterised by measuring the current of 8 resistance states multiple times. Figure 5 shows the cumulative probability plots for those resistance states, together with lognormal fits modelling the nature of RTN. One of the things that the figure reveals is that higher resistance states suffer from higher degree of RTN. Fits for every resistance state, together with occurrence rates (see Supplementary Table SII), were used to disturb the weights of ANNs in order to reproduce the effect of RTN.

#### 2. Inference accuracy

The results combining RTN and line resistance effects for the $Ta_2O_5$ device are shown in Figure 6. From the difference in median accuracy between the yellow and blue box plots, we can notice that there is a significant drop in accuracy simply due to the mapping of weights onto conductances. That is not surprising given that only 8 states were available for mapping. One can also observe that the further drop in median accuracy due to non-idealities is not as severe: it drops to $\sim 94.1\%$. The RTN disturbance magnitude is limited to below 100% in most cases, which possibly contributes to its smaller effect on accuracy. Additionally, the $Ta_2O_5$ device has much higher resistance (ranging from $25\,\mathrm{k\Omega}$ to $200\,\mathrm{k\Omega}$), thus line resistance is also less of a concern. When non-ideal networks are combined into committees of 5, the median accuracy jumps to $\sim 96.5\%$, which is even higher than the software baseline of individual networks. This reveals an additional trend seen in all the simulations performed: the higher the accuracy of the individual non-ideal memristive networks, the higher the accuracy of the committees that they are part of.

#### D. aVMCO RRAM

Further, we consider a third memristor technology, one based on aVMCO materials. We test the effects of RTN by considering measurements from a single device. Line resistance effects were simulated by using the interconnect resistance and shape of the Ta/HfO<sub>2</sub> crossbar array.

#### 1. Random telegraph noise

Figure 7 shows the cumulative probability plots for 8 resistance states of an aVMCO device suffering from RTN. Like in  $Ta_2O_5$ , we observe that higher resistance states experience RTN of higher magnitude. However, compared to  $Ta_2O_5$ , the RTN magnitude is much more predictable. Fits for each of the 8 resistance states, together with occurrence rates (see Supplementary Table SIII), were used to simulate the effect of RTN in aVMCO-based neural networks.

#### 2. Inference accuracy

The results combining RTN and line resistance are shown in Figure 8. As with $Ta_2O_5$, we see a large drop due to mapping onto conductances, a consequence of very few states being available for mapping. More interestingly, the accuracy of individual memristor-based networks with and without non-idealities is almost identical. That is because the occurrence rate of RTN in the aVMCO device is small and there is a much smaller probability of RTN having large magnitude. Additionally, the resistance of the aVMCO device is even higher than that of the $Ta_2O_5$ device: it ranges from $1\,\mathrm{M\Omega}$ to $7.5\,\mathrm{M\Omega}$. Therefore, line resistance has an even smaller effect in a hypothetical array of aVMCO devices. Because the median accuracy of individual non-ideal memristor-based networks is higher (~94.6%), the median accuracy of committees is higher too: in committees of size 5 it increases to ~96.7%.

#### III. DISCUSSION

The results from the previous section suggest that the method of using committee machines to improve the accuracy of memristive neural networks is technology- and non-ideality-agnostic. CMs can mitigate the effects of faulty devices, D2D variability, RTN and line resistance in combination with each other. Although the CM method is slightly less effective with large line resistance (see discussion in the supplementary information), in all cases, we observe that the accuracy of individual non-ideal networks largely determines the accuracy of committees. That is consequential because it means that although committees always increase the accuracy, there is still an incentive to optimise the devices and systems that implement these networks: the higher the accuracy of individual networks, the higher the accuracy of the committees.

It is also important to consider whether using larger networks, instead of committees of smaller networks, would yield the same results if the same number of synapses (or memristors) was used in the large network as in the committee of smaller networks. In our previous work we found that the accuracy of networks before disturbance (which we call "starting accuracy") has a large effect on the robustness to non-idealities: the higher the starting accuracy, the more robust the networks become [20]. One way to achieve higher starting accuracy is to have larger networks, e.g. if we have a network with one hidden layer, we might increase the number of neurons in that hidden layer, which would likely result in higher accuracy after training and thus higher robustness.

Figure 9 shows a comparison of CMs of memristor-based networks disturbed using faulty devices and D2D variability data from the Ta/HfO<sub>2</sub> crossbar, when controlled for the total number of memristors that is required to implement them (line resistance was not taken into account due to the long time required to simulate it in large networks). We can observe that committees of two networks, each with 25 hidden neurons (leftmost data point of the orange curve), achieve $\sim 0.9\%$ higher median accuracy than individual networks with 50 hidden neurons (second data point from the left in the vermilion curve), despite both requiring an almost identical total number of memristors. Committees of two networks, each with 100 hidden neurons (third data point from the left in the orange curve), achieve $\sim 1.1\%$ higher median accuracy than individual networks with 200 hidden neurons (rightmost data point in the vermilion curve), even though both require almost the same total number of memristors. An even larger improvement is gained when committees of four networks, each with 50 hidden neurons (second data point from the left in the blue curve), are used instead; then the accuracy is improved by $\sim 1.5\%$, with almost exactly the same total number of memristors used.
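The claim that these configurations need almost the same number of memristors follows from the counting formula used earlier, $2 \times (785 \times h + (h + 1) \times 10)$ for $h$ hidden neurons. The script below is a small check of that arithmetic; the function name is hypothetical, and pairing the committee of four 50-neuron networks with a single 200-neuron network is our reading of the comparison above.

```python
def memristor_count(n_hidden, committee_size=1, n_inputs=784, n_outputs=10):
    """Total memristors for a committee of identical one-hidden-layer networks."""
    per_network = 2 * ((n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs)
    return committee_size * per_network

two_of_25 = memristor_count(25, committee_size=2)
one_of_50 = memristor_count(50)
two_of_100 = memristor_count(100, committee_size=2)
four_of_50 = memristor_count(50, committee_size=4)
one_of_200 = memristor_count(200)

print(two_of_25, one_of_50)                 # 79540 79520
print(two_of_100, four_of_50, one_of_200)   # 318040 318080 318020
```

The small differences come from the output layer: every extra network in a committee brings its own set of output weights and hidden-layer bias connections.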

For different non-idealities and even different training schemes of the ANNs, the equivalents of Figure 9 might be different, but they share a few common characteristics. In all cases, for a given total number of memristors, there is an optimal number of networks that should be used in a committee. Additionally, we observe that the more severe a non-ideality is, the more apparent the effectiveness of committees becomes. Finally, committees (for a fixed total number of memristors) might sometimes achieve lower accuracy than individual networks, but only if the networks that they replace are very small and the non-ideality is not very detrimental. If the networks being replaced with committees of smaller networks are sufficiently large, the committees will achieve higher accuracy. An example is shown in Supplementary Figure S7, where the aVMCO device is minimally affected by the non-idealities and so the advantage of committees becomes apparent only when replacing larger networks.

The reason why committees work in the context of non-ideal implementations, and why they work best when they are used to replace large networks, might to some extent lie in their training. When fully connected networks are trained, their accuracy tends to saturate as more parameters are added. Supplementary Figure S4 shows that networks with 50 hidden neurons can be trained to achieve significantly higher accuracy than networks with 25 hidden neurons. However, networks with 200 hidden neurons achieve only slightly higher accuracy than networks with 100 hidden neurons. This also means that networks with 200 hidden neurons will be only slightly more robust to non-idealities than networks with 100 hidden neurons. When such networks are affected by non-idealities, their accuracy drops to similar values, but the smaller network can work in a committee with other networks, totalling almost the same number of memristors as the large network but achieving higher accuracy overall. This is the most likely reason why committees of smaller networks are effective at dealing with non-idealities, especially when replacing large networks.

In addition to the accuracy improvements, committees can provide flexibility in memristive implementations of neural networks. Digital implementations of ANNs have very predictable behaviour due to the precision of digital logic. Analogue implementations, on the other hand, can vary greatly even if they use the same weights before the mapping onto conductances; that is a result of the stochastic nature of the memristors that implement these ANNs. The parallel and modular nature of committee machines makes memristive systems much more flexible. For example, if the verification accuracy of one of the ANNs in a memristor-based CM deteriorates below acceptable levels, its outputs could be disabled to ensure higher accuracy of the rest of the committee.
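The idea of disabling a degraded committee member can be illustrated with a minimal sketch. It is not taken from the simulation code of this work: the function names, the threshold and the example numbers are all hypothetical, and the outputs are assumed to be per-class probability vectors.

```python
def active_members(verification_accuracies, threshold):
    """Indices of committee members whose verification accuracy is acceptable."""
    return [i for i, acc in enumerate(verification_accuracies) if acc >= threshold]


def averaged_output(outputs, active):
    """Average the output vectors of the enabled members only."""
    if not active:
        raise ValueError("at least one committee member must remain enabled")
    n_classes = len(outputs[0])
    return [sum(outputs[i][k] for i in active) / len(active) for k in range(n_classes)]


# Hypothetical numbers: the second member has degraded and is switched off.
accuracies = [0.95, 0.40, 0.93]
outputs = [[0.7, 0.3], [0.0, 1.0], [0.5, 0.5]]
enabled = active_members(accuracies, threshold=0.90)
print(enabled, averaged_output(outputs, enabled))
```

With these made-up values the degraded member would otherwise pull the committee towards the second class; once it is excluded, the remaining two members decide the output.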

Importantly, this introduced parallelism comes at almost no extra cost. For a fixed total number of memristors, a committee of smaller networks, compared to a large individual network, would only require a few additional output and bias neurons, and an averaging functionality, which could potentially be implemented in hardware. For example, an ANN with 50 hidden neurons would require 846 neurons in total, while a committee of two ANNs, each with 25 hidden neurons (and thus requiring almost the same total number of memristors), would require 857 neurons in total.
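The neuron counts quoted above can be reproduced with a short calculation. The sketch below assumes one-hidden-layer networks with 784 inputs and 10 outputs, one bias neuron per synaptic layer, and an input layer (with its bias neuron) that is shared by all committee members; the sharing is our reading of the quoted totals rather than something stated explicitly. The function names are hypothetical.

```python
def count_neurons(n_hidden, n_networks=1, n_inputs=784, n_outputs=10):
    """Total neurons in a committee of one-hidden-layer networks.

    The input neurons and the input-layer bias neuron are counted once;
    every member adds its hidden neurons, a hidden-layer bias neuron and
    its own output neurons.
    """
    shared = n_inputs + 1
    per_network = n_hidden + 1 + n_outputs
    return shared + n_networks * per_network


def count_synapses(n_hidden, n_networks=1, n_inputs=784, n_outputs=10):
    """Total synaptic weights (each mapped onto a pair of memristors)."""
    per_network = (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
    return n_networks * per_network


print(count_neurons(50), count_neurons(25, n_networks=2))    # 846 857
print(count_synapses(50), count_synapses(25, n_networks=2))  # 39760 39770
```

Under these assumptions the committee needs 11 more neurons and 10 more synapses (20 more memristors, at two devices per weight) than the single larger network.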

In summary, our simulations employing experimental data from three different types of memristive devices show that committee machines employing ensemble averaging can be used to mitigate the effects of device- and system-level non-idealities in memristor-based neural networks. EA makes it possible to achieve higher inference accuracy in physically implemented neural networks that suffer from faulty devices, device-to-device variability, random telegraph noise, and even line resistance. This method is a universal way to deal with the most common non-idealities and is straightforward to implement during the fabrication stage. Increased modularity of these memristive neural network systems will increase not only their inference accuracy, but also their robustness and flexibility, even without the need to sacrifice area. Although some level of non-idealities in memristors is unavoidable, the CM method allows us to deal with these on the system level and is agnostic to a particular technology or, to some degree, the type of non-ideality.

#### METHODS

## Experiments

The Ta/HfO<sub>2</sub> RRAM 1T1R array consists of NMOS transistors fabricated in a commercial fab (feature size of $2\,\mathrm{\mu m}$) and Pt/HfO<sub>2</sub>/Ta devices. The bottom electrode was deposited by evaporation of a 20 nm Pt layer on top of a 2 nm tantalum (Ta) adhesive layer; the electrode was patterned by photolithography and a lift-off process. A $5 \,\mathrm{nm}$ HfO<sub>2</sub> switching layer was deposited by atomic layer deposition using water and tetrakis(dimethylamido)hafnium as precursors at 250 °C. Sputter-deposited Ta of 50 nm thickness followed by 10 nm Pd was used in a lift-off process to serve as the top electrode. The filamentary $Ta_2O_5$ device consists of a TiN/4 nm stoichiometric $Ta_2O_5$/20 nm nonstoichiometric $TaO_x$/10 nm TaN/TiN stack with a cross-sectional area of $75 \,\mathrm{nm} \times 75 \,\mathrm{nm}$, while the non-filamentary aVMCO device has a cross-sectional area of $135 \,\mathrm{nm} \times 135 \,\mathrm{nm}$ and is composed of a TiN/8 nm amorphous-Si/8 nm anatase $TiO_2$/TiN stack. $Ta_2O_5$ and aVMCO devices were fabricated by imec. The detailed fabrication process parameters can be found in references [11, 28, 29] for Ta/HfO<sub>2</sub>, Ta<sub>2</sub>O<sub>5</sub> and aVMCO RRAMs respectively.

The conductance of Ta/HfO<sub>2</sub> devices was modulated by applying SET pulses (500 µs @ 2.5 V and gate voltage increasing from 0.6 V to 1.6 V). After each of the 11 cycles, RESET pulses were applied (5 µs @ 0.9 V increasing to 2.2 V and gate voltage of 5 V). The voltage was increased linearly throughout the 100 pulses. All electrical tests for $Ta_2O_5$ and aVMCO devices were done with a Keysight B1500A. The RTN data were extracted by switching the device into 8 uniformly distributed resistance levels between $25 \,\mathrm{k\Omega}$ and $200 \,\mathrm{k\Omega}$, and 8 nearly uniformly distributed resistance levels between $1\,\mathrm{M\Omega}$ and $7.5\,\mathrm{M\Omega}$, with incremental RESET DC sweeps [30], for $Ta_2O_5$ and aVMCO respectively. RTN measurements were then carried out at each resistance level with a 0.1 V and 3 V read-out for $Ta_2O_5$ and aVMCO respectively, with a sampling time of 2 ms/point and 10,000 sampling points per resistance level, giving an RTN measurement period of 20 s.

#### Simulations

In this work, feed-forward ANNs with fully connected layers and continuous weights were trained to recognise handwritten digits using the MNIST database. All 60,000 MNIST training images were used during the training stage; the training set consisted of 50,000 images and the verification set consisted of 10,000 images. All 10,000 test images were used to evaluate the inference accuracy of the ANNs. Networks used 784 input neurons representing pixel intensities of MNIST images of $28 \times 28$ pixel size, as well as one bias neuron. Ten output neurons were used; they represented the ANNs' predictions of the 10 handwritten digits. Hidden layers used the sigmoid activation function, while the output layer used the softmax activation function. Weights were optimised by minimising the cross-entropy error function using stochastic gradient descent. A learning rate of 0.01 and patience of 25 epochs were used. For each architecture explored, 25 networks were trained by initialising them differently. When numerically optimising ANNs' weightings, optimisation was performed by employing the verification set, while the performance was evaluated using the test set. The code was implemented in Python.
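For readers less familiar with the output layer and error function named above, the following sketch shows the standard definitions of softmax and of the cross-entropy error for a single example. It is illustrative only and is not the training code used in this work; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output-layer activations into class probabilities."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, label):
    """Cross-entropy error for one example whose true class is `label`."""
    return -math.log(probabilities[label])


# An untrained network that is equally unsure about all 10 digits
# has an error of ln(10) on any example.
uniform = softmax([0.0] * 10)
print(cross_entropy(uniform, label=3), math.log(10))
```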

Weights were mapped onto pairs of memristors' conductances using a proportional mapping scheme: synaptic weights were made proportional to one of the conductances in the pair, while the other was left unelectroformed. The zero weight was interpreted as given; in practice, it would be implemented by not electroforming the device, thus resulting in its negligible conductance. Although aVMCO devices do not have an electroforming stage, for consistency we assumed that additional insulating circuit elements could be used to implement the zero weight. Negative weights would be implemented by placing certain memristors in dedicated bit lines of the crossbars whose outputs would be subtracted from the outputs at the corresponding bit lines implementing positive weights. Maximum weights after mapping were optimised separately for each set of network architecture and conductance levels; in each case this was done by excluding a certain proportion, $p_{\rm L}$, of weights with the largest absolute values. The $p_{\rm L}$ values used for each simulation are summarised in Supplementary Table SI. More details on the mapping procedure can be found in our past work [20].
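A simplified sketch of such a proportional mapping is given below. It is an illustration under several assumptions rather than the procedure of [20]: the maximum weight is taken as the largest absolute value that remains after excluding the proportion $p_{\rm L}$ of largest weights, excluded weights are clipped to that maximum, the unelectroformed device is given exactly zero conductance, and the discrete conductance levels available in real devices are ignored. The names `map_weights`, `g_max` and `p_l` are hypothetical.

```python
def map_weights(weights, g_max, p_l=0.0):
    """Map weights onto (G_positive, G_negative) conductance pairs.

    The conductance of one device in each pair is proportional to |w|;
    the other device is left at zero conductance (unelectroformed).
    """
    if not 0.0 <= p_l < 1.0:
        raise ValueError("p_l must be in [0, 1)")
    magnitudes = sorted(abs(w) for w in weights)
    n_excluded = int(p_l * len(magnitudes))
    w_max = magnitudes[len(magnitudes) - 1 - n_excluded]
    if w_max == 0.0:
        raise ValueError("maximum weight after exclusion is zero")
    k = g_max / w_max  # proportionality constant between weight and conductance
    pairs = []
    for w in weights:
        g = k * min(abs(w), w_max)  # excluded (largest) weights are clipped
        pairs.append((g, 0.0) if w > 0 else (0.0, g))
    return pairs


# Hypothetical example: the outlier 4.0 is excluded when choosing w_max.
print(map_weights([0.5, -1.0, 0.0, 4.0], g_max=1e-3, p_l=0.25))
```

Excluding the largest weights stretches the remaining ones over the available conductance range, at the cost of clipping a few outliers.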

All non-idealities, except for line resistance, were simulated by disturbing the individual conductances of memristor-based ANNs. To investigate line resistance, nodal analysis was employed: simultaneous linear equations were set up using Ohm's law and Kirchhoff's current law, and were then solved in sparse matrix representation using the Python library SciPy.
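The nodal analysis can be sketched for a small crossbar as follows. This is an illustration, not the simulation code of this work. The circuit topology is an assumption: voltages are applied at the left end of the word lines, the bottom end of each bit line is grounded, and one line segment of resistance `r_word` or `r_bit` separates neighbouring nodes. To keep the example dependency-free, a dense Gaussian elimination is swapped in for the sparse SciPy solver, so it is suitable only for very small arrays. All names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (dense)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def crossbar_currents(conductances, voltages, r_word, r_bit):
    """Bit-line output currents of a crossbar with resistive interconnects."""
    n_rows, n_cols = len(conductances), len(conductances[0])
    g_w, g_b = 1.0 / r_word, 1.0 / r_bit
    n = n_rows * n_cols

    def word(i, j):  # index of the word-line node at crosspoint (i, j)
        return i * n_cols + j

    def bit(i, j):  # index of the bit-line node at crosspoint (i, j)
        return n + i * n_cols + j

    a = [[0.0] * (2 * n) for _ in range(2 * n)]
    b = [0.0] * (2 * n)

    def connect(p, q, g):
        """Add conductance g between unknown nodes p and q (KCL terms)."""
        a[p][p] += g
        a[q][q] += g
        a[p][q] -= g
        a[q][p] -= g

    for i in range(n_rows):
        for j in range(n_cols):
            connect(word(i, j), bit(i, j), conductances[i][j])  # memristor
            if j == 0:  # first word-line segment leads to the voltage source
                a[word(i, 0)][word(i, 0)] += g_w
                b[word(i, 0)] += g_w * voltages[i]
            else:
                connect(word(i, j - 1), word(i, j), g_w)
            if i == n_rows - 1:  # last bit-line segment leads to ground
                a[bit(i, j)][bit(i, j)] += g_b
            else:
                connect(bit(i, j), bit(i + 1, j), g_b)

    v = solve_linear_system(a, b)
    # Current through the final (grounded) segment of each bit line.
    return [g_b * v[bit(n_rows - 1, j)] for j in range(n_cols)]


# Hypothetical 2 x 2 array with interconnect resistances of 0.35 and 0.32 ohm.
print(crossbar_currents([[1e-3, 2e-3], [3e-3, 4e-3]], [0.5, 0.2], 0.35, 0.32))
```

As the line resistances approach zero, the outputs approach the ideal vector-matrix product of the input voltages and the conductances.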

After simulating memristor non-idealities, committees of different ANNs were composed. Committees used EA, i.e. the outputs of individual networks in a committee were averaged to produce a single output vector. In EA, the output vectors of individual networks can simply be added together (if the weightings of different networks are the same, as we assume in the main text); the label corresponding to the entry with the highest value would be the prediction of the committee. This addition can be performed either in software or, if the activation function of the last neuronal layer can be implemented physically, by adding the corresponding currents produced by the circuitry of this activation function.
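Ensemble averaging with equal weightings can be sketched in a few lines. The function name and the example outputs below are hypothetical; the outputs are assumed to be softmax vectors of three networks for one input image with three possible classes.

```python
def committee_prediction(outputs):
    """Predicted label of a committee that uses ensemble averaging.

    `outputs` holds one output vector per network. With equal weightings,
    summing the vectors gives the same prediction as averaging them.
    """
    n_classes = len(outputs[0])
    summed = [sum(vector[k] for vector in outputs) for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: summed[k])


outputs = [
    [0.60, 0.40, 0.00],  # this network alone would predict class 0
    [0.10, 0.50, 0.40],
    [0.20, 0.45, 0.35],
]
print(committee_prediction(outputs))  # 1
```

In this made-up case the first network is overruled by the other two, which is how a committee can recover from the errors of an individual disturbed network.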

In the simulations, neural networks that go into a committee were chosen randomly. This was done to reflect the most convenient strategy when manufacturing such systems: because one does not need to selectively choose the networks, manufactured crossbars can be easily programmed without the need to replace them if they perform poorly when working individually (unless their effect is so detrimental that they have to be ignored, which this technique makes possible). Besides, devices might change over time; these simulations, which show what happens when one does not selectively choose the networks, are therefore valuable for investigating conditions where it is not possible to replace the networks.

In the simulations, 25 base networks were used (each having a different set of weights) for each of the architectures. All of their weights were then mapped onto pairs of conductances using HRS/LRS values extracted from experiments. Finally, to reflect the effect of each of the non-idealities, all networks were disturbed multiple times. In each disturbance iteration, multiple combinations of networks were chosen and their performance as a committee of a certain size was evaluated. In total, for most simulations, 10,000 data points were recorded for a committee of every size; these data captured the variations of base networks, their combinations and different disturbance iterations. Only simulations involving line resistance or numerical optimisation of weights had fewer data points for some committee sizes (due to long simulation times).
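The sampling scheme can be sketched as a pair of nested loops. The split of the 10,000 data points into 100 disturbance iterations of 100 combinations each is a hypothetical choice made for illustration; the actual split is not stated here. The disturbance and evaluation steps are left as a comment, and the function name is hypothetical.

```python
import random


def sample_committees(n_base, committee_size, n_iterations, n_combinations, seed=0):
    """Randomly choose committee members for every disturbance iteration.

    Returns (iteration, member_indices) records, one per data point.
    """
    rng = random.Random(seed)
    records = []
    for iteration in range(n_iterations):
        # A full simulation would disturb all base networks afresh here and
        # then evaluate each sampled committee on the test set.
        for _ in range(n_combinations):
            members = tuple(sorted(rng.sample(range(n_base), committee_size)))
            records.append((iteration, members))
    return records


records = sample_committees(n_base=25, committee_size=3,
                            n_iterations=100, n_combinations=100)
print(len(records))  # 10000
```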

## DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

#### AUTHOR CONTRIBUTIONS

A.M. and D.J. conceived the idea and designed the study. A.M., P.F. and Z.C. performed the experimental measurements. D.J. performed the simulations and analysed the experimental and simulation results. C.L. and Q.X. provided the experimental data of the programming of a Ta/HfO<sub>2</sub> 1T1R RRAM array. A.M., W.D.Z. and A.J.K. supervised the research. D.J. wrote the initial manuscript. All authors contributed to the discussions of the results and improved the text.

## COMPETING INTERESTS STATEMENT

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### FUNDING

A.M. acknowledges funding from the Royal Academy of Engineering under the Research Fellowship scheme; A.J.K. acknowledges funding from the Engineering and Physical Sciences Research Council (EP/P013503/1) and the Leverhulme Trust (RPG-2016-135); W.D.Z. acknowledges funding from the Engineering and Physical Sciences Research Council (EP/S000259/1).

- [1] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning
   in NLP," arXiv preprint arXiv:1906.02243, 2019.
- [2] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," in *International Conference on Learning Representations*, 2016, San Juan (Puerto Rico), arXiv preprint arXiv:1510.00149.
- [3] C. Li, Z. Wang, M. Rao, D. Belkin, W. Song, H. Jiang, P. Yan, Y. Li, P. Lin, M. Hu, N. Ge, J. P. Strachan, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Long short-term memory networks in memristor crossbar arrays," *Nature Machine Intelligence*, vol. 1, no. 1, pp. 49–57, 2019, doi: 10.1038/s42256-018-0001-4.
- [4] Z. Wang, C. Li, W. Song, M. Rao, D. Belkin, Y. Li, P. Yan, H. Jiang, P. Lin, M. Hu, J. P.
  Strachan, N. Ge, M. Barnell, Q. Wu, A. G. Barto, Q. Qiu, R. S. Williams, Q. Xia, and J. J.
  Yang, "Reinforcement learning with analogue memristor arrays," *Nature Electronics*, vol. 2, no. 3, p. 115, 2019, doi: 10.1038/s41928-019-0221-6.
- [5] Z. Sun, G. Pedretti, E. Ambrosi, A. Bricalli, W. Wang, and D. Ielmini, "Solving matrix equations in one step with cross-point resistive arrays," *Proceedings of the National Academy of Sciences*, vol. 116, no. 10, pp. 4123–4128, 2019, doi: 10.1073/pnas.1815682116.
- [6] S. R. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, "A phase-change memory model for neuromorphic computing," *Journal of Applied Physics*, vol. 124, no. 15, p. 152135, 2018, doi: 10.1063/1.5042408.
- [7] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. D. Nolfo, S. Sidler, M. Giordano, M. Bodini, N. C. P. Farinha, B. Killeen, C. Cheng, Y. Jaoudi, and G. W. Burr,
  "Equivalent-accuracy accelerated neural-network training using analogue memory," *Nature*,
  vol. 558, no. 7708, pp. 60–67, 2018, doi: 10.1038/s41586-018-0180-5.
- [8] S. Yu, Z. Li, P. Y. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian, "Binary neural network with 16 Mb RRAM macro chip for classification and online training," in *International Electron Devices Meeting*. IEEE, 2016, San Francisco (United States), doi: 10.1109/IEDM.2016.7838429.
- [9] J. Woo, K. Moon, J. Song, S. Lee, M. Kwak, J. Park, and H. Hwang, "Improved synaptic behavior under identical pulses using AlO<sub>x</sub>/HfO<sub>2</sub> bilayer RRAM array for neuromorphic systems," *IEEE Electron Device Letters*, vol. 37, no. 8, pp. 994–997, 2016, doi: 10.1109/LED.2016.2582859.
- [10] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," *Nature*, vol. 521, no. 7550, pp. 61–64, 2015, doi: 10.1038/nature14441.
- [11] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin, Z. Wang, W. Song, J. P. Strachan, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Efficient and self-adaptive in-situ learning in multilayer memristor neural networks," *Nature Communications*, vol. 9, no. 1, p. 2385, 2018, doi: 10.1038/s41467-018-04484-2.
- [12] A. Chen and M. R. Lin, "Variability of resistive switching memories and its impact on crossbar array performance," in *2011 International Reliability Physics Symposium*. IEEE, 2011, Monterey (United States), doi: 10.1109/IRPS.2011.5784590.
- [13] J. Kang, Z. Yu, L. Wu, Y. Fang, Z. Wang, Y. Cai, Z. Ji, J. Zhang, R. Wang, and Y. Yang, "Time-dependent variability in RRAM-based analog neuromorphic system for pattern recognition," in *International Electron Devices Meeting*. IEEE, 2017, San Francisco (United States), doi: 10.1109/IEDM.2017.8268340.
- [14] L. Xia, W. Huangfu, T. Tang, X. Yin, K. Chakrabarty, Y. Xie, Y. Wang, and H. Yang,
  "Stuck-at fault tolerance in RRAM computing systems," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 8, no. 1, pp. 102–115, 2017, doi: 10.1109/JETCAS.2017.2776980.
- [15] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E. Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, S. Williams, J. Yang, and Q. Xia, "Analogue signal and image processing with large memristor crossbars," *Nature Electronics*, vol. 1, no. 1, pp. 52–59, 2018, doi: 10.1038/s41928-017-0002-z.
- [16] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, "Mixed-precision in-memory computing," *Nature Electronics*, vol. 1, no. 4, p. 246, 2018, doi: 10.1038/s41928-018-0054-8.
- [17] M. Hu, J. P. Strachan, Z. Li, and S. R. William, "Dot-product engine as computing memory to accelerate machine learning algorithms," in *17th International Symposium on Quality Electronic Design*, 2016, Santa Clara (United States), doi: 10.1109/ISQED.2016.7479230.

- [18] Q. Xia and J. J. Yang, "Memristive crossbar arrays for brain-inspired computing," *Nature Materials*, vol. 18, no. 4, p. 309, 2019, doi: 10.1038/s41563-019-0291-x.
- [19] Y. LeCun, C. Cortes, and C. J. C. Burges, "The MNIST database of handwritten digits," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist
- [20] A. Mehonic, D. Joksas, W. H. Ng, M. Buckwell, and A. J. Kenyon, "Simulation of inference accuracy using realistic RRAM devices," *Frontiers in Neuroscience*, vol. 13, p. 593, 2019, doi: 10.3389/fnins.2019.00593.
- [21] M. P. Perrone and L. N. Cooper, "When networks disagree: Ensemble methods for hybrid
  neural networks," in Artificial Neural Networks for Speech and Vision. Chapman and Hall,
  1993, pp. 126–142.
- [22] S. Hashem and B. Schmeiser, "Improving model accuracy using optimal linear combinations of trained neural networks," *IEEE Transactions on Neural Networks*, vol. 6, no. 3, pp. 792–794, 1995, doi: 10.1109/72.377990.
- [23] B. Li, L. Xia, P. Gu, Y. Wang, and H. Yang, "Merging the interface: Power, area and accuracy co-optimization for RRAM crossbar-based mixed-signal computing system," in *Proceedings of the 52nd Annual Design Automation Conference*, 2015, San Francisco (United States), doi: 10.1145/2744769.2744870.
- [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional
   neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105,
   Lake Tahoe (United States), doi: 10.1145/3065386.
- [25] Z. Wang, C. Li, P. Lin, M. Rao, Y. Nie, W. Song, Q. Qiu, Y. Li, P. Yan, J. P. Strachan, N. Ge, N. McDonald, Q. Wu, M. Hu, H. Wu, R. S. Williams, Q. Xia, and J. J. Yang, "In situ training of feed-forward and recurrent convolutional memristor networks," *Nature Machine Intelligence*, vol. 1, no. 9, pp. 434–442, 2019, doi: 10.1038/s42256-019-0089-1.
- [26] H. Jiang, L. Han, P. Lin, Z. Wang, M. H. Jang, Q. Wu, M. Barnell, J. J. Yang, H. L. Xin, and Q. Xia, "Sub-10 nm Ta channel responsible for superior performance of a HfO<sub>2</sub> memristor," *Scientific Reports*, vol. 6, p. 28525, 2016, doi: 10.1038/srep28525.
- [27] G. W. Burr, R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. N. Kurdi, and H. Hwang, "Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element," *IEEE Transactions on Electron Devices*, vol. 62, no. 11, pp. 3498–3507, 2015, doi: 10.1109/TED.2015.2439635.
- [28] Y. Fan, L. Zhang, D. Crotti, T. Witters, M. Jurczak, and B. Govoreanu, "Direct evidence
  of the overshoot suppression in Ta<sub>2</sub>O<sub>5</sub>-based resistive switching memory with an integrated
  access resistor," *IEEE Electron Device Letters*, vol. 36, no. 10, pp. 1027–1029, 2015, doi:
  10.1109/LED.2015.2470081.
- [29] B. Govoreanu, D. Crotti, S. Subhechha, L. Zhang, Y. Chen, S. Clima, V. Paraschiv, H. Hody, C. Adelmann, M. Popovici, O. Richard, and M. Jurczak, "A-VMCO: A novel forming-free, self-rectifying, analog memory cell with low-current operation, nonfilamentary switching and excellent variability," in *Symposium on VLSI Technology*, 2015, Kyoto (Japan), doi: 10.1109/VLSIT.2015.7223717.
- [30] Z. Chai, W. Zhang, P. Freitas, F. Hatem, J. F. Zhang, J. Marsland, B. Govoreanu, L. Goux, G. S. Kar, S. Hall, P. Chalker, and J. Robertson, "The over-reset phenomenon in Ta<sub>2</sub>O<sub>5</sub> RRAM device investigated by the RTN-based defect probing technique," *IEEE Electron Device Letters*, vol. 39, no. 7, pp. 955–958, 2018, doi: 10.1109/LED.2018.2833149.
- [31] C. Sung, S. Lim, H. Kim, T. Kim, K. Moon, J. Song, J.-J. Kim, and H. Hwang, "Effect
  of conductance linearity and multi-level cell characteristics of TaO<sub>x</sub>-based synapse device on
  pattern recognition accuracy of neuromorphic system," *Nanotechnology*, vol. 29, no. 11, p.
  115203, 2018, doi: 10.1088/1361-6528/aaa733.
- [32] Y. Fang, Z. Yu, Z. Wang, T. Zhang, Y. Yang, Y. Cai, and R. Huang, "Improvement of HfO<sub>x</sub>-based RRAM device variation by inserting ALD TiN buffer layer," *IEEE Electron Device Letters*, vol. 39, no. 6, pp. 819–822, 2018, doi: 10.1109/LED.2018.2831698.
- [33] B. Govoreanu, A. Redolfi, L. Zhang, C. Adelmann, M. Popovici, S. Clima, H. Hody,
  V. Paraschiv, I. Radu, A. Franquet, J. C. Liu, J. Swerts, O. Richard, H. Bender, L. Altimime,
  and M. Jurczak, "Vacancy-modulated conductive oxide resistive RAM (VMCO-RRAM): An
  area-scalable switching current, self-compliant, highly nonlinear and wide on/off-window resistive switching cell," in *International Electron Devices Meeting*. IEEE, 2013, Washington
  (United States), doi: 10.1109/IEDM.2013.6724599.
- [34] A. J. Kenyon, M. S. Munde, W. H. Ng, M. Buckwell, D. Joksas, and A. Mehonic, "The interplay between structure and function in redox-based resistance switching," *Faraday Discussions*, vol. 213, pp. 151–163, 2019, doi: 10.1039/C8FD00118A.

- [35] W. Wu, H. Wu, B. Gao, P. Yao, X. Zhang, X. Peng, S. Yu, and H. Qian, "A methodology to improve linearity of analog RRAM for neuromorphic computing," in *Symposium on VLSI Technology*. IEEE, 2018, Honolulu (United States), doi: 10.1109/VLSIT.2018.8510690.
- [36] Z. Chai, P. Freitas, W. Zhang, F. Hatem, J. F. Zhang, J. Marsland, B. Govoreanu, L. Goux, and G. S. Kar, "Impact of RTN on pattern recognition accuracy of RRAM-based synaptic neural network," *IEEE Electron Device Letters*, vol. 39, no. 11, pp. 1652–1655, 2018, doi: 10.1109/LED.2018.2869072.



**Figure 1**. Using multiple neural networks to improve inference accuracy. **A**) The principle of EA. **B**) Using identical digital networks when implementing committees of memristive neural networks only helps to deal with the damage to the networks caused by the non-idealities. **C**) Using different digital networks when implementing committees of memristive neural networks both helps to deal with the damage to the networks caused by the non-idealities and makes it possible to combine the knowledge about the data set acquired by individual digital networks.



Figure 2. Experimental data of Ta/HfO<sub>2</sub> RRAM crossbar array of shape  $128 \times 64$ . A) Modulation of devices' conductance over 11 SET cycles, each consisting of 100 potentiating pulses. Violin plots of gradual conductance changes are shown for all Ta/HfO<sub>2</sub> devices, with dots representing median conductance after a certain number of pulses. 100 points were used for Gaussian kernel density estimation. All violin plots have their maximum widths normalised. **B-F**) Examples of devices with their conductance (in mS) **B**) spanning the full range, **C**) spanning part of the full range, **D**) exhibiting cycle-to-cycle variability, **E**) stuck at high values, **F**) stuck at low values. These diagrams show the conductance of five devices from the Ta/HfO<sub>2</sub> crossbar array over 11 SET and RESET cycles. The radial component represents the conductance, while the angular component represents the number of applied pulses. The first SET cycle starts at the top of each of the diagrams. The conductance (in blue) over 100 SET pulses is displayed in a clockwise fashion across the right half of each of the diagrams. Following that, the conductance (in orange) over 100 RESET pulses (starting at the bottom) is displayed across the left half of each of the diagrams, after which the next cycle is displayed. A Cartesian version of these plots is shown in Supplementary Figure S9.



Figure 3. Theoretical implementation of a synaptic layer of shape  $785 \times 25$  using crossbars of shape  $128 \times 64$ . A) Mapping the first subset of weights onto one of the seven crossbars used to implement the whole synaptic layer. Positive weights and negative weights are mapped onto memristors in different bit lines. B) Heatmap of average changes in output currents due to line resistance (in all seven Ta/HfO<sub>2</sub> crossbars). For this particular simulation, it was assumed that Ta/HfO<sub>2</sub> devices can be programmed perfectly.



Figure 4. Accuracy achieved by individual networks and their committees when faulty devices, D2D variability data and line resistance of Ta/HfO<sub>2</sub> crossbar are taken into account. The maximum whisker length is set to  $1.5 \times IQR$ .



Figure 5. Cumulative probability plots of RTN-induced relative current deviations for all 8 resistance states of a  $Ta_2O_5$  RRAM device. Lognormal fits are shown for each resistance state.



Figure 6. Accuracy achieved by individual networks and their committees when RTN data of a  $Ta_2O_5$  device are taken into account. Additionally, interconnect resistance of  $0.35 \Omega$  and  $0.32 \Omega$  in the word and bit lines, respectively, (from Ta/HfO<sub>2</sub> array) was used to include line resistance effects. The maximum whisker length is set to  $1.5 \times IQR$ .



**Figure 7**. Cumulative probability plots of RTN-induced relative current deviations for all 8 resistance states of an aVMCO RRAM device. Lognormal fits are shown for each resistance state.



Figure 8. Accuracy achieved by individual networks and their committees when RTN data of an aVMCO device are taken into account. Additionally, interconnect resistance of  $0.35 \Omega$  and  $0.32 \Omega$  in the word and bit lines, respectively, (from Ta/HfO<sub>2</sub> array) was used to include line resistance effects. The maximum whisker length is set to  $1.5 \times IQR$ .



Figure 9. Median accuracy achieved by individual one-hidden-layer memristor-based networks and their committees, when controlled for total number of memristors required. The networks contained 25, 50, 100 or 200 hidden neurons and were disturbed using faulty devices and D2D variability data from  $Ta/HfO_2$  crossbar.

| First author<br>(year)      | Non-ideality                  | Device type                         | Proposed solution                                                                      |
|-----------------------------|-------------------------------|-------------------------------------|----------------------------------------------------------------------------------------|
| C. Sung<br>(2018) [31]      | Current/voltage non-linearity | TaO<sub>x</sub> RRAM                | Hot-forming step is adopted                                                            |
| C. Li<br>(2018) [15]        | Current/voltage non-linearity | Ta/HfO <sub>2</sub> RRAM            | 1T1R architecture is adopted                                                           |
| Y. Fang<br>(2018) [32]      | Device-to-device variability  | HfO<sub>x</sub> RRAM                | Ultra-thin ALD-TiN<br>buffer layer is introduced                                       |
| B. Govoreanu<br>(2013) [33] | Device-to-device variability  | Al<sub>2</sub>O<sub>3</sub>/TiO<sub>2</sub> (VMCO) RRAM | Non-filamentary RRAM is adopted                                                        |
| A. J. Kenyon<br>(2019) [34] | Device-to-device variability  | SiO<sub>x</sub> RRAM                | The roughness of bottom<br>electrodes is increased                                     |
| L. Xia<br>(2017) [14]       | Faulty devices                | -                                   | A modified mapping algorithm<br>and redundancy schemes are used                        |
| S. Ambrogio<br>(2018) [7]   | Limited dynamic range         | PCM                                 | Two pairs of conductance of varying significance<br>for every synaptic weight are used |
| M. Hu<br>(2016) [17]        | Line resistance               | -                                   | Advanced mapping algorithms are used to<br>compensate for line resistance effects      |
| W. Wu<br>(2018) [35]        | Programming non-linearity     | HfO<sub>x</sub> RRAM                | Electro-thermal modulation layer is<br>deposited on the switching layer                |
| J. Woo<br>(2016) [9]        | Programming non-linearity     | HfO <sub>2</sub> RRAM               | Bilayer structure is adopted                                                           |
| S. Ambrogio<br>(2018) [7]   | Programming non-linearity     | PCM                                 | PCM devices are used together<br>with CMOS transistors                                 |
| Z. Chai<br>(2018) [36]      | Random telegraph noise        | TiO <sub>2</sub> /a-Si (aVMCO) RRAM | Non-filamentary RRAM is adopted                                                        |

**Table I**. Examples of past efforts at dealing with non-idealities of memristive devices and their systems.


### I. INTRODUCTION

Artificial neural networks (ANNs), with all of their variants, are now the main tools in machine learning tasks, such as classification. The vast amounts of data being constantly produced have enabled successful training and operation of ANNs. However, to achieve high inference accuracy, it is usually necessary for neural networks to have a large number of parameters. This results in both training [1] and inference [2] stages being time- and power-consuming. This is largely caused by the need to transfer data from memory to computing units; physical separation of memory and computing is the essence of any von Neumann system.

One of the most promising solutions to these problems is the paradigm of non-von Neumann computing and, specifically, analogue implementations of synapses (weights) in physical ANNs. Because there are many more synapses than there are neurons in ANNs, the matrix-vector multiplications, in which the synaptic weight values are used, are the costliest operations in these networks, both in terms of power and time. Computing directly in memory would minimise costly data transfers from off-chip memory, thus the most popular approach is using analogue memory devices as proxies for synaptic weights of ANNs (both fully connected and their variants [3, 4]). A common technique is to arrange such devices in a structure called a crossbar array, in which every device (or a pair of devices) is used to represent a single synaptic weight or, more generally, an entry in a matrix [5]. Memristive devices, such as phase-change memories (PCMs) [6, 7] or resistive random-access memories (RRAMs) [8, 9], have been considered as candidates for such tasks. Although here we focus on ex-situ training, such systems have been successfully utilised for in-situ training too [10, 11].

In memristive implementations of ANNs, the main concern is that various non-idealities associated with these devices can prevent these systems from achieving high accuracy [12, 13]. Examples of non-idealities affecting inference accuracy include, but are not limited to, devices not being able to electroform, devices stuck in one of the resistance states after electroforming, device-to-device (D2D) variability and random telegraph noise (RTN). When training analogue systems in-situ, limited endurance and non-linear resistance modulation also have to be taken into account. To mitigate the effects of these device non-idealities, it is often necessary to modify device structure [9], to use more advanced programming schemes [14] or to use additional circuitry [15] or high-precision processing units [16] in conjunction with memristive elements. On the system level, there is an issue of line resistance, which affects the distribution of currents and thus decreases the accuracy. These line resistance effects can be partially compensated for algorithmically [17] or partially mitigated by using multiple smaller crossbar arrays [18]. Examples of past efforts at dealing with these and other non-idealities of memristive devices and systems are listed in Table I; most of these non-idealities are still the main focus of the research in the neuromorphic community.

We propose a simple way to mitigate the effects of all types of non-idealities during inference. We suggest combining several non-ideal memristor-based neural networks into committees to achieve better accuracy. The committee machine (CM) method we propose significantly increases the inference accuracy and does not increase the computation time because memristive ANNs in such committees work in parallel.

In this work, we first explain the simulation setup: what networks were trained, how they were simulated and how they were combined into CMs. The experimental part follows. We investigate three different types of memristor technology: tantalum/hafnium oxide-based (Ta/HfO<sub>2</sub>), tantalum oxide-based (Ta<sub>2</sub>O<sub>5</sub>), and amorphous vacancy modulated conductive oxide-based (aVMCO) devices. By exploring their non-idealities relevant to inference (faulty devices, D2D variability, RTN, and line resistance), we use the experimental data to simulate memristive ANNs working individually and in committees.

### II. RESULTS

#### **A.** Simulation setup

Fully connected ANNs were trained in software to recognise handwritten digits (using the MNIST database [19]). Architectures with one hidden layer were explored. Unless stated otherwise, the simulations used networks with 25 hidden neurons. However, networks with 50, 100 and 200 hidden neurons were additionally employed to evaluate the effectiveness of the proposed method while controlling for the total number of memristors required. Following training, weights of ANNs were mapped onto pairs of conductances using a proportional mapping scheme (see [20]) to simulate memristor-based ANNs. Finally, these memristive networks were disturbed using experimental data to reflect the effect of device- and system-level non-idealities.
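As an illustration of the mapping step, the following minimal Python sketch maps a signed weight onto a pair of conductances. The exact proportional mapping scheme is defined in [20] and is not reproduced here; the rule below (the conductance above `g_off` grows linearly with the weight magnitude, while the other device of the pair stays at `g_off`) and the names `map_weight`, `recover_weight`, `g_off`, `g_on` and `w_max` are assumptions made purely for illustration.

```python
def map_weight(w, w_max, g_off, g_on):
    """Map a signed weight onto a (G_plus, G_minus) conductance pair.

    Assumed rule (illustrative, not taken from [20]): the device matching the
    sign of w gets a conductance that grows linearly with |w|, while the
    other device of the pair stays at g_off.
    """
    scale = (g_on - g_off) / w_max            # conductance per unit weight
    g_active = g_off + scale * min(abs(w), w_max)
    if w >= 0:
        return g_active, g_off
    return g_off, g_active


def recover_weight(g_plus, g_minus, w_max, g_off, g_on):
    """Inverse of map_weight: the weight encoded by a conductance pair."""
    return (g_plus - g_minus) * w_max / (g_on - g_off)


# Hypothetical conductance range (in mS) and hypothetical maximum weight.
pair = map_weight(-0.5, w_max=2.0, g_off=0.1, g_on=1.0)
weight = recover_weight(*pair, w_max=2.0, g_off=0.1, g_on=1.0)
```

In a sketch of this kind, a device-level disturbance can be modelled by altering one element of the pair before calling `recover_weight`, which yields the effective (disturbed) weight.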

After simulating physical non-idealities, the networks were combined into CMs that employed ensemble averaging (EA) [21]. The principle of EA is shown in Figure 1A—several networks are combined in parallel and then their outputs are averaged. After that, the prediction is made using the averaged vector—the prediction is the label corresponding to the largest entry in the vector.
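The averaging and prediction steps described above can be sketched in a few lines of Python. The function names and the example output vectors are hypothetical and only illustrate the principle of EA, not the simulation code used in this work.

```python
def ensemble_average(outputs):
    """Element-wise mean of the output vectors of all committee members."""
    n_networks = len(outputs)
    return [sum(column) / n_networks for column in zip(*outputs)]


def predict(outputs):
    """Label (index) of the largest entry in the averaged output vector."""
    averaged = ensemble_average(outputs)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical outputs of three networks for one input and three classes.
outputs = [
    [0.6, 0.3, 0.1],  # on its own, this network would predict label 0
    [0.2, 0.5, 0.3],
    [0.1, 0.5, 0.4],
]
averaged = ensemble_average(outputs)
label = predict(outputs)
```

In this toy example the first network on its own would choose label 0, while the committee chooses label 1 because the averaged vector has its largest entry there.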

CM methods are frequently used even with conventional ANNs. Methods such as EA often produce better accuracy than that of the best individual network in a committee [22]. Although there are other types of CMs besides EA, they often rely on training additional gating networks or boosting networks during the training stage. Using a gating network in this scenario would produce additional problems: to avoid it acting as a performance bottleneck, it too would have to be implemented on crossbar arrays. Various non-idealities would decrease the effectiveness of this gating network, which is responsible for making the decisions about the whole committee of ANNs. Likewise, we speculate that boosting of networks would not be feasible in ex-situ training because it requires information about where individual ANNs perform poorly; this cannot be known precisely until they are implemented physically on crossbar arrays and the non-idealities manifest themselves. To the best of the authors' knowledge, the application of boosting in the context of memristive neural networks seems to have been explored only once before [23]; as expected, it requires training each memristive implementation differently because non-idealities manifest themselves differently in different crossbar arrays.

There exist modifications of the EA algorithm that could potentially perform better. One example of this is the generalized ensemble method (GEM) which, instead of using equal weightings for each network during averaging (as in EA), uses a different one for each network [21]. These weightings are analytically determined by considering correlation of errors between different networks. But because [21] only considered networks with a mean square error loss function (while our networks used a cross-entropy loss function), this work does not explore GEM. Instead, we investigated whether it is possible to achieve a better performance by optimising the weightings numerically. This method, like GEM and others previously mentioned, might be impractical because, firstly, these weightings could be determined only after the ANNs are physically implemented on crossbars, and, secondly, the devices could change throughout their lifetimes, thus affecting the optimal weightings.

Even with the assumption that the devices would have perfect retention, we found that optimisation of weightings achieves effectively the same performance as EA. For these reasons, we focus only on EA in the main text, but present our results of optimising weightings in Supplementary Figure S5. We stress that we are open to the idea that other CM methods besides EA could be utilised successfully for ex-situ training in the context of memristive ANNs. However, in this work we focus on demonstrating that CMs can be used to improve the accuracy of memristor-based ANNs in general.

With EA, we find that even when the memristive ANNs that go into a committee all use the same digital weights that are mapped onto crossbar arrays (see Figure 1B), a committee of memristor-based networks can still achieve higher accuracy than just a single non-ideal network. Although all networks have the same *digital* weights before mapping, their physical implementations (which we call "disturbances" in Figures 1B, C because they can usually be represented by the modification of individual weights) will be different. For example, in one crossbar array, a certain set of devices will be faulty, while in the other crossbar array, it will be a different set. This will result in different physical implementations having slightly different learned representations of the data set, or, to paraphrase, different networks will be "damaged" differently by the non-idealities. This means that these committees will be able to combine different representations, and thus achieve higher accuracy. However, by definition, such an approach would almost certainly not yield a committee accuracy that is higher than the accuracy of a single digitally implemented network.

A better approach is to use different digital networks for different physical implementations that go into a committee (see Figure 1C). This approach much more resembles the conventional application of EA in computer science. In the context of memristive crossbar arrays, it would not only help to mitigate the effects of the non-idealities (as in the case of Figure 1B), but would also make it possible to combine the representations of digital networks that were different even before the mapping stage. Most importantly, this method allows a committee to achieve higher accuracy, which is sometimes even higher than that of individual networks with digitally implemented weights. We thus used this method in this analysis. An example comparison of these two approaches is presented in Supplementary Figure S8.

In this work, any given committee used only one network architecture but each network was initialised differently before training, thus trained networks had different sets of weights. Although it was not explored in this work, combining different network architectures in a committee of memristor-based networks might be advantageous. Furthermore, in this work we focus on fully connected ANNs but CMs could be applied to other variants of neural networks as well. Due to the simplicity of EA, it could, for example, be employed in convolutional neural networks (CNNs) [24], which are often used for image classification. This might be of interest as CNNs have been successfully implemented using crossbar arrays recently [25]. However, crossbar implementations are naturally more suited to fully connected networks, therefore we limit ourselves to this architecture but are open to exploring the effectiveness of EA with memristive CNNs in the future.

#### **B.** Ta/HfO<sub>2</sub> RRAM

With array-level data available, Ta/HfO<sub>2</sub> experiments provide the most complete picture of device- and system-level non-idealities. In this subsection, we present not only the analysis of faulty devices and D2D variability, but also careful consideration of the line resistance effects. Ta/HfO<sub>2</sub> memristors do not exhibit apparent RTN and overall have excellent retention properties [26], and thus are perfect candidates for inference application.

##### 1. Faulty devices and device-to-device variability

The most energy-efficient procedure to modulate the conductance of memristors is by the application of voltage pulses. In an ideal scenario, one would apply identical pulses and observe constant increases in conductance with each pulse. This is rarely the case in practice, but, fortunately, this type of behaviour is more relevant for in-situ training where it is necessary to ensure linear adjustment of ANN's weights [27]. In ex-situ training, conductance verification schemes can be used to program the devices precisely. Because the devices would have to be programmed only once, one can spend additional resources to do so accurately by applying SET (potentiation) and RESET (depression) pulses until a desirable conductance state is achieved.

Even with this approach, there remain two obstacles: faulty devices and D2D variability. It is observed in most memristor technologies that at least a small fraction of the devices tends to get stuck in a particular conductance state. Additionally, even if not stuck, different devices might behave differently; for example, they might have different conductance ranges. Figure 2A shows conductance changes in Ta/HfO<sub>2</sub> RRAM devices (in a $128 \times 64$ crossbar array) when voltage pulses are applied to them. We can see from the median values that overall the devices' conductance tends to increase as more SET pulses are applied. However, the wider bottom regions of the violin plots indicate that some devices are stuck around the high resistance state (HRS) and cannot set entirely no matter how many voltage pulses are applied. There also exist devices that are stuck in the low resistance state (LRS), or simply do not span the full conductance range.

Figure 2A combines data from multiple SET cycles for each of the memristors, thus it is important to understand how these devices behave individually. Figures 2B-F show conductance of 5 (out of 8,192) devices over 11 SET and RESET cycles. In the five diagrams, the radial component represents the conductance (in mS) and the angular component represents the number of applied pulses. Figure 2B shows an example of preferable (and typical) device behaviour: conductance changes in a continuous fashion and spans a wide range of conductance values, from $\sim 0.1\,\mathrm{mS}$ to $\sim 1.0\,\mathrm{mS}$. Although RESET cycles tend to feature abrupt decreases in conductance, one can always repeat a cycle and exploit the more predictable behaviour of SET cycles.

When encoding continuous numbers into crossbar devices' conductances, it is often preferable to choose a large enough conductance range. Using data from Figure 2A, one could, for example, choose the range between the first and the last median points (from $\sim 0.1\,\mathrm{mS}$ to $\sim 1.0\,\mathrm{mS}$). The device whose behaviour is presented in Figure 2B could be easily set to any conductance within that range, as we have seen before. On the other hand, the device whose behaviour is presented in Figure 2C, although operating in a predictable fashion, has a smaller conductance range. We can see that in all cycles, its conductance does not exceed 0.8 mS. This is an example of D2D variability that can make it difficult to choose an optimal operating range and set the conductance of all devices precisely.

Device, whose behaviour is presented in Figure 2D, shows high cycle-to-cycle variability. Although that could prove to be a problem in some applications, this specific device might perfectly serve its purpose in ex-situ training of ANNs. We can observe that this device spans the same conductance range as device from Figure 2B, even if in an unpredictable manner. Because all states in the full range are, in theory, achievable, one can cycle the device multiple times until it is set to the required conductance level.

Lastly, we have devices whose negative effect is most difficult to mitigate—faulty devices. Figure 2E shows behaviour of a device stuck at high conductance values, while Figure 2F shows behaviour of a device stuck at low conductance values. No matter how many pulses the devices are applied with or how many times they are cycled, they exhibit almost no conductance variation and thus, in most cases, cannot be used to encode information.

Knowing that some devices perform like the ones whose behaviour is shown in Figures 2C,E,F, it is important to minimise their negative effect. If the conductance that a device has to be set to is outside that device's range, it is sensible to set it to the closest achievable conductance. Although there is little that can be done about fully stuck memristors, it is possible to optimise the behaviour of devices like the one in Figure 2C that simply have a smaller conductance range. For example, if such a device has to be set to 0.9 mS, one would set it to the highest achievable conductance ($\sim 0.8\,\mathrm{mS}$). In the following simulations involving faulty devices and D2D variability, the operating range between the first and the last median points was used, and the devices were chosen randomly from the $128 \times 64$ crossbar and set to the most desirable states, as described in this paragraph.
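The rule of setting a device to the closest achievable conductance amounts to clamping the target to the device's own range. The sketch below is illustrative only: `program_device` is a hypothetical helper, the 0.1 mS, 0.8 mS and 1.0 mS limits are the approximate values quoted above, and the stuck-device value is made up.

```python
def program_device(target, g_min, g_max):
    """Closest conductance to `target` that the device can reach (in mS)."""
    return min(max(target, g_min), g_max)


# Hypothetical devices, each described by (lowest, highest) conductance in mS.
full_range = (0.1, 1.0)     # behaves like the device in Figure 2B
narrow_range = (0.1, 0.8)   # behaves like the device in Figure 2C
stuck_high = (0.95, 0.95)   # made-up stuck device: a single reachable state

set_full = program_device(0.9, *full_range)      # target is reachable
set_narrow = program_device(0.9, *narrow_range)  # clamped to the upper limit
set_stuck = program_device(0.3, *stuck_high)     # target has no influence
```

A stuck device is simply the limiting case in which the lower and upper limits coincide, so the programmed value no longer depends on the target.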

##### 2. Line resistance

The effect of line resistance can be extremely detrimental in many crossbar-based implementations of ANNs. That is especially the case if the crossbars used are large and the resistance of the interconnects is high (compared to memristors' resistance). Because in a neural network many of the inputs are non-zero at any given time, a lot of current accumulates in the bit lines, which results in significant voltage drops across the interconnects, and thus the current distribution across the crossbar is affected in a major way.

Although there are many possible options for how to map synaptic weights onto crossbar arrays, the choice can determine the role of line resistance. It is often the case that synaptic layers of ANNs are large in size. However, that does not mean that the weights in those layers have to be mapped onto crossbars of equivalent shape; not only is that sometimes impossible, but it can also amplify the effect of line resistance. For example, if a synaptic layer with 785 input neurons (as is the case with the first layer of our ANNs) was mapped onto a crossbar with 785 word lines, massive amounts of current would accumulate in the bit lines.

The Ta/HfO<sub>2</sub> crossbar has shape $128 \times 64$ and so this shape was chosen for all the simulations involving line resistance. Even relatively small ANNs of architecture 784(+1):25(+1):10 would need $2 \times (785 \times 25 + 26 \times 10) = 39{,}770$ memristors to be implemented. Even if not all the inputs were used at any given time, it would not be possible to fit all the memristors onto a single crossbar of shape $128 \times 64$. To overcome this, we decided to simulate multiple crossbars, each of which would implement a subset of the synaptic weights, but, for a given synaptic layer, would all compute in parallel. Because $\lceil 785/128 \rceil = 7$, seven crossbars were used to implement the first synaptic layer; the first crossbar utilized bottom 113 word lines, while the other six crossbars used bottom 112 word lines because $113 + 6 \times 112 = 785$. The second synaptic layer was implemented using an eighth crossbar utilizing its bottom 26 word lines.
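The arithmetic in this paragraph can be checked with a short Python sketch. The helper names are hypothetical, and the even-split rule in `split_word_lines` is an assumption: it is one simple way to arrive at a 113 + 6 × 112 partition and is not necessarily how the partition was chosen in the simulations.

```python
import math


def memristor_count(n_inputs, n_hidden, n_outputs):
    """Memristors for a one-hidden-layer network: one bias unit per synaptic
    layer and two memristors (a conductance pair) per weight."""
    first_layer = (n_inputs + 1) * n_hidden
    second_layer = (n_hidden + 1) * n_outputs
    return 2 * (first_layer + second_layer)


def split_word_lines(n_rows, crossbar_rows):
    """Spread n_rows inputs as evenly as possible over the smallest number
    of crossbars that have crossbar_rows word lines each."""
    n_crossbars = math.ceil(n_rows / crossbar_rows)
    base, extra = divmod(n_rows, n_crossbars)
    return [base + 1] * extra + [base] * (n_crossbars - extra)


total = memristor_count(784, 25, 10)
rows_per_crossbar = split_word_lines(785, 128)
```

With these inputs `total` evaluates to 39,770 and `rows_per_crossbar` contains one entry of 113 followed by six entries of 112.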

Figure 3A shows an example of how the first synaptic layer of a 784(+1):25(+1):10 neural network could be implemented. Specifically, it shows how the first subset of weights would be implemented using one of the crossbars. Because we use the proportional mapping scheme, positive and negative weights would be implemented in different bit lines. In Figure 3A, memristors designated to implement positive weights are coloured in blue, memristors designated to implement negative weights are coloured in orange and unelectroformed memristors are coloured in black. Because simulations were constrained by experimental data, some of the devices were left unused and assumed to be unelectroformed. In practice, the crossbars could be manufactured to fit the geometry of the ANNs.

In each synaptic layer, the corresponding output currents from each of the crossbars would be added together. Additionally, output currents at the bit lines implementing negative weights would be subtracted from the output currents at the neighbouring bit lines (to their left) implementing positive weights. For example, in the example configuration of Figure 3A, output current at the 2<sup>nd</sup> bit line would be subtracted from the output current at the 1<sup>st</sup> bit line.
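A minimal sketch of this combination step is given below. It assumes that bit lines alternate between positive and negative weights, with each negative bit line immediately to the right of its positive partner; this layout, the function name `layer_outputs` and the example currents are assumptions used only for illustration.

```python
def layer_outputs(crossbar_currents):
    """Combine bit-line currents of all crossbars in one synaptic layer.

    crossbar_currents holds one list of bit-line currents per crossbar.
    Bit lines are assumed to alternate: positive, negative, positive, ...
    """
    # Add the corresponding bit-line currents from every crossbar.
    totals = [sum(column) for column in zip(*crossbar_currents)]
    # Subtract each negative bit line from the positive one to its left.
    return [totals[i] - totals[i + 1] for i in range(0, len(totals), 2)]


# Hypothetical currents (arbitrary units): two crossbars, four bit lines each.
currents = [
    [5, 2, 1, 3],
    [1, 1, 2, 0],
]
outputs = layer_outputs(currents)
```

Here the summed bit-line currents are 6, 3, 3 and 3, so the two differential outputs are 3 and 0.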

Unfortunately, even when using multiple smaller crossbars, the interconnects can significantly disturb current distribution in the crossbar. Average output current decreases due to line resistance in all seven crossbars of Ta/HfO<sub>2</sub> devices (whose resistance ranges from $\sim 1\,\mathrm{k\Omega}$ to $\sim 11\,\mathrm{k\Omega}$, and whose interconnect resistance is $0.3\,\Omega$ and $0.32\,\Omega$ in the word and bit lines, respectively) are shown in the heatmap in Figure 3B. We can see that the current decreases can range from $\sim 12\%$ at the outputs nearest to the applied voltages to $\sim 16\%$ at the outputs in the rightmost bit lines that are used. Such large current decreases often result from large input voltages that are applied at the top part of the crossbar, far away from the outputs. Such inputs generate large amounts of current that flow through large portions of the bit lines and, with voltage drops across interconnects, disturb the overall current distribution in a major way.

In the supplementary information, we provide a possible strategy of mitigating line resistance effects in supervised learning. This scheme was not employed in the simulations described in the main text because we wanted to find out how well the CM method would deal with noticeable line resistance effects.

##### 3. Inference accuracy

Figure 4 shows the accuracy of individual networks, as well as of their committees; memristive ANNs were simulated by taking into account three non-idealities of the Ta/HfO<sub>2</sub> crossbar explored earlier: faulty devices, D2D variability and line resistance. As indicated by the yellow box plot in Figure 4, individual networks implemented digitally achieve ~95.9% median accuracy. Networks disturbed to reflect the effect of non-idealities achieve ~91.0% median accuracy, as indicated by the vermilion box plot. Although that is a substantial drop in accuracy, we see that the accuracy increases as more networks are added to the committee. When 5 networks are used in a committee, median accuracy increases up to ~95.7%, as indicated by the rightmost green box plot.

#### **C.** Ta<sub>2</sub>O<sub>5</sub> RRAM

In order to explore the effectiveness of minimising adverse effects of RTN, we use another memristor technology based on  $Ta_2O_5$ . To investigate RTN, measurements from a single device were considered. To simulate line resistance effects, interconnect resistance from Ta/HfO<sub>2</sub> was used and the same crossbar shape was assumed.

##### 1. Random telegraph noise

Memristors often suffer from RTN resulting in a different accuracy at any given instant in time.  $Ta_2O_5$  device was characterised by measuring the current of 8 resistance states multiple times. Figure 5 shows the cumulative probability plots for those resistance states, together with lognormal fits modelling the nature of RTN. One of the things that the figure reveals is that higher resistance states suffer from higher degree of RTN. Fits for every resistance state, together with occurrence rates (see Supplementary Table SII), were used to disturb the weights of ANNs in order to reproduce the effect of RTN.
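One way such a disturbance could be simulated is sketched below. The actual fit parameters and occurrence rates are given in Supplementary Table SII and are not reproduced here; the function `apply_rtn`, the numerical values and the choice to model RTN as a relative increase in resistance are all assumptions made for illustration.

```python
import random


def apply_rtn(resistance, occurrence_rate, mu, sigma, rng):
    """Return the resistance seen at one instant in time.

    With probability occurrence_rate the device is in its noisy state and
    the resistance deviates by a relative magnitude drawn from a lognormal
    distribution with parameters mu and sigma (hypothetical values).
    Treating the deviation as an increase is an assumption of this sketch.
    """
    if rng.random() < occurrence_rate:
        relative_magnitude = rng.lognormvariate(mu, sigma)
        return resistance * (1.0 + relative_magnitude)
    return resistance


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
# Hypothetical state: 100 kOhm, RTN present 20% of the time.
samples = [apply_rtn(100e3, 0.2, mu=-2.0, sigma=0.5, rng=rng) for _ in range(1000)]
```

Sampling every device in this way at each inference step gives a different set of effective weights, and therefore a different accuracy, at any given instant.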

##### 2. Inference accuracy

The results combining RTN and line resistance effects for the Ta<sub>2</sub>O<sub>5</sub> device are shown in Figure 6. From the difference in median accuracy between yellow and blue box plots, we can notice that there is a significant drop in accuracy simply due to mapping of weights onto conductances. That is not surprising given that only 8 states were available for mapping. One can also observe that the further drop in median accuracy due to non-idealities is not as severe: it drops to $\sim 94.1\%$. The RTN disturbance magnitude is limited to <100% in most cases, which possibly contributes to its smaller effect on accuracy. Additionally, the Ta<sub>2</sub>O<sub>5</sub> device has much higher resistance (ranging from $25\,\mathrm{k\Omega}$ to $200\,\mathrm{k\Omega}$), thus line resistance is also less of a concern. When non-ideal networks are combined into committees of 5, the median accuracy jumps to $\sim 96.5\%$, which is even higher than the software baseline of individual networks. This reveals an additional trend seen in all the simulations performed: the higher the accuracy of the individual non-ideal memristive networks, the higher the accuracy of the committees that they are part of.

#### **D.** aVMCO RRAM

Further, we consider a third memristor technology, one based on aVMCO materials. We test the effects of RTN by considering measurements from a single device. Line resistance effects were simulated by using the interconnect resistance and shape of the Ta/HfO<sub>2</sub> crossbar array.

##### 1. Random telegraph noise

Figure 7 shows the cumulative probability plots for 8 resistance states of an aVMCO device suffering from RTN. Like in  $Ta_2O_5$ , we observe that higher resistance states experience RTN of higher magnitude. However, compared to  $Ta_2O_5$ , the RTN magnitude is much more predictable. Fits for each of the 8 resistance states, together with occurrence rates (see Supplementary Table SIII), were used to simulate the effect of RTN in aVMCO-based neural networks.

##### 2. Inference accuracy

The results combining RTN and line resistance are shown in Figure 8. As with Ta<sub>2</sub>O<sub>5</sub>, we see a large drop due to mapping onto conductances, a consequence of very few states being available for mapping. More interestingly, the accuracy of individual memristor-based networks with and without non-idealities is almost identical. That is because the occurrence rate of RTN in the aVMCO device is small and there is a much smaller probability of RTN having large magnitude. Additionally, the resistance of the aVMCO device is even higher than that of the Ta<sub>2</sub>O<sub>5</sub> device: it ranges from $1\,\mathrm{M\Omega}$ to $7.5\,\mathrm{M\Omega}$. Therefore, line resistance has an even smaller effect in a hypothetical array of aVMCO devices. Due to the median accuracy of individual non-ideal memristor-based networks being higher (~94.6%), the median accuracy of committees is higher too; in committees of size 5 it increases to ~96.7%.

# III. DISCUSSION

The results from the previous section suggest that the method of using committee machines to improve the accuracy of memristive neural networks is technology- and non-ideality-agnostic. CMs can mitigate the effects of faulty devices, D2D variability, RTN and line resistance in combination with each other. Although line resistance is more difficult to deal with using committees due to the similar way in which all crossbars of different networks get affected, using random reordering can increase the effectiveness of ensembles of non-ideal memristive networks. Although the CM method is slightly less effective with large line resistance (see the discussion in the Supplementary Information), in all cases we observe that the accuracy of individual non-ideal networks largely determines the accuracy of committees. That is consequential because it means that, although committees always increase the accuracy, there is still an incentive to optimise the devices and systems that implement these networks: the higher the accuracy of individual networks, the higher the accuracy of the committees.

It is also important to consider whether using larger networks, instead of committees of smaller networks, would yield the same results if the same number of synapses (or memristors) was used in the large network as in the committee of smaller networks. In our previous work we found that the accuracy of networks before disturbance (which we call "starting accuracy") has a huge effect on the robustness to non-idealities: the larger the starting accuracy, the more robust the networks become [20]. One way to achieve higher starting accuracy is to have larger networks, e.g. if we have a network with one hidden layer, we might increase the number of neurons in that hidden layer, which would likely result in higher accuracy after training and thus higher robustness.

Figure 9 shows a comparison of CMs of memristor-based networks disturbed using faulty devices and D2D variability data from the $Ta/HfO_2$ crossbar, when controlled for the total number of memristors required to implement them (line resistance was not taken into account due to the long time required to simulate it in large networks). We can observe that committees of two networks, each with 25 hidden neurons (leftmost data point of the orange curve), achieve $\sim 0.9\%$ higher median accuracy than individual networks with 50 hidden neurons (second data point from the left in the vermilion curve), despite both requiring an almost identical total number of memristors. Committees of two networks, each with 100 hidden neurons (third data point from the left in the orange curve), achieve $\sim 1.1\%$ higher median accuracy than individual networks with 200 hidden neurons (rightmost data point in the vermilion curve), even though both require almost the same total number of memristors. An even larger improvement is gained when committees of four networks, each with 50 hidden neurons (second data point from the left in the blue curve), are used instead; the accuracy is then improved by $\sim 1.5\%$, with almost exactly the same total number of memristors used.

For different non-idealities and even different training schemes of the ANNs, the equivalents of Figure 9 might be different, but there are a few common characteristics in all of them. In all cases, for a given total number of memristors used, there is an optimal number of networks that should be used in a committee. Additionally, we observe that the more severe a non-ideality is, the more apparent the effectiveness of committees becomes. Finally, sometimes the committees (for a fixed total number of memristors) might achieve lower accuracy than individual networks, but only if the networks that they replace are very small and the non-ideality is not very detrimental. If the networks that are being replaced with committees of smaller networks are sufficiently large, the committees will achieve higher accuracy. An example of that is shown in Supplementary Figure S7, where the aVMCO device is minimally affected by the non-idealities and so the advantage of committees becomes apparent only when replacing larger networks.

The reason why committees work in the context of non-ideal implementations, and why they work best when they are used to replace large networks, might to some extent lie in their training. When it comes to training fully connected networks, their accuracy tends to saturate as more parameters are added. Supplementary Figure S4 shows that networks with 50 hidden neurons can be trained to achieve significantly higher accuracy than networks with 25 hidden neurons. However, networks with 200 hidden neurons achieve only slightly higher accuracy than networks with 100 hidden neurons. This also means that networks with 200 hidden neurons will be only slightly more robust to non-idealities than networks with 100 hidden neurons. When such networks are affected by non-idealities, their accuracy drops to similar values, but the smaller network can work in a committee with other networks, totalling almost the same number of memristors as the large network, but achieving higher accuracy overall. This is the most likely reason why the committees of smaller networks are effective at dealing with non-idealities, especially when replacing large networks.

In addition to the accuracy improvements, committees can provide flexibility in memristive implementations of neural networks. Digital implementations of ANNs have very predictable behaviour due to the precision of digital logic. Analogue implementations, on the other hand, can vary greatly even if they use the same weights before the mapping onto conductances; that is a result of the stochastic nature of the memristors that implement these ANNs. The parallel and modular nature of committee machines makes memristive systems much more flexible. For example, if the verification accuracy of one of the ANNs in a memristor-based CM deteriorates below acceptable levels, its outputs could be disabled to ensure higher accuracy of the rest of the committee.

Importantly, this introduced parallelism comes at almost no extra cost. For a fixed total number of memristors, a committee of smaller networks, compared to a large individual network, would only require a few additional output and bias neurons, and an averaging functionality, which could potentially be implemented in hardware. For example, an ANN with 50 hidden neurons would require 846 neurons in total, while a committee of two ANNs, each with 25 hidden neurons (and thus requiring almost the same total number of memristors), would require 857 neurons in total.
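The neuron totals quoted above can be reproduced with a short calculation. The sketch below assumes (this is not stated explicitly in the text) that the 784 input neurons and the input bias neuron are shared by all committee members, while each member has its own hidden neurons, hidden-layer bias neuron and 10 output neurons; under this assumption the totals of 846 and 857 are recovered. The function names are illustrative.

```python
N_INPUTS = 784   # MNIST pixels
N_OUTPUTS = 10   # digit classes


def count_neurons(hidden, n_networks=1):
    """Total neurons, assuming inputs and the input bias are shared."""
    shared = N_INPUTS + 1                     # inputs + input bias
    per_network = hidden + 1 + N_OUTPUTS      # hidden + hidden bias + outputs
    return shared + n_networks * per_network


def count_memristors(hidden, n_networks=1):
    """Total memristors, with every weight mapped onto a pair of devices."""
    weights = (N_INPUTS + 1) * hidden + (hidden + 1) * N_OUTPUTS
    return 2 * weights * n_networks


print(count_neurons(50), count_memristors(50))        # single network, 50 hidden neurons
print(count_neurons(25, 2), count_memristors(25, 2))  # committee of two, 25 hidden neurons each
```

With these assumptions, the single network needs 79,520 memristors and the committee 79,540, which is consistent with the statement that both require almost the same total number of memristors.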

In summary, our simulations employing experimental data from three different types of memristive devices show that committee machines employing ensemble averaging can be used to mitigate the effects of device- and system-level non-idealities in memristor-based neural networks. EA makes it possible to achieve higher inference accuracy in physically implemented neural networks that suffer from faulty devices, device-to-device variability, random telegraph noise, and even line resistance. This method is a universal way to deal with the most common non-idealities and is straightforward to implement during the fabrication stage. Increased modularity of these memristive neural network systems will increase not only their inference accuracy, but also their robustness and flexibility, even without the need to sacrifice area. Although some level of non-idealities in memristors is unavoidable, the CM method allows us to deal with these on the system level and is agnostic to a particular technology or, to some degree, the type of non-ideality.

### METHODS

### Experiments

The Ta/HfO<sub>2</sub> RRAM 1T1R array consists of NMOS transistors fabricated in a commercial fab (feature size of $2 \,\mu m$) and Pt/HfO<sub>2</sub>/Ta devices. The bottom electrode was deposited by evaporation of a 20 nm Pt layer on top of a 2 nm tantalum (Ta) adhesive layer; the electrode was patterned by photolithography and a lift-off process. A $5 \text{ nm HfO}_2$ switching layer was deposited by atomic layer deposition using water and tetrakis(dimethylamido)hafnium as precursors at 250 °C. Sputter-deposited Ta of 50 nm thickness followed by 10 nm Pd was used in a lift-off process to serve as the top electrode. The filamentary $Ta_2O_5$ device consists of a TiN/4 nm stoichiometric $Ta_2O_5$/20 nm nonstoichiometric $TaO_x$/10 nm TaN/TiN stack with a cross-sectional area of $75 \,\mathrm{nm} \times 75 \,\mathrm{nm}$, while the non-filamentary aVMCO device has a cross-sectional area of $135 \,\mathrm{nm} \times 135 \,\mathrm{nm}$ and is composed of a TiN/8 nm amorphous-Si/8 nm anatase $TiO_2$/TiN stack. The $Ta_2O_5$ and aVMCO devices were fabricated by imec. The detailed fabrication process parameters can be found in references [11, 28, 29] for Ta/HfO<sub>2</sub>, Ta<sub>2</sub>O<sub>5</sub> and aVMCO RRAMs, respectively.

The conductance of Ta/HfO<sub>2</sub> devices was modulated by applying SET pulses (500 µs @ 2.5 V and gate voltage increasing from 0.6 V to 1.6 V). After each of the 11 cycles, RESET pulses were applied (5 µs @ 0.9 V increasing to 2.2 V and gate voltage of 5 V). The voltage was increased linearly throughout the 100 pulses. All electrical tests for $Ta_2O_5$ and aVMCO devices were done with a Keysight B1500A. The RTN data were extracted by switching the device into 8 uniformly distributed resistance levels between $25 \,\mathrm{k\Omega}$ and $200 \,\mathrm{k\Omega}$ for $Ta_2O_5$, and 8 nearly uniformly distributed resistance levels between $1 \,\mathrm{M\Omega}$ and $7.5 \,\mathrm{M\Omega}$ for aVMCO, with incremental RESET DC sweeps [30]. RTN measurement was then carried out at each resistance level at a 0.1 V and 3 V read-out for $Ta_2O_5$ and aVMCO, respectively, with a sampling time of 2 ms/point and 10,000 sampling points per resistance level, for an RTN measurement period of 20 s.

### Simulations

In this work, feed-forward ANNs with fully connected layers and continuous weights were trained to recognise handwritten digits using the MNIST database. All 60,000 MNIST training images were used during the training stage; the training set consisted of 50,000 images and the verification set consisted of 10,000 images. All 10,000 test images were used to evaluate the inference accuracy of ANNs. Networks used 784 input neurons representing pixel intensities of MNIST images of $28 \times 28$ pixel size, as well as one bias neuron. 10 output neurons were used; they represented the ANNs' predictions of 10 handwritten digits. Hidden layers used the sigmoid activation function, while the output layer used the softmax activation function. Weights were optimised by minimising the cross-entropy error function using stochastic gradient descent. A learning rate of 0.01 and patience of 25 epochs were used. 25 networks were trained for each architecture explored by initialising them differently. When numerically optimising ANNs' weightings, optimisation was performed by employing the verification set, while the performance was evaluated using the test set. The code was implemented in Python.

Weights were mapped onto pairs of memristors' conductances using a proportional mapping scheme: synaptic weights were made proportional to one of the conductances in the pair, while the other was left unelectroformed. The zero weight was interpreted as given; in practice, it would be implemented by not electroforming the device, thus resulting in its negligible conductance. Although aVMCO devices do not have an electroforming stage, for consistency we assumed that additional insulating circuit elements could be used to implement the zero weight. Negative weights would be implemented by placing certain memristors in dedicated bit lines of the crossbars whose outputs would be subtracted from the outputs at the corresponding bit lines implementing positive weights. Maximum weights after mapping were optimised separately for each set of network architecture and conductance levels; in each case this was done by excluding a certain proportion, $p_{\rm L}$, of weights with the largest absolute values. The $p_{\rm L}$ values used for each simulation are summarised in Supplementary Table SI. More details on the mapping procedure can be found in our past work [20].
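A minimal sketch of such a proportional mapping is given below. It is a simplification under stated assumptions rather than the procedure of [20]: the function name and the `g_max` parameter are hypothetical, conductances are treated as continuous between zero and `g_max` (the discrete conductance levels of real devices are ignored), and the maximum weight is taken as the $(1 - p_{\rm L})$ quantile of the absolute weights, with larger weights clipped to it.

```python
import numpy as np


def map_weights_to_conductances(weights, g_max, p_l):
    """Map signed weights onto (positive, negative) conductance pairs.

    weights : array of synaptic weights (assumed not all zero)
    g_max   : largest conductance used for mapping (hypothetical parameter)
    p_l     : proportion of weights with the largest magnitudes to exclude
    """
    w = np.asarray(weights, dtype=float)
    # Maximum weight after excluding the proportion p_l of largest magnitudes.
    w_max = np.quantile(np.abs(w), 1.0 - p_l)
    # Excluded weights are clipped to the maximum mapped weight (assumption).
    w_clipped = np.clip(w, -w_max, w_max)
    scale = g_max / w_max  # proportionality constant between G and |w|
    # One device of each pair carries the weight; the other stays at zero,
    # standing in for an unelectroformed device with negligible conductance.
    g_pos = np.where(w_clipped > 0, w_clipped * scale, 0.0)
    g_neg = np.where(w_clipped < 0, -w_clipped * scale, 0.0)
    return g_pos, g_neg, scale


weights = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 4.0])
g_pos, g_neg, scale = map_weights_to_conductances(weights, g_max=1e-3, p_l=0.2)
effective_weights = (g_pos - g_neg) / scale  # what the crossbar pair represents
```

In this toy example the weight of 4.0 is the one excluded by $p_{\rm L} = 0.2$, so it is represented by the same conductance as a weight of 2.0.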

All non-idealities, except for line resistance, were simulated by disturbing the individual conductances of memristor-based ANNs. To investigate line resistance, nodal analysis was employed. Simultaneous linear equations were set up using Ohm's law and Kirchhoff's current law, and were solved in sparse matrix representation using Python's SciPy library.
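The idea behind this kind of analysis can be illustrated on a small crossbar. The sketch below is not the simulation code used in this work: it uses a dense NumPy solver instead of a sparse representation for brevity, and it assumes a particular circuit layout (inputs applied at the left end of each word line, bit lines held at ground at the bottom, one wire segment per crosspoint). The function name and the example values are illustrative.

```python
import numpy as np


def crossbar_output_currents(G, v_in, r_word, r_bit):
    """Bit-line output currents of an m x n crossbar with line resistance."""
    G = np.asarray(G, dtype=float)
    v_in = np.asarray(v_in, dtype=float)
    m, n = G.shape
    g_w, g_b = 1.0 / r_word, 1.0 / r_bit
    A = np.zeros((2 * m * n, 2 * m * n))  # nodal conductance matrix
    b = np.zeros(2 * m * n)               # current injections from the sources

    def word(i, j):   # index of the word-line node at crosspoint (i, j)
        return i * n + j

    def bit(i, j):    # index of the bit-line node at crosspoint (i, j)
        return m * n + i * n + j

    def connect(p, q, g):
        # Kirchhoff's current law contribution of a conductance g between p and q.
        A[p, p] += g
        A[q, q] += g
        A[p, q] -= g
        A[q, p] -= g

    for i in range(m):
        for j in range(n):
            connect(word(i, j), bit(i, j), G[i, j])          # memristor
            if j + 1 < n:
                connect(word(i, j), word(i, j + 1), g_w)     # word-line segment
            if i + 1 < m:
                connect(bit(i, j), bit(i + 1, j), g_b)       # bit-line segment
        # Voltage source reaches the first node through one word-line segment.
        A[word(i, 0), word(i, 0)] += g_w
        b[word(i, 0)] += g_w * v_in[i]
    for j in range(n):
        # Last bit-line node reaches ground through one bit-line segment.
        A[bit(m - 1, j), bit(m - 1, j)] += g_b

    v = np.linalg.solve(A, b)
    # Output current is the current through the final segment to ground.
    return np.array([v[bit(m - 1, j)] * g_b for j in range(n)])


G = np.array([[1e-3, 5e-4, 2e-4],
              [4e-4, 1e-3, 6e-4]])   # illustrative conductances in S
v_in = np.array([0.2, 0.5])          # illustrative input voltages in V
ideal = v_in @ G                     # currents without line resistance
lossy = crossbar_output_currents(G, v_in, r_word=0.35, r_bit=0.32)
```

For realistic array sizes the matrix has only a few non-zero entries per row, which is why a sparse representation becomes necessary.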

After simulating memristor non-idealities, committees of different ANNs were composed. Committees used EA, i.e. the outputs of individual networks in a committee were averaged to produce a single output vector. In EA, the output vectors of individual networks can simply be added together (if the weightings of different networks are the same, as we assume in the main text); the label corresponding to the entry with the highest value would be the prediction of the committee. This addition can be performed either in software or, if the activation function of the last neuronal layer can be implemented physically, by adding the corresponding currents produced by the circuitry of this activation function.
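In software, EA with equal weightings reduces to a sum followed by an argmax. The following sketch uses a hypothetical function name and toy output vectors chosen purely for illustration.

```python
import numpy as np


def committee_predict(outputs):
    """Ensemble averaging with equal weightings.

    outputs : array of shape (n_networks, n_samples, n_classes) holding the
              output vectors of every network in the committee.
    Returns the predicted label for each sample.
    """
    outputs = np.asarray(outputs, dtype=float)
    summed = outputs.sum(axis=0)   # adding is equivalent to averaging for argmax
    return summed.argmax(axis=1)


# Three networks, two samples, three classes (toy values).
outputs = np.array([
    [[0.6, 0.4, 0.0], [0.2, 0.5, 0.3]],
    [[0.3, 0.6, 0.1], [0.1, 0.2, 0.7]],
    [[0.5, 0.3, 0.2], [0.3, 0.3, 0.4]],
])
individual = outputs.argmax(axis=2)        # predictions of each network
predictions = committee_predict(outputs)   # predictions of the committee
```

In this toy example the networks disagree on both samples, and the committee sides with the labels that have the largest summed outputs.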

In the simulations, the neural networks that go into a committee were chosen randomly. This was done to reflect the most convenient strategy when manufacturing such systems: because one does not need to selectively choose the networks, manufactured crossbars can be easily programmed without the need to replace them if they perform poorly when working individually (unless their effect is so detrimental that they have to be ignored, which this technique makes possible). Besides, devices might change over time, thus these simulations, which show what happens when one does not selectively choose the networks, are valuable for investigating conditions where it is not possible to replace the networks.

In the simulations, 25 base networks were used (each having a different set of weights) for each of the architectures. Then all of their weights were mapped onto pairs of conductances using HRS/LRS values extracted from experiments. Finally, to reflect the effect of each of the non-idealities, all networks were disturbed multiple times. In each disturbance iteration, multiple combinations of networks were chosen and their performance as a committee of a certain size was evaluated. In total, for most simulations, 10,000 data points were recorded for a committee of every size; these data captured the variations of base networks, their combinations and different disturbance iterations. Only simulations involving line resistance or numerical optimisation of weights had fewer data points for some committee sizes (due to long simulation times).
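The random composition of committees within one disturbance iteration can be sketched as follows. The sketch is illustrative only: the function name is hypothetical, and the "disturbed network outputs" are synthetic (one-hot labels plus Gaussian noise) rather than outputs of simulated memristive networks.

```python
import numpy as np


def sample_committee_accuracies(outputs, labels, size, n_committees, rng):
    """Accuracy of randomly composed committees of a given size.

    outputs : array (n_networks, n_samples, n_classes) of disturbed outputs
    labels  : array (n_samples,) of true labels
    """
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels)
    accuracies = np.empty(n_committees)
    for k in range(n_committees):
        # Choose committee members at random, without repeating a network.
        members = rng.choice(outputs.shape[0], size=size, replace=False)
        predictions = outputs[members].sum(axis=0).argmax(axis=1)
        accuracies[k] = np.mean(predictions == labels)
    return accuracies


rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=200)
# Synthetic stand-in for the outputs of 25 disturbed base networks.
noisy_outputs = np.eye(10)[labels] + rng.normal(scale=0.8, size=(25, 200, 10))
single = sample_committee_accuracies(noisy_outputs, labels, 1, 100, rng)
committee = sample_committee_accuracies(noisy_outputs, labels, 5, 100, rng)
print(np.median(single), np.median(committee))
```

Repeating this over several disturbance iterations and committee sizes gives the kind of accuracy distributions that are summarised by box plots.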

### DATA AVAILABILITY

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## AUTHOR CONTRIBUTIONS

A.M. and D.J. conceived the idea and designed the study. A.M., P.F. and Z.C. performed the experimental measurements. D.J. performed the simulations and analysed the experimental and simulation results. C.L. and Q.X. provided the experimental data of the programming of a Ta/HfO<sub>2</sub> 1T1R RRAM array. A.M., W.D.Z. and A.J.K. supervised the research. D.J. wrote the initial manuscript. All authors contributed to the discussions of the results and improved the text.

## COMPETING INTERESTS STATEMENT

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

### FUNDING

A.M. acknowledges funding from the Royal Academy of Engineering under the Research Fellowship scheme, A.J.K. acknowledges funding from the Engineering and Physical Sciences Research Council (EP/P013503/1) and the Leverhulme Trust (RPG-2016-135), W.D.Z. acknowledges funding from the Engineering and Physical Sciences Research Council (EP/S000259/1).

- [1] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for deep learning
   in NLP," arXiv preprint arXiv:1906.02243, 2019.
- [2] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with
  pruning, trained quantization and huffman coding," in *International Conference on Learning Representations*, 2016, San Juan (Puerto Rico), arXiv preprint arXiv:1510.00149.
- [3] C. Li, Z. Wang, M. Rao, D. Belkin, W. Song, H. Jiang, P. Yan, Y. Li, P. Lin, M. Hu, N. Ge,
  J. P. Strachan, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Long short-term
  memory networks in memristor crossbar arrays," *Nature Machine Intelligence*, vol. 1, no. 1,
  pp. 49–57, 2019, doi: 10.1038/s42256-018-0001-4.
- [4] Z. Wang, C. Li, W. Song, M. Rao, D. Belkin, Y. Li, P. Yan, H. Jiang, P. Lin, M. Hu, J. P.
  Strachan, N. Ge, M. Barnell, Q. Wu, A. G. Barto, Q. Qiu, R. S. Williams, Q. Xia, and J. J.
  Yang, "Reinforcement learning with analogue memristor arrays," *Nature Electronics*, vol. 2, no. 3, p. 115, 2019, doi: 10.1038/s41928-019-0221-6.
- [5] Z. Sun, G. Pedretti, E. Ambrosi, A. Bricalli, W. Wang, and D. Ielmini, "Solving matrix
  equations in one step with cross-point resistive arrays," *Proceedings of the National Academy of Sciences*, vol. 116, no. 10, pp. 4123–4128, 2019, doi: 10.1073/pnas.1815682116.
- [6] S. R. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou,
  "A phase-change memory model for neuromorphic computing," *Journal of Applied Physics*,
  vol. 124, no. 15, p. 152135, 2018, doi: 10.1063/1.5042408.
- [7] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. D. Nolfo, S. Sidler, M. Giordano, M. Bodini, N. C. P. Farinha, B. Killeen, C. Cheng, Y. Jaoudi, and G. W. Burr,
  "Equivalent-accuracy accelerated neural-network training using analogue memory," *Nature*,
  vol. 558, no. 7708, pp. 60–67, 2018, doi: 10.1038/s41586-018-0180-5.
- [8] S. Yu, Z. Li, P. Y. Chen, H. Wu, B. Gao, D. Wang, W. Wu, and H. Qian, "Binary neural
  network with 16 Mb RRAM macro chip for classification and online training," in *International Electron Devices Meeting*. IEEE, 2016, San Francisco (United States), doi: 10.1109/IEDM.2016.7838429.
- [9] J. Woo, K. Moon, J. Song, S. Lee, M. Kwak, J. Park, and H. Hwang, "Improved synaptic behavior under identical pulses using $AlO_x/HfO_2$ bilayer RRAM array for neuromorphic
  systems," *IEEE Electron Device Letters*, vol. 37, no. 8, pp. 994–997, 2016, doi:
  10.1109/LED.2016.2582859.
- [10] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B.
  Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," *Nature*, vol. 521, no. 7550, pp. 61–64, 2015, doi: 10.1038/nature14441.
- [11] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery, P. Lin, Z. Wang,
  W. Song, J. P. Strachan, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, "Efficient and self-adaptive in-situ learning in multilayer memristor neural networks," *Nature communications*, vol. 9, no. 1, p. 2385, 2018, doi: 10.1038/s41467-018-04484-2.
- [12] A. Chen and M. R. Lin, "Variability of resistive switching memories and its impact on crossbar
  array performance," in *2011 International Reliability Physics Symposium*. IEEE, 2011,
  Monterey (United States), doi: 10.1109/IRPS.2011.5784590.

- [13] J. Kang, Z. Yu, L. Wu, Y. Fang, Z. Wang, Y. Cai, Z. Ji, J. Zhang, R. Wang, and Y. Yang,
  "Time-dependent variability in RRAM-based analog neuromorphic system for pattern recognition," in *International Electron Devices Meeting*. IEEE, 2017, San Francisco (United States),
  doi: 10.1109/IEDM.2017.8268340.
- [14] L. Xia, W. Huangfu, T. Tang, X. Yin, K. Chakrabarty, Y. Xie, Y. Wang, and H. Yang,
  "Stuck-at fault tolerance in RRAM computing systems," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 8, no. 1, pp. 102–115, 2017, doi: 10.1109/JETCAS.2017.2776980.
- [15] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E.
  Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, S. Williams, J. Yang,
  and Q. Xia, "Analogue signal and image processing with large memristor crossbars," *Nature Electronics*, vol. 1, no. 1, pp. 52–59, 2018, doi: 10.1038/s41928-017-0002-z.
- [16] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni,
  and E. Eleftheriou, "Mixed-precision in-memory computing," *Nature Electronics*, vol. 1, no. 4,
  p. 246, 2018, doi: 10.1038/s41928-018-0054-8.
- [17] M. Hu, J. P. Strachan, Z. Li, and S. R. William, "Dot-product engine as computing memory
  to accelerate machine learning algorithms," in *17th International Symposium on Quality
  Electronic Design*, 2016, Santa Clara (United States), doi: 10.1109/ISQED.2016.7479230.

- [18] Q. Xia and J. J. Yang, "Memristive crossbar arrays for brain-inspired computing," *Nature
  materials*, vol. 18, no. 4, p. 309, 2019, doi: 10.1038/s41563-019-0291-x.
- [19] Y. LeCun, C. Cortes, and C. J. C. Burges, "The MNIST database of handwritten digits,"
  2010. [Online]. Available: http://yann.lecun.com/exdb/mnist
- [20] A. Mehonic, D. Joksas, W. H. Ng, M. Buckwell, and A. J. Kenyon, "Simulation of inference accuracy using realistic RRAM devices," *Frontiers in Neuroscience*, vol. 13, p. 593, 2019, doi: 10.3389/fnins.2019.00593.
- [21] M. P. Perrone and L. N. Cooper, "When networks disagree: Ensemble methods for hybrid
  neural networks," in Artificial Neural Networks for Speech and Vision. Chapman and Hall,
  1993, pp. 126–142.
- [22] S. Hashem and B. Schmeiser, "Improving model accuracy using optimal linear combinations of
   trained neural networks," *IEEE Transactions on Neural Networks*, vol. 6, no. 3, pp. 792–794,
   1995, doi: 10.1109/72.377990.
- [23] B. Li, L. Xia, P. Gu, Y. Wang, and H. Yang, "Merging the interface: Power, area and accuracy
  co-optimization for RRAM crossbar-based mixed-signal computing system," in *Proceedings of the 52nd Annual Design Automation Conference*, 2015, San Francisco (United States), doi:
  10.1145/2744769.2744870.
- [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional
   neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105,
   Lake Tahoe (United States), doi: 10.1145/3065386.
- [25] Z. Wang, C. Li, P. Lin, M. Rao, Y. Nie, W. Song, Q. Qiu, Y. Li, P. Yan, J. P. Strachan,
  N. Ge, N. McDonald, Q. Wu, M. Hu, H. Wu, R. S. Williams, Q. Xia, and J. J. Yang, "In situ
  training of feed-forward and recurrent convolutional memristor networks," *Nature Machine Intelligence*, vol. 1, no. 9, pp. 434–442, 2019, doi: 10.1038/s42256-019-0089-1.
- [26] H. Jiang, L. Han, P. Lin, Z. Wang, M. H. Jang, Q. Wu, M. Barnell, J. J. Yang, H. L. Xin, and
  Q. Xia, "Sub-10 nm Ta channel responsible for superior performance of a HfO<sub>2</sub> memristor,"
  *Scientific reports*, vol. 6, p. 28525, 2016, doi: 10.1038/srep28525.
- [27] G. W. Burr, R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R. S. Shenoy,
  P. Narayanan, K. Virwani, E. U. Giacometti, B. N. Kurdi, and H. Hwang, "Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element," *IEEE Transactions on Electron Devices*,
  vol. 62, no. 11, pp. 3498–3507, 2015, doi: 10.1109/TED.2015.2439635.
- [28] Y. Fan, L. Zhang, D. Crotti, T. Witters, M. Jurczak, and B. Govoreanu, "Direct evidence
  of the overshoot suppression in $Ta_2O_5$-based resistive switching memory with an integrated access resistor," *IEEE Electron Device Letters*, vol. 36, no. 10, pp. 1027–1029, 2015, doi: 10.1109/LED.2015.2470081.
- [29] B. Govoreanu, D. Crotti, S. Subhechha, L. Zhang, Y. Chen, S. Clima, V. Paraschiv, H. Hody,
  C. Adelmann, M. Popovici, O. Richard, and M. Jurczak, "A-VMCO: A novel forming-free, self-rectifying, analog memory cell with low-current operation, nonfilamentary switching and excellent
  variability," in *Symposium on VLSI Technology*, 2015, Kyoto (Japan), doi: 10.1109/VLSIT.2015.7223717.
- [30] Z. Chai, W. Zhang, P. Freitas, F. Hatem, J. F. Zhang, J. Marsland, B. Govoreanu, L. Goux,
  G. S. Kar, S. Hall, P. Chalker, and J. Robertson, "The over-reset phenomenon in Ta<sub>2</sub>O<sub>5</sub>
  RRAM device investigated by the RTN-based defect probing technique," *IEEE Electron Device Letters*, vol. 39, no. 7, pp. 955–958, 2018, doi: 10.1109/LED.2018.2833149.
- [31] C. Sung, S. Lim, H. Kim, T. Kim, K. Moon, J. Song, J.-J. Kim, and H. Hwang, "Effect
  of conductance linearity and multi-level cell characteristics of TaO<sub>x</sub>-based synapse device on
  pattern recognition accuracy of neuromorphic system," *Nanotechnology*, vol. 29, no. 11, p.
  115203, 2018, doi: 10.1088/1361-6528/aaa733.
- [32] Y. Fang, Z. Yu, Z. Wang, T. Zhang, Y. Yang, Y. Cai, and R. Huang, "Improvement of HfO<sub>x</sub>-based RRAM device variation by inserting ALD TiN buffer layer," *IEEE Electron Device Letters*, vol. 39, no. 6, pp. 819–822, 2018, doi: 10.1109/LED.2018.2831698.
- [33] B. Govoreanu, A. Redolfi, L. Zhang, C. Adelmann, M. Popovici, S. Clima, H. Hody,
  V. Paraschiv, I. Radu, A. Franquet, J. C. Liu, J. Swerts, O. Richard, H. Bender, L. Altimime,
  and M. Jurczak, "Vacancy-modulated conductive oxide resistive RAM (VMCO-RRAM): An
  area-scalable switching current, self-compliant, highly nonlinear and wide on/off-window resistive switching cell," in *International Electron Devices Meeting*. IEEE, 2013, Washington
  (United States), doi: 10.1109/IEDM.2013.6724599.
- [34] A. J. Kenyon, M. S. Munde, W. H. Ng, M. Buckwell, D. Joksas, and A. Mehonic, "The
  interplay between structure and function in redox-based resistance switching," *Faraday Discussions*, vol. 213, pp. 151–163, 2019, doi: 10.1039/C8FD00118A.

- [35] W. Wu, H. Wu, B. Gao, P. Yao, X. Zhang, X. Peng, S. Yu, and H. Qian, "A methodology to improve linearity of analog RRAM for neuromorphic computing," in *Symposium on VLSI Technology*. IEEE, 2018, Honolulu (United States), doi: 10.1109/VLSIT.2018.8510690.
- [36] Z. Chai, P. Freitas, W. Zhang, F. Hatem, J. F. Zhang, J. Marsland, B. Govoreanu, L. Goux,
  and G. S. Kar, "Impact of RTN on pattern recognition accuracy of RRAM-based synaptic
  neural network," *IEEE Electron Device Letters*, vol. 39, no. 11, pp. 1652–1655, 2018, doi:
  10.1109/LED.2018.2869072.



Figure 1. Using multiple neural networks to improve inference accuracy. **A**) The principle of EA. **B**) Using identical digital networks when implementing committees of memristive neural networks only helps to deal with the damage to the networks caused by the non-idealities. **C**) Using different digital networks when implementing committees of memristive neural networks both helps to deal with the damage to the networks caused by the non-idealities and allows the knowledge about the data set acquired by individual digital networks to be combined.



Figure 2. Experimental data of Ta/HfO<sub>2</sub> RRAM crossbar array of shape  $128 \times 64$ . **A**) Modulation of devices' conductance over 11 SET cycles, each consisting of 100 potentiating pulses. Violin plots of gradual conductance changes are shown for all Ta/HfO<sub>2</sub> devices, with dots representing median conductance after a certain number of pulses. 100 points were used for Gaussian kernel density estimation. All violin plots have their maximum widths normalised. **B-F**) Examples of devices with their conductance (in mS) **B**) spanning the full range, **C**) spanning part of the full range, **D**) exhibiting cycle-to-cycle variability, **E**) stuck at high values, **F**) stuck at low values. These diagrams show the conductance of five devices from the Ta/HfO<sub>2</sub> crossbar array over 11 SET and RESET cycles. The radial component represents the conductance, while the angular component represents the number of applied pulses. The first SET cycle starts at the top of each of the diagrams. The conductance (in blue) over 100 SET pulses is displayed in a clockwise fashion across the right half of each of the diagrams. Following that, conductance (in orange) over 100 RESET pulses (starting at the bottom) is displayed across the left half of each of the diagrams, after which the next cycle is displayed. A Cartesian version of these plots is shown in Supplementary Figure S9.



Figure 3. Theoretical implementation of a synaptic layer of shape  $785 \times 25$  using crossbars of shape  $128 \times 64$ . **A**) Mapping the first subset of weights onto one of the seven crossbars used to implement the whole synaptic layer. Positive weights and negative weights are mapped onto memristors in different bit lines. **B**) Heatmap of average changes in output currents due to line resistance (in all seven Ta/HfO<sub>2</sub> crossbars) without and with a scheme that maps certain inputs onto certain word lines depending on expected average intensities of those inputs. For this particular simulation, it was assumed that Ta/HfO<sub>2</sub> devices can be programmed perfectly.



Figure 4. Accuracy achieved by individual networks and their committees when faulty devices, D2D variability data and line resistance of Ta/HfO<sub>2</sub> crossbar are taken into account. The maximum whisker length is set to  $1.5 \times IQR$ .



Figure 5. Cumulative probability plots of RTN-induced relative current deviations for all 8 resistance states of a  $Ta_2O_5$  RRAM device. Lognormal fits are shown for each resistance state.



Figure 6. Accuracy achieved by individual networks and their committees when RTN data of a Ta<sub>2</sub>O<sub>5</sub> device are taken into account. Additionally, interconnect resistance of  $0.35 \,\Omega$  and  $0.32 \,\Omega$  in the word and bit lines, respectively (from the Ta/HfO<sub>2</sub> array), was used to include line resistance effects. The maximum whisker length is set to  $1.5 \times IQR$ .



Figure 7. Cumulative probability plots of RTN-induced relative current deviations for all 8 resistance states of an aVMCO RRAM device. Lognormal fits are shown for each resistance state.



Figure 8. Accuracy achieved by individual networks and their committees when RTN data of an aVMCO device are taken into account. Additionally, interconnect resistance of  $0.35 \,\Omega$  and  $0.32 \,\Omega$  in the word and bit lines, respectively (from the Ta/HfO<sub>2</sub> array), was used to include line resistance effects. The maximum whisker length is set to  $1.5 \times IQR$ .



Figure 9. Median accuracy achieved by individual one-hidden-layer memristor-based networks and their committees, when controlled for the total number of memristors required. The networks contained 25, 50, 100 or 200 hidden neurons and were disturbed using faulty devices and D2D variability data from a Ta/HfO<sub>2</sub> crossbar.

| First author<br>(year)      | Non-ideality                  | Device type                                                 | Proposed solution                                                                       |
|-----------------------------|-------------------------------|-------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| C. Sung<br>(2018) [31]      | Current/voltage non-linearity | TaO<sub>x</sub> RRAM                                        | Hot-forming step is adopted                                                             |
| C. Li<br>(2018) [15]        | Current/voltage non-linearity | Ta/HfO<sub>2</sub> RRAM                                     | 1T1R architecture is adopted                                                            |
| Y. Fang<br>(2018) [32]      | Device-to-device variability  | HfO<sub>x</sub> RRAM                                        | Ultra-thin ALD-TiN<br>buffer layer is introduced                                        |
| B. Govoreanu<br>(2013) [33] | Device-to-device variability  | Al<sub>2</sub>O<sub>3</sub>/TiO<sub>2</sub> (VMCO) RRAM     | Non-filamentary RRAM is adopted                                                         |
| A. J. Kenyon<br>(2019) [34] | Device-to-device variability  | SiO<sub>x</sub> RRAM                                        | The roughness of bottom<br>electrodes is increased                                      |
| L. Xia<br>(2017) [14]       | Faulty devices                | -                                                           | A modified mapping algorithm<br>and redundancy schemes are used                         |
| S. Ambrogio<br>(2018) [7]   | Limited dynamic range         | PCM                                                         | Two pairs of conductances of varying significance<br>for every synaptic weight are used |
| M. Hu<br>(2016) [17]        | Line resistance               | -                                                           | Advanced mapping algorithms are used to<br>compensate for line resistance effects       |
| W. Wu<br>(2018) [35]        | Programming non-linearity     | HfO<sub>x</sub> RRAM                                        | Electro-thermal modulation layer is<br>deposited on the switching layer                 |
| J. Woo<br>(2016) [9]        | Programming non-linearity     | HfO<sub>2</sub> RRAM                                        | Bilayer structure is adopted                                                            |
| S. Ambrogio<br>(2018) [7]   | Programming non-linearity     | PCM                                                         | PCM devices are used together<br>with CMOS transistors                                  |
| Z. Chai<br>(2018) [36]      | Random telegraph noise        | TiO<sub>2</sub>/a-Si (aVMCO) RRAM                           | Non-filamentary RRAM is adopted                                                         |

**Table I**. Examples of past efforts at dealing with non-idealities of memristive devices and their systems.