Abstract-A method is presented for automated improvement of embedded application reliability. The compilation process is guided using genetic algorithms and a multiobjective optimization approach (MOOGAs). Even though modern compilers are not designed to generate reliable builds, they can be tuned to obtain compilations that improve their reliability, through simultaneous optimization of their fault coverage, execution time, and memory size. Experiments show that relevant reliability improvements can be obtained from an efficient exploration of the compilation solutions space. Fault-injection simulation campaigns are performed to assess our proposal against different benchmarks, and the results are assessed against a real Advanced RISC Machines-based system-on-chip under proton irradiation.
of their nature, traditional hardware redundancy techniques cannot be applied to their structural components.
In this context, software-implemented hardware fault tolerance (SIHFT) techniques are intended to run reliable software over unreliable hardware [1] . Although these techniques increase reliability, the necessary instrumentation of their code causes important overheads in both memory footprints and execution times that deserve serious consideration [2] , [3] .
A potential method for modifying reliability without code instrumentation is to adjust the way that modern compilers build the programs. In fact, if compiler parameters and flags are properly used, code can be reordered, useless instructions can be removed, unnecessary loops can be reduced, and constant operations can be precalculated, among many other optimizations. These changes produce different executables with the same functionality and may affect the observed reliability of the application. As a result, the same source code can be used to generate many different executables with particular features, such as an improved execution time, a reduced memory footprint, and even increased fault coverage. In summary, SIHFT techniques gain reliability by instrumenting the code under protection, while compilers reorganize and optimize the code and, as a side effect, may improve its reliability.
In this context, modern compilers, such as GNU-Compiler Collection (GNU-GCC) and Clang/LLVM, are known to offer a wide range of optimization parameters that are intended to reduce the code size or the execution time needed to complete the whole program. For instance, Clang/LLVM supports more than 250 optimizations and GNU-GCC offers 230 optimizations and 195 parameters for modifying those optimizations [4] . However, those compilers offer no predefined optimization associated with reliability improvements. Several studies in this area have approached the question of what influence the standard optimization levels have on application reliability. Demertzi et al. [5] analyzed how the first three optimization levels of GCC (named O1, O2, and O3) impacted on the expected number of failures in some specific processor structures. Medeiros et al. [6] added a further predefined optimization (Os) to their study and estimated the soft error resilience of 24 applications running on a SystemC model of a Microprocessor without Interlocked Pipelined Stages processor. Even though the results suggested that this flag provided better overall system behavior, in general, no clear relation was established between the standard optimizations, applications, and reliability enhancements. A similar conclusion can be found in [7] , a study that concerned an Advanced RISC Machines (ARM) processor and emulated fault injection on the real hardware. When compared with the cross sections obtained in heavy-ion experiments, the results showed different trends. According to the authors, that divergence could be explained by the partial injection campaigns, which targeted only the register file of the processor.
The overall picture becomes more complicated when considering all the available parameters and options. Iterative compilations [8] have shown important performance improvements that could be applied to reliability optimization. However, the computational effort that is required, similar to a brute force approach, makes any comprehensive exploration of the solution space or even of a reduced subset of that space unfeasible. Narayanamurthy et al. [9] proposed the use of a genetic algorithm (GA) to alleviate that problem, which could identify compiler optimization sequences capable of improving application performance levels without degrading error resilience. The proposal was implemented without considering any specific processor (faults were injected on intermediate code) and the study was limited to a reduced subset of ten optimizations provided by the Clang/LLVM compiler. A preliminary work of our own is presented in [10] . It combined GA with a multiobjective optimization (MOO) algorithm to explore the complete GCC solution space. The study presented a strategy based on register file vulnerabilities for improving the overall fault coverage of a particular low-end 16-bit processor.
Our above-mentioned work is continued and extended in this paper by studying a complex 32-bit ARM-based architecture and GCC, one of the most widely used compilers. It presents the following contributions with respect to previous works. In the first place, and contrary to other approaches, our method takes into account all the GCC optimizations and parameters. As a result, it takes advantage of all the sophisticated processor features (out-of-order execution, branch predictors, pipeline, etc.) and conducts an in-depth exploration of compiler opportunities for the improvement of system reliability. In the second place, in addition to the standard reliability factors-fault coverage and execution time-a new one is considered: memory size. Consequently, our method offers new tradeoffs to fit the system constraints. Finally, memory section vulnerability concurrently with the vulnerability of the register file is taken into account for estimating the fault coverage of the application, which represents a remarkable difference with respect to other approaches because it increases the accuracy of the estimations. Furthermore, the solutions offered by our method, in a majority of cases, showed similar trends in proton-irradiation validation tests.
The search for the best compiler string that will, in general, serve to improve application reliability is a very complex task that lies beyond the scope of this paper, due to a large number of compiler optimizations under consideration. The main goal of our work is, therefore, to provide a method that, with any given application, will produce the executables with the best tradeoffs among the three objectives that define its overall reliability.
Compared to traditional SIHFT techniques, which improve fault tolerance at the expense of important time and memory overheads, our method is simultaneously capable of increasing fault coverage and improving both performance and the memory footprint. The results show that, even when applying aggressive optimizations, the new approach can maintain and even increase fault coverage and will, consequently, produce reliability increments of up to 4.2× in terms of the mean work to failure (MWTF) metric. Our method is not designed to replace traditional redundancy techniques but to complement and to reinforce them. In this way, compilation can be tuned before and after applying any specific SIHFT technique.
The rest of this paper will be organized as follows. Section II will start with a review of the compilation process and the role of the optimizations. It will then present our approach for tuning compilations. In Section III, the case study will be described together with the framework that is implemented to perform the space exploration. Similarly, in Section IV, the details of the experimental setup used in radiation tests will be presented. Section V will show the solutions obtained by the genetic algorithms and a multiobjective optimization (MOOGA) approach and, in Section V-B, those solutions will be compared with the radiation results. Finally, the conclusions of the work will be outlined in Section VI.
II. COMPILER-GUIDED RELIABILITY IMPROVEMENT USING MOOGA

A. Background on Compiler Optimizations
Compilers long ago evolved beyond simple translators of high-level source code into machine code. The complexity of the first stages, known as the front end, has developed to the point where optimizations may even pass undetected by programmers. The last stages, known as the back end, have planning and resource utilization capabilities that can follow different strategies. These behaviors are controlled by users through a list of optimizations and parameters that are built into each compiler. Because of their high number, it is practically impossible to establish the behavior of each one, for which reason compilers usually offer a set of well-known predefined optimization levels. In the case of GCC, these levels are known as -O flags. -O0 applies only optimizations with a relatively low impact on the final executable. When -O1 is enabled, the compiler attempts to reduce execution time and code size without performing any optimizations that consume a great deal of compilation time. Compared to -O1, -O2 increases both compilation time and performance: it enables code reorganization and analyzes the program to identify constant function parameters. -O3 introduces further optimizations, such as function inlining and the removal of loops with a relatively low number of iterations. -Os performs optimizations designed to reduce code size. Finally, -Ofast enables aggressive optimizations that could in some cases imply loss of accuracy. All these predefined optimization levels are centered on performance and memory footprint. They take no account of the reliability of the final application. Apart from the -O flags, GCC offers many individual optimization steps, which usually have associated parameters for fine control and can produce different effects.
Some of them are intended to produce function cloning in order to make interprocedural constant propagation stronger (-fipa-cp-clone), or they are designed to minimize stack usage (-fconserve-stack). Others may have structural effects, such as parallelization and function inlining (-finline-functions-called-once), or they may affect instruction scheduling, such as the l2-cache-size and conserve-stack parameters. Some of them are binary (enabled or disabled), while others accept integer values.
In general, the effect of a combination of optimizations/parameters is difficult to predict. Even some flags and ranged parameters could have the same behavior depending on the problem. For instance, loop-unroll optimizations are used to speed up the calculation, reducing the number of jumps and variable checks. Depending on the unroll factor that is chosen, many different constructions can be generated with different size-performance-reliability tradeoffs. However, if the unrolling factor is increased beyond the total number of iterations, no further effect will be produced, and no different build will be generated. Furthermore, if the unrolling factor applied is not a divisor of the number of iterations, the remaining iterations must be performed outside the main loop. Compared to the perfect unroll, the additional code reduces performance, increases the lifetime of some variables and, in some cases, may include new variable checks (e.g., when the number of iterations is unknown before compilation).
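The remainder effect described above can be illustrated with a hand-unrolled loop. The following is only a sketch in Python of the shape of the transformation (a compiler applies it at the machine-code level, and GCC's actual output will differ); the function name and the unroll factor of 4 are assumptions made for illustration.

```python
def sum_unrolled_by_4(values):
    """Sum a list with a main loop unrolled by 4 plus a remainder loop.

    Returns (total, leftover), where leftover is the number of
    iterations that had to run outside the unrolled main loop.
    """
    n = len(values)
    main_end = n - (n % 4)      # last index covered by full blocks of 4
    total = 0
    i = 0
    # Main loop: one loop test and one jump per four elements.
    while i < main_end:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop: only needed when 4 does not divide n.
    leftover = 0
    while i < n:
        total += values[i]
        i += 1
        leftover += 1
    return total, leftover


print(sum_unrolled_by_4(list(range(8))))    # perfect unroll, no remainder
print(sum_unrolled_by_4(list(range(10))))   # two iterations fall outside the main loop
```

The second call shows the case discussed in the text: the extra loop adds code and keeps `i` and `total` alive for longer than in the perfect unroll.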
B. MOOGA Approach
The potentially unworkable number of optimizations mentioned in Section I requires a strategy to accelerate the search for the combinations with the best features. In this context, we propose the use of GAs [11] for efficient exploration of the solutions space, together with a MOO algorithm [12] to deal with different objectives that affect the reliability of applications. The so-called MOOGA [13] approach will produce those candidates (individuals) that offer better tradeoffs between each other.
GAs are probabilistic search algorithms used for high-dimensional stochastic problems with nonlinear solutions. These groups of techniques define a branch of evolutionary algorithms (EAs) [14] . GAs are algorithms inspired by the evolution of the species. Thus, the individuals with better qualities have better chances of passing their genes to the next generation, while the worse are less likely to do so. GAs make use of the concept of crossover, to combine two individuals in a new one that shares the genes of both parents. There is also the concept of mutation, which randomly changes one gene from the genome of the individual. Crossover and mutation give GAs the ability to perform a gradient-free search that is less prone to getting stuck at local minima.
In our case, an individual is defined by a certain combination of compiler optimizations and parameters that become its genes (see Fig. 1 ). In that way, individuals are coded using an array (chromosome) containing the state of each possible optimization parameter and flag. Each individual, therefore, describes a program compilation the behavior of which may differ from the other individuals.
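As a minimal sketch of this encoding, an individual can be held as two mappings (binary flags and integer parameters) and translated into a GCC argument list. The class name, field names, and the chosen values are hypothetical; only the option names (`ipa-cp-clone`, `conserve-stack`, `l2-cache-size`) come from the text, and the real chromosome in this work covers all GCC optimizations and parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """One candidate compilation: the state of every flag and parameter."""
    flags: dict = field(default_factory=dict)    # option name -> enabled?
    params: dict = field(default_factory=dict)   # parameter name -> integer value

    def to_gcc_args(self):
        """Translate the chromosome into a list of GCC command-line arguments."""
        args = []
        for name, enabled in sorted(self.flags.items()):
            # Binary genes map to -f<name> or -fno-<name>.
            args.append(f"-f{name}" if enabled else f"-fno-{name}")
        for name, value in sorted(self.params.items()):
            # Integer genes map to "--param <name>=<value>".
            args.extend(["--param", f"{name}={value}"])
        return args


ind = Individual(
    flags={"ipa-cp-clone": True, "conserve-stack": False},
    params={"l2-cache-size": 512},   # illustrative value only
)
print(ind.to_gcc_args())
```

Two chromosomes that differ in their genes can still yield the same executable, which is why the genes, and not the resulting binary, identify an individual.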
In real-life problems, objectives that are under evaluation are not always independent of each other. The objectives are commonly related or in conflict with each other, which prevents simultaneous improvements. In such cases, MOO algorithms, rather than computing a weight for each objective function and combining them into a single composite function, seek the best compromises among the various objective functions.
In MOO sorting algorithms, the solutions are ordered by the degree to which they meet the different objectives, so that the solutions reported by MOO are based on the concept of Pareto nondominance. The Pareto optimal front shows multiple solutions with different degrees of satisfaction of the objectives. In addition, those solutions are characterized by the impossibility of improving any objective without worsening at least one of the others. Our approach makes use of the well-known Nondominated Sorting Genetic Algorithm-II (NSGA-II) [15] . NSGA-II is a MOO algorithm based on nondominated sorting, which first arranges the population into successive fronts of nondominated individuals. After the fronts are built, NSGA-II orders the individuals that belong to the same front by means of the crowding distance function, which estimates the diversity value of a solution. In that way, individuals are evaluated on the basis of their diversity within the dimension of each objective. The goal is to maintain a good spread of individuals and to increase the scope of the solution space that is explored.
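The two ingredients named above, nondominated fronts and crowding distance, can be sketched in a few lines. This is a simple quadratic-time illustration for minimization problems, not the fast nondominated sort used by NSGA-II implementations; the function names and the objective values (ACE %, cycles, KiB) are made up for the example.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated_fronts(points):
    """Peel off successive fronts of mutually nondominated points (by index)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Per-individual crowding distance inside one front (larger = more isolated)."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        # Boundary individuals are always kept.
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][m] - points[prev][m]) / (hi - lo)
    return dist


# Hypothetical objective vectors: (ACE %, cycles, memory in KiB), all minimized.
points = [(20.0, 500, 200), (25.0, 100, 150), (26.0, 120, 160), (22.0, 300, 180)]
fronts = nondominated_fronts(points)
print(fronts)
print(crowding_distance(points, fronts[0]))
```

Here the third point is dominated by the second one in all three objectives, so it falls into the second front.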
In our case, the final executable needs to achieve improvements in fault coverage, performance, and the memory footprint, which directly influences the overall reliability of an application. Those objectives and their interdependencies in embedded processors imply that the problem to be undertaken is a complex one.
MOOGA combines a GA with MOO. Fig. 2 shows how the combination of these two algorithms works to improve the overall reliability of an application. The first step is initialization, which oversees the gene encoding and produces a population of randomly generated individuals. The evolutionary loop is the second step, where our MOOGA approach iterates over several generations. Each generation is produced from the previous one by crossing and mutating the best-fitted individuals. Those individuals are evaluated by means of fault injection campaigns and ranked by MOO in terms of Pareto efficiency. The process ends when a reliability goal is fulfilled or when a predefined number of iterations is reached. As a result, MOOGA provides all the individuals on the Pareto front, which were collected across the successive generations. Engineers can take advantage of all this information to select the individual that best fits the system requirements. In this work, we selected some of them to be irradiated and to show the quality of the results that our proposal can offer.
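The initialization and evolutionary loop can be outlined as follows. This is a deliberately simplified sketch, not the implementation used in this work: parents are drawn at random from the accumulated Pareto archive instead of using NSGA-II ranking, genes are binary only, and `evaluate` is a placeholder for the compilation plus fault injection campaign. All names and default values are assumptions.

```python
import random


def dominates(a, b):
    """Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scored):
    """Keep the (genes, objectives) pairs that no other pair dominates."""
    return [s for s in scored if not any(dominates(o[1], s[1]) for o in scored)]


def mooga(evaluate, n_genes, pop_size=20, generations=10, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: initialization with randomly generated individuals.
    population = [tuple(rng.randint(0, 1) for _ in range(n_genes))
                  for _ in range(pop_size)]
    archive = []
    # Step 2: evolutionary loop with a fixed generation budget.
    for _ in range(generations):
        scored = [(ind, evaluate(ind)) for ind in population]
        # Merge with the archive, drop duplicate chromosomes, keep the front.
        merged = {genes: objs for genes, objs in archive + scored}
        archive = pareto_front(list(merged.items()))
        parents = [genes for genes, _ in archive]
        population = []
        while len(population) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            child = tuple(rng.choice(pair) for pair in zip(a, b))      # crossover
            child = tuple(g ^ 1 if rng.random() < p_mut else g for g in child)  # mutation
            population.append(child)
    return archive


# Toy evaluation with two conflicting objectives (stand-in for fault injection).
front = mooga(lambda g: (sum(g), len(g) - sum(g)), n_genes=6)
print(len(front), "nondominated individuals collected")
```

A reliability-goal stop criterion could be added by checking the archive at the end of each generation and breaking out of the loop early.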
III. SIMULATION SETUP
In an assessment of the strategy explained earlier, a set of benchmarks was selected from the Beebs (Open Benchmarks for Energy Measurements on Embedded Platforms) [16] project: QuickSort, NDES, Dijkstra, and BubbleSort. BubbleSort is a sorting algorithm that involves basic loop constructs, integer comparisons, and simple array handling. New Data Encryption Standard (NDES) is a block cipher based on a deterministic algorithm that operates on matrices stored in memory, known as keys, with between 65 and 640 elements. The algorithm takes a fixed-length block of 64 bits and transforms it through different operations (permutations, substitution, XOR, etc.) into another bitstring of the same length. The algorithm includes nested loops and deterministic memory access patterns. In addition, key integrity is crucial, because any minimal change in the keys could lead to the destruction of the data sets that are ciphered. Dijkstra is an algorithm that establishes the shortest path between nodes in a graph. The algorithm analyzes an adjacency matrix, which stores the weight of each route, following a random access pattern. Finally, QuickSort is another sorting algorithm, which operates in-place, requiring small additional amounts of memory to perform the task. The algorithms that were selected presented a variety of programming structures that are suitable for the application of different compiler optimizations. The compilations were performed by the GCC compiler from the Linaro project (version arm-eabi-gcc v7.2-2017.11).
A state-of-the-art ARM cortex A9 processor instruction-accurate model was the underlying architecture for each benchmarking test. It has a 32-bit CPU that includes a register file of 18 registers. The first 13 of them [R0-R12] are general purpose registers. The remainder, such as the stack pointer (SP), link register (LR), program counter (PC), floating-point status (FPS), and current program status register (CPSR) are control registers. The processor has a load/store architecture, which means that all the instructions operate with registers, except for load and store instructions. Cortex-A9 has a partial out-of-order eight-stage pipeline that includes a branch prediction block and support for two levels of cache. Modern compilers, such as GCC that is used in this work, take advantage of all these sophisticated features to improve the executable code.
A. Fault Injector Manager
Once an application from the benchmark was compiled with a defined set of optimizations and parameters, the size in KiB of the corresponding executable was used as one of the objective functions. The second objective, performance, was measured in terms of execution time using the Imperas OVPsim simulator [17] and was expressed in cycles. The evaluation of the fault coverage against soft errors was performed by means of fault injection campaigns, based on the bit-flip model with an injection of one fault per run. Each fault was emulated by means of a single bit-flip in a randomly selected bit from the resources (microprocessor register file and memory) and in a randomly selected clock cycle from the program duration. For this purpose, a custom plug-in was developed giving the simulator nonintrusive fault injection capabilities [18] . In doing so, no benchmarks were modified or instrumented with unnecessary code for injecting the faults. Moreover, this plug-in offers flexibility in the selection of the resources and the memory sections for fault injection. The boot code used to initialize the device was not considered in the injection, which yields fault coverage estimations of greater accuracy. An extension of fault injection manager (FIM) framework [19] was used for conducting the fault injection campaigns. FIM automatically gathers the ground truth parameters of an executable code, such as execution time and the memory map of the different sections. FIM controls the injection campaign by means of several user-defined parameters (e.g., number of faults, maximum allowable execution time, resources to be injected, etc.) and records the overall results. The fault effects are classified by FIM as unACE (unnecessary for architecturally correct execution) in case the system completes its execution and obtains the expected output after a fault is injected.
Otherwise, they are classified as ACE (necessary for architecturally correct execution), which comprises any undesirable effect category, such as uncorrected faults [silent data corruption (SDC)], abnormal program termination, or infinite execution loop (Hang) [20] . Each campaign was configured to inject 3000 faults per register in the register file (54 000 faults over its 18 registers) and 18 000 faults in the memory segment allocated by the benchmark. This arrangement implies a total of 72 000 faults per individual (program version), achieving a statistical error of ±0.01 at a 99% confidence level, according to the statistical model proposed by Leveugle et al. [21] .
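The bit-flip model and the sample-size relation can be sketched as follows. The margin formula is the finite-population expression commonly associated with statistical fault injection, n = N / (1 + e²(N − 1) / (t² p(1 − p))), solved for e; the fault-space size used below, the worst-case p = 0.5, and all function names are assumptions made for illustration and are not taken from the FIM framework.

```python
import math
import random


def flip_bit(value, bit, width=32):
    """Return `value` with a single bit inverted, masked to `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def random_fault_site(rng, n_registers, total_cycles, width=32):
    """Pick a register, a bit position, and an injection cycle uniformly."""
    return (rng.randrange(n_registers),
            rng.randrange(width),
            rng.randrange(total_cycles))


def margin_of_error(n, population, t=2.576, p=0.5):
    """Error margin for n injections out of `population` possible faults.

    t = 2.576 corresponds to a 99% confidence level; p = 0.5 is the
    worst-case (most conservative) failure proportion.
    """
    return t * math.sqrt(p * (1 - p) / n * (population - n) / (population - 1))


def required_faults(population, e, t=2.576, p=0.5):
    """Number of injections needed to reach an error margin e."""
    return population / (1 + e ** 2 * (population - 1) / (t ** 2 * p * (1 - p)))


rng = random.Random(42)
reg, bit, cycle = random_fault_site(rng, n_registers=18, total_cycles=43_000)
print(reg, bit, cycle, hex(flip_bit(0x0000FFFF, bit)))

# Hypothetical fault space (bits x cycles); its exact size barely matters once it is huge.
N = 10 ** 12
print(margin_of_error(72_000, N))
print(required_faults(N, 0.01))
```

Under these assumptions, 72 000 injections give a margin of roughly half a percent, which is comfortably within the ±0.01 bound quoted above.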
B. MOOGA Parameters
The MOOGA algorithm was configured to produce successive generations, each of 500 individuals. Our approach implemented the uniform mutation operator, and the probability of change was, therefore, the same for each gene. Similarly, the uniform crossover operator was implemented, in which each gene of the child chromosome is taken from one of its two parents with a given probability. Mutation and crossover GA operators were set with a probability of 5% for most of the process. During the first phase of MOOGA, a higher mutation rate was used to improve the MOOGA dynamics, ensuring a richer population and accelerating the convergence of the algorithm.
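The two operators can be sketched as below. The function names, the per-gene domains, and the default swap probability of 0.5 are assumptions for illustration; the text only fixes that both operators are uniform, so the rates should be treated as tunable inputs.

```python
import random


def uniform_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Build a child gene by gene: take the gene from parent_b with
    probability swap_prob, otherwise keep the gene of parent_a."""
    return [b if rng.random() < swap_prob else a
            for a, b in zip(parent_a, parent_b)]


def uniform_mutation(genes, domains, rate, rng):
    """Resample each gene from its own domain with the same probability `rate`."""
    return [rng.choice(domain) if rng.random() < rate else gene
            for gene, domain in zip(genes, domains)]


rng = random.Random(7)
# Hypothetical chromosome: two binary flags and one integer parameter.
domains = [(False, True), (False, True), tuple(range(64, 1025, 64))]
mother = [True, False, 512]
father = [False, True, 128]

child = uniform_crossover(mother, father, rng)
child = uniform_mutation(child, domains, rate=0.05, rng=rng)
print(child)
```

Raising `rate` during the first generations reproduces the richer early population described above, at the cost of a more random search.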
The individuals that represent the main compilation flags of the GCC compiler were added to the initial population. These flags, as previously mentioned, are referred to as -O0, -O1, -O2, -O3, -Ofast, and -Os. They describe sets of well-known and reliable optimization strategies for: increasing performance (options -O0 up to -Ofast) and program size shrinking (-Os). The same flags were also used as reference points to compare the best individuals generated by MOOGA. Three objectives were selected for simultaneous optimization because they are known to have a direct influence on program reliability: 1) memory footprint of the executable code, which defines the vulnerability area of the program; 2) execution time, which is proportional to the time that resources are exposed to faults; and 3) the intrinsic vulnerability factor of the code expressed as the percentage of ACE faults. The simultaneous minimization of each objective defines our search space.
The metric MWTF was also employed in this paper. It was first defined by Reis et al. [22] as the relationship between the amount of work completed and the number of errors encountered. MWTF was designed to compare the effectiveness of different hardware and software techniques, as it captures the inherent tradeoff between fault coverage improvements and the performance degradation that they produce. We likewise used this metric to compare the quality of different solutions obtained by our method. It is expressed as follows:
$$\mathrm{MWTF} = \frac{\text{amount of work completed}}{\text{number of errors encountered}} = \left(\text{raw\_error\_rate} \times \mathrm{AVF} \times \text{execution\_time}\right)^{-1}$$
where the raw_error_rate is determined by the circuit technology. In our experiments, we used different executables running on the same device (technology), so this term of the equation can be considered as a constant and was not expressed in the results. The execution_time term is the time to execute a given unit of work. A unit of work is an abstract concept the specific definition of which depends on the application. In our case, work may be better defined as the execution of a program. AVF stands for architecture vulnerability factor and is estimated by statistical fault injection and expressed as the ACE percentage. We used the convergence of the MWTF among those individuals belonging to the Pareto front and a maximum computational effort of 250 generations as the stop criterion for the MOOGA loop during the experimental tests. This number of generations was observed to be sufficient for the convergence of all the benchmarks.
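Because the raw error rate is constant across executables on the same device, MWTF is proportional to 1 / (AVF × execution_time), and two builds can be compared through the ratio of that quantity. The sketch below assumes one program execution as the unit of work and uses invented ACE and cycle figures; the function names are hypothetical.

```python
def relative_mwtf(ace_fraction, cycles):
    """MWTF up to the constant raw error rate: 1 / (AVF * execution time)."""
    return 1.0 / (ace_fraction * cycles)


def mwtf_gain(candidate, baseline):
    """How many times more work per failure the candidate completes."""
    return relative_mwtf(*candidate) / relative_mwtf(*baseline)


# Invented example: the candidate is less robust per run (25% vs 20% ACE)
# but four times faster, so it still completes more work between failures.
baseline = (0.20, 100_000)    # (ACE fraction, cycles)
candidate = (0.25, 25_000)
print(mwtf_gain(candidate, baseline))   # about 3.2
```

This makes explicit why a build with slightly worse fault coverage can still rank higher in MWTF when it is much faster.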
IV. RADIATION SETUP
The device under test (DUT) selected for the irradiation experiment was the Zynq Board, equipped with a 28-nm CMOS Xilinx ZYNQ XC7Z010 system-on-chip (SoC). This SoC is divided into two parts, a field-programmable gate array (FPGA) area [programmable logic (PL)] and a 32-bit ARM Cortex A9 microprocessor [processing system (PS)]. In addition, the microprocessor has a built-in memory called on-chip memory (OCM), onto which the bootloader or the test program can be loaded. The test application was compiled with the same compiler as in the simulation, adding the board support package (BSP) provided by Xilinx to initialize the DUT.
The DUT was controlled by an external computer, the RaspberryPi 3 Model B, the main task of which was to receive and log all the messages sent by the DUT. The DUT was configured to send a status message every 5 s in the absence of errors; when an error occurred, it was reported immediately and the external computer reset and reprogrammed the DUT.
The test campaign was performed at the National Centre for Accelerators, in Spain, at the start of 2018 [23] . The irradiation tests were performed using the external beam line, installed in the cyclotron laboratory.
The proton energy delivered by this cyclotron was set at 18 MeV, but the beam was extracted through an exit window and an air gap before reaching the DUT, which lowered its energy. In this case, the DUT was placed at 53.5 cm from the exit nozzle with a Mylar foil window of 125 µm, so that the final energy at the surface was 15.2 MeV, with an estimated spread of ∼300 keV. Previous tests at the Centro Nacional de Aceleradores have shown that the energy range of incident protons in the silicon active area, 8-10 MeV, is sufficient to produce events without thinning the device [24] . The final energy of the incident beam at the surface and in the active area was obtained by using the energy loss data calculated with the SRIM2013 code [25] .
Proton flux monitoring was performed indirectly, as the direct current reading on the DUT was not available. During the tests, the beam current was measured in an electrically isolated graphite collimator situated behind the exit window. In this study, a Brookhaven 1000C current integrator was used at a frequency scale of 600 pA (10-pA sensitivity). Through daily calibrations, a correlation factor was obtained from simultaneous measurements on the graphite collimator and on another graphite plate at the DUT position. In addition, a grounded aluminum mask in front of the target was used to avoid induced-current effects between both items and to define a uniform area of irradiation.
The flux was nearly constant, fluctuating by less than 5% during each run. A mean flux value was calculated, based on the pulses registered by the counter. Finally, the fluence at the DUT was calculated as a function of the exposure time for each run with an accuracy of 10%. Under these experimental conditions, the beam uniformity was higher than 90% in the area of interest.
V. EXPERIMENTAL RESULTS AND DISCUSSION
Prior to the irradiation, a MOOGA optimization stage was performed on the entire benchmark suite considering all GCC options and parameters. The computational effort required for each application differed in accordance with its complexity and the number of generations needed for MOOGA convergence. The most computationally intensive application was NDES. In that case, a single fault injection campaign (72 000 faults) lasted between 3 and 7 min depending on the individual. The MOOGA implementation was improved to skip the evaluation of equivalent individuals (i.e., their chromosomes included different optimizations and parameters but they produced identical executable files). In that way, the whole tuning process of NDES lasted 5 days, running on a PC desktop with an x86 processor (Intel core i5).
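One way to skip equivalent individuals is to key the evaluation results on a digest of the generated executable, so that a fault injection campaign runs only once per distinct binary. The text does not state how equivalence was detected beyond the executables being identical, so the SHA-256 approach and the class below are assumptions.

```python
import hashlib


class EvaluationCache:
    """Memoize expensive evaluations by the content of the executable."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # e.g., a full fault injection campaign
        self._results = {}          # digest -> objective values
        self.hits = 0               # evaluations that were skipped

    def score(self, executable_bytes):
        key = hashlib.sha256(executable_bytes).hexdigest()
        if key in self._results:
            self.hits += 1
        else:
            self._results[key] = self._evaluate(executable_bytes)
        return self._results[key]


# Stand-in evaluation: the real one would take minutes per executable.
cache = EvaluationCache(lambda binary: (len(binary),))
cache.score(b"\x7fELF-build-A")
cache.score(b"\x7fELF-build-A")   # different chromosome, identical binary
cache.score(b"\x7fELF-build-B")
print(cache.hits)
```

In practice the bytes would be read from the file produced by the compiler, and any embedded timestamps or paths would have to be excluded for identical builds to hash equally.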
A. MOOGA Simulation
For the sake of simplicity, only two applications are shown in Figs. 3-5 , where the algorithms BubbleSort and NDES are represented in (a) and (b), respectively. Fig. 3 shows the population obtained from the MOOGA optimization process and the distribution of the individuals (small size blue dots) around the solution space. The features of the algorithm, such as instruction-level parallelism, memory access patterns, data and control structures, among others, influence the capabilities of the compiler for the effective application of its entire arsenal of optimizations. BubbleSort is one of the simplest sorting algorithms, which presents low instruction parallelism and interacts poorly with modern processor hardware. It produces at least twice as many writes as other more sophisticated sorting algorithms (e.g., insertion sort), twice as many cache misses, and more branch mispredictions. It also translates into a very limited number of optimizations with an observable effect in the executable file. Fig. 3(a) shows the reduced solution space of BubbleSort, which includes several groups of nearly identical individuals. On the contrary, the NDES algorithm presented a higher level of instruction parallelism, deterministic loops, and repetitive access patterns. All these features permit a deeper exploration of the optimization space, as shown in Fig. 3(b) , which shows greater differentiation of individuals and a wider population spread. In both cases, BubbleSort and NDES, the individuals are grouped in clusters, which indicates the presence of optimization sets that have a strong incidence on the objectives under evaluation. Another important element is the Pareto surface (in gray), which represents the frontier of enhancements in which the best candidates are represented (medium size green dots). Similarly, singular individuals are shown with large size red dots.
For a detailed analysis of the population, the individuals are shown in Figs. 4 and 5 facing pairs of objectives. The ACE rate against the execution time is shown in Fig. 4 . In both applications, the fault coverage range of the population is significant. For instance, the fault coverage of the BubbleSort populations ranges between 27.7% ACE (MaxACE) and 18.6% ACE (MinACE), while the number of cycles needed to complete the programs varies from its minimum located at 43 Kcycles, up to its maximum located at 520 Kcycles. Looking at the third objective under evaluation (Fig. 5) , it can be appreciated that it ranges from 238 to 146 KiB. In summary, the fault coverage can be improved by nearly 9.1%, while its performance can be improved by about 12× and the memory overhead reduced by 1.6×, simply by tuning the compilation process. The NDES algorithm showed similar behavior to BubbleSort. In this case, the difference was nearly 12% in the ACE rate between MaxACE and MinACE individuals, the performance variation was close to 5×, and the memory overhead was about 2.5×.
Figs. 4 and 5 also reveal the complex relations between the objectives. Intuitively, larger memory usage will produce applications that are more prone to faults. However, this relation is unclear and it depends on other factors. For instance, if the memory footprint increases, but the lifetime of the stored variables decreases, it could lead to a fault coverage improvement. This effect can be seen in Fig. 5(a) , where individuals with a similar memory footprint (about 200 KiB) present a range of ACE percentages between 20% and 26%. Moreover, the individual with the minimum ACE (i.e., maximum fault coverage) presents a maximum memory footprint of around 230 KiB. Fig. 5(b) shows the higher differences in the same case of 200 KiB, where ACE varies between 14% and 25%. Therefore, the key is not the memory size but the way the memory is used. This fact corroborates the relevance of considering memory size as the third optimization objective in our approach. An analogous behavior can be observed in Fig. 4 , where the individuals that present very close execution times form vertical clusters with a significant variability in fault coverage. Indeed, the relationship between these three parameters is not clear, and it is not possible a priori to establish how they will evolve. These are the tradeoffs that our algorithm is seeking to exploit.
The behavior of the main optimization flags can be observed in Fig. 4(a). Among them, the individual O0 has the best fault coverage (20% ACE), at the cost of needing 2× more cycles than the others. Focusing now on the best individuals in terms of performance, a degradation can be seen in their fault coverage. For instance, O1 in Fig. 4(a) has the lowest fault coverage, about 5% lower than that of O0. Likewise, the higher optimization levels (O2, O3, Ofast, and Os) behaved similarly to O1 in terms of performance, while in terms of fault coverage, this group lowered its reliability by around 2%. Regarding NDES [Fig. 4(b)], the set O1, O2, O3, Ofast, and Os showed an increase in the ACE rate of 8% compared to the slowest option (O0). This significant worsening in fault coverage came in exchange for a performance increase of ∼4×.
Prior to the radiation experiment, some relevant individuals were characterized. Fig. 6 summarizes the improvements to the objectives for the main optimization flags, for the individuals with the maximum and minimum ACE percentage, and for those with the maximum and minimum MWTF. The fault coverage variations between them are remarkable when compared against the individual O0 (baseline), which is the reference compilation with no optimization at all for every benchmark under evaluation. It can be seen that this build had the best fault coverage, except for the MinACE build, at the expense of more cycles than the other individuals.
As can be seen, MOOGA can obtain individuals with a similar fault coverage to O0, but much shorter execution times, resulting in important improvements of MWTF (e.g., BubbleSort individual MaxMWTF improved this metric by as much as four times). A significant individual in the BubbleSort experimental test was the one labeled P.Pareto, located in the corner of the Pareto frontier, which offered a good tradeoff between the objectives under evaluation [Figs. 4(a) and 5(a)].
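The notion of a Pareto frontier over the three objectives can be illustrated with a minimal non-dominance filter. This is only a sketch of the dominance test, not the genetic search performed by MOOGA; the `Build` type, the function names, and all the numeric values are hypothetical and chosen for illustration only.

```python
from typing import List, NamedTuple


class Build(NamedTuple):
    """One compiled individual; all three objectives are minimized."""
    name: str
    ace: float       # % ACE (lower means better fault coverage)
    kcycles: float   # execution time in Kcycles
    kib: float       # memory size in KiB


def dominates(a: Build, b: Build) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    pairs = list(zip(a[1:], b[1:]))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_front(builds: List[Build]) -> List[Build]:
    """Keep the builds that no other build dominates."""
    return [b for b in builds if not any(dominates(o, b) for o in builds)]


# Hypothetical individuals (not measured values).
builds = [
    Build("A", 20.0, 90.0, 230.0),
    Build("B", 25.0, 45.0, 200.0),
    Build("C", 26.0, 50.0, 210.0),   # worse than B in all three objectives
    Build("D", 19.0, 500.0, 150.0),
]

front = pareto_front(builds)
print([b.name for b in front])
```

In this toy population, only build C is discarded, because build B is better in every objective; A, B, and D each win in at least one objective and therefore remain on the frontier.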
Regarding the remaining optimizations, the cycle speedup led to a worsening of the ACE results. As the final objective was to reduce the ACE rate as far as possible without losing the speedup already gained, the only builds that achieved both goals were those selected with the MWTF metric. This metric takes into account not only the ACE rate of the program but also the time needed to complete it. As can be seen from the results of all the programs, MinMWTF individuals were characterized by having the highest number of cycles and a relatively low ACE rate. On the contrary, MaxMWTF individuals on the Pareto frontier were the best in terms of performance.
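The interplay between ACE rate and execution time in the MWTF metric can be sketched numerically. The example assumes the usual formulation in which MWTF is inversely proportional to the product of the vulnerability (here, the ACE rate) and the execution time, with the raw error rate held constant so that it cancels when normalizing to a baseline. The function name and all the values are hypothetical.

```python
def relative_mwtf(ace: float, cycles: float,
                  base_ace: float, base_cycles: float) -> float:
    """MWTF normalized to a baseline, assuming MWTF ~ 1 / (ACE rate * execution time)."""
    return (base_ace * base_cycles) / (ace * cycles)


# Hypothetical baseline and candidates: (ACE in %, execution time in Kcycles).
base_ace, base_kcycles = 20.0, 100.0
candidates = {
    "fast": (25.0, 50.0),    # worse ACE rate, but half the cycles
    "slow": (19.0, 500.0),   # slightly better ACE rate, five times the cycles
}

rel = {name: relative_mwtf(ace, kc, base_ace, base_kcycles)
       for name, (ace, kc) in candidates.items()}

for name, value in rel.items():
    print(f"{name}: {value:.2f}x the baseline MWTF")
```

Under these assumptions, the fast build improves on the baseline MWTF despite its higher ACE rate, while the slow build falls well below the baseline even though its ACE rate is lower, which is the pattern described for the MaxMWTF and MinMWTF individuals.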
Finally, it is worth comparing those numbers with the ones obtained in our previous study on a simpler 16-bit processor (TI-MSP430). In that case, comparing the worst and the best individuals of the MOOGA approach, improvements of up to 6% in fault coverage and up to 45% in the MWTF metric were achieved. Meanwhile, this new study on ARM, taking O0 (not the worst individuals) as the baseline, showed improvements of up to 13% in fault coverage and of up to 420% in the MWTF metric. Even allowing for the inaccuracies in the evaluation of the Mixed Signals Processor (MSP) executables (the memory was excluded from the injection campaigns), those remarkable differences suggest that sophisticated ARM architectural features, such as out-of-order execution, speculative execution, instruction pipelines, and branch prediction, play an important role in the reliability improvements, allowing the compiler to get more out of its high-level optimizations. This hypothesis is supported by the number of different (unique) individuals produced by our method for MSP and ARM: a maximum of 51 unique individuals was obtained for MSP (synthetic program), while MOOGA explored more than 722 unique individuals for ARM (NDES program).
B. Radiation Stage
The candidates were finally reduced to the sorting (BubbleSort) and the cipher (NDES) algorithms, due to beam limitations. The individuals chosen for irradiation were: MaxACE, O0, O3, MaxMWTF, MinMWTF, and P.Pareto for BubbleSort and MinACE, MaxMWTF, and O0 for NDES.
The irradiation results are presented in Table I, which shows the SDC and the Hang dynamic cross sections, the MWTF values, and both the fluxes and the fluences for each individual. The aforementioned candidates were customized to use either Double Data Rate (DDR) memory (outside the incident proton beam) or on-chip memory (OCM). The 95% confidence intervals are included in all cases. These confidence intervals are computed using the classical formula for the estimation of the Poisson mean [26].
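As an illustration of how such intervals can be obtained, the sketch below computes an exact two-sided interval for a Poisson mean by numerically inverting the Poisson distribution function, and then scales it by the fluence to bound a cross section (events divided by fluence). Whether this matches the exact formula of [26] is an assumption; the function names, the event count, and the fluence are hypothetical.

```python
import math


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), summed in log space to avoid underflow."""
    if mu <= 0.0:
        return 1.0
    log_mu = math.log(mu)
    total = sum(math.exp(i * log_mu - mu - math.lgamma(i + 1)) for i in range(k + 1))
    return min(1.0, total)


def solve_mu(k: int, target: float) -> float:
    """Bisection for the mu at which poisson_cdf(k, mu) equals target (CDF decreases with mu)."""
    lo, hi = 0.0, k + 10.0 * math.sqrt(k + 1.0) + 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson_ci(k: int, conf: float = 0.95):
    """Exact two-sided confidence interval for the mean, given k observed events."""
    alpha = 1.0 - conf
    lower = 0.0 if k == 0 else solve_mu(k - 1, 1.0 - alpha / 2.0)
    upper = solve_mu(k, alpha / 2.0)
    return lower, upper


# Hypothetical campaign: 10 observed errors for a fluence of 1e11 particles/cm^2.
events, fluence = 10, 1.0e11
low, high = poisson_ci(events)
sigma = events / fluence
print(f"sigma = {sigma:.2e}, 95% CI = [{low / fluence:.2e}, {high / fluence:.2e}]")
```

With few observed events the interval is wide and asymmetric, which is why low-count cross sections should be compared with their intervals rather than by their point values alone.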
The OCM radiation setup, i.e., when all resources (memories and registers) are inside the beam, defines the scenario most similar to our MOOGA simulation. In this case, and looking at the BubbleSort section of the table, a remarkable match between the MOOGA estimations and the radiation results can be appreciated in the following terms. First, the individuals labeled by MOOGA as MaxACE and MinMWTF, i.e., the individuals with the worst fault coverage, obtained higher dynamic cross sections (3.8 · 10^−11 and 4.8 · 10^−11, respectively). A minor discrepancy can be seen in that the MaxACE individual was expected to have the highest dynamic cross section, but this was obtained by the MinMWTF version. One possible explanation is that the additional cycles have a decisive impact on fault coverage. The resources omitted during the fault injection campaigns (e.g., pipeline registers) could likewise lead to the same result, as suggested by other authors [7]. Second, the individuals showing the best fault coverage estimations (O0, MaxMWTF, and P.Pareto), as expected, presented a lower dynamic cross section than the others. Third, the MaxMWTF and P.Pareto individuals showed higher values of MWTF under radiation (4.67 · 10^14 and 5.45 · 10^14, respectively). The P.Pareto version improved its execution time compared with the simulation and, as a result, recorded a better MWTF in the irradiation experiment.
Similarly, in the NDES OCM section of the table, it can be seen that the MinACE and the O0 individuals, as expected, produced lower dynamic cross sections (5.9 · 10^−11 and 5.2 · 10^−11, respectively) than the others. Also, MaxMWTF produced the maximum MWTF under radiation (5.47 · 10^14).
Regarding the cache-deactivated DDR radiation setup, where memory errors are minimized (the memory is out of the beam), the following effects were observed in the BubbleSort benchmark. First, the overall dynamic cross section was, as expected, reduced with respect to the OCM version. For instance, it fell from 4.8 · 10^−11 down to 9.8 · 10^−12 for the MinMWTF version and from 3.8 · 10^−11 down to 1.3 · 10^−11 for MaxACE. Second, the MaxACE individual also showed the maximum dynamic cross section. Third, the MinMWTF version produced the worst MWTF value that was measured, 8.95 · 10^−12.
Finally, it is interesting to note that if the program fits in the cache, then the behavior of the OCM and the DDR versions will be similar. This case occurred with NDES O0, where the dynamic cross section varied only slightly, from 5.2 · 10^−11 in OCM up to 6.9 · 10^−11 in DDR. In addition, disabling the cache had the side effect of worsening performance by 15×. For instance, BubbleSort MaxACE rose from 44.4 Kcycles in OCM to 666.0 Kcycles in DDR. Consequently, the MWTF was worse compared with the cache-on individuals.
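The effect of the slowdown on the MWTF can be checked with the BubbleSort MaxACE figures quoted above. The sketch assumes that the MWTF under radiation is inversely proportional to the product of the dynamic cross section and the execution time, and that both setups run at the same clock frequency, so that cycle counts can stand in for time; the dictionary layout is illustrative.

```python
# BubbleSort MaxACE values quoted in the text.
ocm = {"sigma": 3.8e-11, "kcycles": 44.4}    # all resources inside the beam
ddr = {"sigma": 1.3e-11, "kcycles": 666.0}   # DDR out of the beam, cache disabled

slowdown = ddr["kcycles"] / ocm["kcycles"]   # execution time penalty
sigma_gain = ocm["sigma"] / ddr["sigma"]     # cross-section reduction

# Assumption: MWTF ~ 1 / (cross section * execution time).
mwtf_ratio = sigma_gain / slowdown           # MWTF(DDR) / MWTF(OCM)

print(f"Slowdown:                {slowdown:.1f}x")
print(f"Cross-section reduction: {sigma_gain:.1f}x")
print(f"MWTF(DDR) / MWTF(OCM):   {mwtf_ratio:.2f}")
```

Under these assumptions, the cross section improves by roughly 3× while the execution time grows by 15×, so the resulting MWTF drops to about one fifth of the OCM value.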
All the above-mentioned examples provided evidence of a significant match between simulation and radiation results and validated the MOOGA approach for the generation of relevant individuals.
The analysis is complemented by Fig. 7, which shows the fault distribution between SDC and Hang, comparing simulation and radiation results. The number of memory accesses is also represented by a line (referred to the right axis). Considering the OCM setup, a quick look at the BubbleSort individuals reveals that the SDC/Hang ratios are very similar in both cases (differences of less than 15%), regardless of the number of memory accesses. The NDES MaxMWTF individual presents a similar trend. However, for the MinACE and the O0 individuals, which perform a higher number of memory accesses, the SDC/Hang ratio is reversed. A possible explanation of this effect is that the data involved in the NDES computation are more sensitive than those involved in BubbleSort. Regarding the DDR setup, and contrary to the simulation results, the Hang percentage is dominant regardless of the number of memory accesses (see the BubbleSort MaxACE and MinMWTF versions). This shows that data stored outside the beam are less vulnerable to SDC faults.
VI. CONCLUSION
This paper has demonstrated that reliability can be improved by tuning the compilation process. A blind automatic strategy has also been proposed to guide the search for the versions with the best tradeoffs between several objectives that influence application reliability. Although modern compilers are not designed to generate reliable builds, they can be tuned to generate compilations that improve reliability by means of simultaneous optimization of the fault coverage, the execution time, and the memory size. Moreover, it can be inferred from comparisons with the previous studies on a simpler processor that sophisticated hardware features play an important role in the reliability improvements that can be achieved through an efficient exploration of compiler optimizations.
