Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. The presented categories are illustrated by representative literature examples that highlight their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.
Platform HW. Microarchitecture describes how the HW constituent parts are connected and interoperate to implement the operations that the HW supports. It includes the memory system, the memory interconnect, and the internals of processors (Hennessy and Patterson 2011). This applies both to very flexible SW-programmable processors, where an instruction set is present to control the operation sequence, and to dedicated HW processing components. Dedicated HW processors feature minimal to limited flexibility. Both SW-programmable and dedicated components can be mapped on highly reconfigurable fabrics, such as field-programmable gate arrays (FPGAs). The primary difference compared to the SW-programmable processors is that not only the control flow but also the dataflow can be substantially changed/reconfigured. The microarchitecture together with the instruction set architecture (ISA) constitutes the computer architecture (although the term has recently been used to include other aspects of the design as well (Hennessy and Patterson 2011)).
50:4 G. Psychou et al.
In general, the term hardware module denotes a subset of the digital system's HW, the internals of which cannot be observed (or it is chosen that they are not observed), corresponding to the term black box (Rodopoulos et al. 2015). To define a HW module, its functionality and its interface with the external world must be described. At the microarchitectural and architectural layer, examples of HW modules are a multiprocessor system, a single core, a functional unit, the row of a memory array, a pipeline stage, and a register (without exposing the internal circuit implementation, though). In the context of this survey, the term platform hardware is an umbrella term that encompasses the microarchitectural and architectural layers of a system.
Mapping. During mapping, the algorithmic level specification is mapped into a preselected datapath and control path that implements the required behavior. Today, the term is also used to denote how an application or an application set is split, distributed, and ordered to run in a multiprocessor design.
Platform SW. To enable SW-HW interaction, an instruction set is selected initially. The instruction set defines the HW-SW interface (Hennessy and Patterson 2011). Many application instances sharing specific characteristics (a "domain") can be mapped on the same instruction set. Each of the instructions in that set can then be implemented in the HW in different ways.
Platform SW includes several sublayers that interpret or translate high-level operations (derived from the algorithmic description) into "primitive" instructions, which correspond to the instruction set and are ready to be executed by the HW. Examples include system libraries, OSs, and runtime managers.
Additional Terminology.
A control dataflow graph (CDFG) is a graph representing all possible paths that the flow of data can follow during execution. An application corresponds to a separate CDFG in the system. A process is an instantiation of a program, or a segment of code, under execution. It has its "own" memory space, containing an image of the executable code and data, resource descriptions, security attributes, and state information (register content, physical memory addressing, etc.), that is, all information necessary to execute the program. Threads are sequences of instructions, or flows of control, in a program that can be executed concurrently. All threads in a given process share the private address space of that process. The term task is used quite ambiguously in the literature. On the one hand, the terms task and process are used synonymously. On the other hand, the terms process and thread are considered "mechanic," whereas the term task is considered more conceptual and is used in the context of scheduling to denote a set of program instructions loaded in memory for execution. In this article, task is an umbrella term that can denote complete applications, subparts of the CDFG such as processing kernels (e.g., for-loops), or even single computations (e.g., instructions), depending on the context.
Rationale of the Classification and Its Presentation
The proposed classification tree is organized using a top-down splitting of the types of techniques that increase system resilience. It is accompanied by a mapping of related work (Figure 2). The top-down splitting makes it possible to reach a comprehensive list of types of techniques, which can always be expanded further on demand. Splits are created based on properties of the techniques, which allow them to be grouped together. More specifically, the properties in the proposed framework concern (1) the effect that the techniques have on the execution and (2) the changes that are required in the system design for a technique to be implemented. The properties will be elaborated as the tree is presented.
Other organizations are also possible, such as organizing the splits around the system functionality, HW components, types of errors (transient, intermittent, permanent), types of resilience metrics, or the application domains. The aforementioned organization is chosen to stress the reusability of techniques and to enable a better understanding of hybrid combinations. This is especially supported through the complementarity of the categories. It is important to note that many actual approaches that increase resilience are hybrids and do not fall strictly into only one of the categories.
For the presentation of the classification tree, the following structure is followed for each of the abstraction layers (platform HW, mapping, and platform SW). First, the main classes of techniques are presented. Within each class, subcategories are presented and illustrated with the help of a figure. Groups of nodes are discussed together. The groups are visualized by bubbles with different colors, along with the subsection number and a small geometrical shape (see Figure 2). The colors and the geometrical shapes provide a more explicit link to the corresponding sections in the text; the geometrical shapes in particular assist the reader of the black-and-white printed version. The order of the leaves, the colors, and the geometrical shapes do not indicate the significance or the maturity of the techniques. For each of the classes, pros and cons are discussed based on general properties bound to each class. Among the aspects considered are area and power overhead, performance degradation (in terms of additional execution cycles), mitigation latency (delay until the scheme fulfils the intended mitigation function), error protection, general applicability, and storage overhead. An overview of these aspects for the different classes can be found in Tables 2 through 7 in the Appendix (see the supplementary material). In parallel, representative related work is discussed to further illustrate the subcategory concept and to demonstrate the usefulness of the proposed classification scheme for classifying existing (and future) literature. Moreover, Tables 2 through 7 in the Appendix give a crude indication of the amount of literature for each of the classes.
Finally, the notion of nondeterminism is introduced and will be discussed in the article whenever appropriate. A common technique to mask the effect of errors is replication. During replication, an algorithmic function is executed again, often by using extra HW or SW. However, deterministic execution is required for replicas to work. Determinism ensures that different runs of the same function under the same input produce identical outcomes. In practice, deterministic execution is challenged by a multitude of nondeterministic events (Poledna 1996, 2007; Slember and Narasimhan 2006). Examples include nonpredictable user or sensor inputs, timers, random numbers, system calls, and interrupts.
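The effect of a nondeterministic event on replication can be illustrated with a minimal Python sketch. It is not taken from the cited works: the function `task`, the counter that stands in for a timer, and the record-and-replay helper are hypothetical names chosen for illustration. Two replicas that sample the timer independently diverge, whereas a replica that is fed the value logged by the primary reproduces the primary's outcome.

```python
from itertools import count
from typing import Callable, Iterator, List

# Hypothetical free-running timer: every read returns a new value.
_timer: Iterator[int] = count(100)


def read_timer() -> int:
    return next(_timer)


def task(x: int, timer: Callable[[], int]) -> int:
    """A function whose outcome depends on a nondeterministic input."""
    return 2 * x + timer()


# Naive replication: each replica samples the timer on its own.
naive_a = task(5, read_timer)
naive_b = task(5, read_timer)

# Record and replay (an assumption of this sketch): the primary logs the
# timer value it observed, and the replica consumes the logged value.
log: List[int] = []


def recording_timer() -> int:
    value = read_timer()
    log.append(value)
    return value


primary = task(5, recording_timer)
replayed = iter(log)
replica = task(5, lambda: next(replayed))

print(naive_a == naive_b)   # False: the replicas saw different timer values
print(primary == replica)   # True: the replica saw the logged value
```

The sketch covers a single nondeterministic source; interrupts or system calls would need their own logging points.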
PLATFORM HW
To make digital systems more robust, functional capabilities need to be provided that would be unnecessary in a fault-free environment. This section focuses on techniques that modify the HW capabilities for reliability purposes. The goal is to provide nonoverlapping categories that cover the broad range of error mitigation and resilience techniques. The complete classification scheme is shown in Figure 12 in Section 3.5. A high-level split for the proposed classification tree is shown in Figure 3. Techniques are first classified into techniques that continue the execution forward (forward) and those that move the execution to an earlier point (backward). Both categories are further split into techniques that require the addition of HW modules in the platform at design time (additional HW modules provision) and techniques that keep the number of modules the same (HW modules amount fixed). In the latter case, only a HW or SW controller may be needed. These four classes are discussed in the following subsections, as shown in Figure 3. Main criteria for further categorization include whether modifications are required in existing functionalities, existing design implementations, resource allocation, operating conditions, the interaction with neighboring modules, and storage overhead. Leaves of the tree have an accompanying simple ordinal number for identification. The numbers (together with the leaves) are collectively shown in Figure 12.
Forward Execution: Additional HW Modules Provision
This section discusses techniques that increase the resilience through adding HW modules on the platform. The added modules may have either the same (same functionality) or different (different functionality) functionality. The structure of this subtree along with the corresponding subsections is illustrated in Figure 4 .
Same Functionality.
This group includes techniques that add HW modules of the same functionality as the one(s) that should be protected. Some of the best-known and most well-established fault-tolerant techniques are found in this category. The provision of additional HW modules can be further categorized into modules that are used in parallel execution mode and modules used as spares. Parallel execution denotes that the modules are all active and processing operations (or hold/transfer data and instructions for processing). The term spares denotes that the added modules are not all executing in parallel with the default ones. They only start executing under certain conditions.
Parallel execution 1. In general, parallel execution implies that the modules are all actively used for the intended functionality, or at least potentially when the workload is very high. The term lockstep denotes a mode of operation in which HW modules execute the same operations of the same program at the same time. Generally, lockstep processing can be "tight" or "loose" depending on whether the outputs of the modules are synchronized at the operation level or only selectively, such as at the I/O level (Aggarwal 2008). Lockstep processing makes a system more robust either by masking an error, for example by allowing the correct output to be produced independent of which module caused the error, or by using explicit knowledge of the faulty module.
In the first case, multiple modules (N modules in the general case) with the same specification as the primary module are provided, and majority voting is applied at their output. No error detection is required, as the error is masked through the voting. This results in a well-known technique called N-modular redundancy (NMR). Typically, N is an odd number to avoid tied votes. Most often, the scheme has been employed in the form of triple modular redundancy (TMR) so that a correct output is produced with a two-out-of-three vote.
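The voting step of NMR can be sketched in a few lines of Python. This is an illustrative SW model, not a HW voter design: the replica functions `healthy` and `faulty` and the bit-flip fault model are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def nmr_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, votes = Counter(outputs).most_common(1)[0]
    if 2 * votes <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def run_nmr(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Feed the same input to every replica and vote on the outputs."""
    return nmr_vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    # Hypothetical fault model: one output bit is flipped.
    return (x * x) ^ 0b100


# TMR: the single faulty replica is outvoted two to one.
tmr_result = run_nmr([healthy, faulty, healthy], 5)
print(tmr_result)  # 25
```

With N = 3, one erroneous replica is masked without any explicit detection step; two replicas that fail with the same wrong value would win the vote, which is outside what this sketch protects against.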
Lockstep processing can also be combined with system awareness of the faulty module. In this case, a separate detection scheme identifies the faulty module. Majority voting is not required because, after detection, the faulty module is no longer considered valid. Only the output of the other module(s) is considered valid. Thus, in this case, two modules operating in lockstep suffice for producing a correct output. One technique commonly found in the literature that belongs to this category is the so-called pair-and-spare technique. In pair-and-spare, two pairs of replicas operate in lockstep, as illustrated in Figure 5. Within each pair, error detection is performed through a comparison circuit. In the presence of an error, the faulty pair declares itself as faulty. Then the output of the other pair is selected as the valid one. Replica determinism is not an issue here, as the processors perform their operations simultaneously and operate on identical inputs (Poledna 1996). Pros in this class include the high error protection, the lack of latency and performance overhead, and the general applicability. Cons include the very high area (e.g., 200% for TMR) and power overhead. Literature examples on the aforementioned concepts include Dickinson et al. (1964), Jewett (1991), and May et al. (2008) on TMR, and Stratus computers and the VAXft 3000 minicomputer (Siewiorek 1990) on pair-and-spare.
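The selection logic of pair-and-spare can be modeled as follows. This is a simplified sketch under stated assumptions: replicas are plain Python functions, the comparison circuit is an equality check, and the names `run_pair`, `pair_and_spare`, `good`, and `bad` are hypothetical.

```python
from typing import Callable, Optional, Tuple

Module = Callable[[int], int]
Pair = Tuple[Module, Module]


def run_pair(pair: Pair, x: int) -> Optional[int]:
    """Run both replicas of a pair on the same input.

    Returns the common output, or None when the comparison fails and the
    pair declares itself faulty.
    """
    first, second = pair[0](x), pair[1](x)
    return first if first == second else None


def pair_and_spare(pair_a: Pair, pair_b: Pair, x: int) -> int:
    """Select the output of a pair that did not report a mismatch."""
    for result in (run_pair(pair_a, x), run_pair(pair_b, x)):
        if result is not None:
            return result
    raise RuntimeError("both pairs declared themselves faulty")


def good(x: int) -> int:
    return x + 1


def bad(x: int) -> int:
    # Hypothetical fault: the least significant output bit is flipped.
    return (x + 1) ^ 1


# The first pair detects a mismatch, so the second pair's output is used.
selected = pair_and_spare((good, bad), (good, good), 41)
print(selected)  # 42
```

No voting takes place: each pair only detects, and the selection relies on the faulty pair identifying itself.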
Spares 2. In this category, the added modules, which deliver the same functionality as the original ones, act as spares. The role of spare modules can potentially be dual. The first use of spares is to remain in standby mode and take over execution when the primary module fails. The second use of spares is to take over execution (or be included in the system operation) for part of the time, without the primary module experiencing a failure. That means that the execution can potentially alternate between the spare and the primary module. Several reasons can motivate such a scheme. One possibility is related to the benefits of sharing the workload (in time). For example, it is known (He et al. 2011) that the device stress, which contributes to system aging, is higher under full workload operation than when active and inactive periods alternate. By alternating the execution between a primary and a spare, the lifetime of the system could be extended. Another possibility is that the modules (original and spare) have partially different internal implementations, which give them characteristics that fit certain conditions better. In this case, the execution may alternate depending on the changing application requirements, for example, due to changes in the input workload or in environmental parameters (e.g., noise or temperature). Pros include the high error protection and lack of performance overhead. Cons include the area overhead. The power overhead can be avoided depending on whether the spares are powered or not, and this is a trade-off with latency (see supplementary material). The approach is generally applicable, except if spares are tailored to fit changing application requirements. Literature examples on the aforementioned concepts include Chean and Fortes (1990) and Srinivasan et al. (2005) on spares with failing modules and Shin et al. (2008) and Narayanan et al. (2010) on spares with working modules.
Different Functionality.
This group includes techniques that add HW modules of different functionality than the one(s) that should be protected or become more robust. Again, a distinction can be made between modules that are in parallel execution mode and modules that act as spares.
Parallel execution 3. Here, the added module performs different functions than the original module. Several possibilities exist. One category comprises hybrid schemes in which the added modules are designed to be more robust (e.g., by employing circuit-level techniques). The added module can perform only a subset of the operations of the original module for verification purposes, that is, it is a module with reduced functionality. For example, the most crucial operations, or the ones that cannot be performed (repeated) by any other of the already existing modules on the platform, may be performed by the added module. Since it is designed to be more robust, its output is assumed to be the correct one. Another possibility is that the added module performs a superset of the operations of the original module, namely it is a module of increased functionality: it performs the operations of the original module plus additional operations that are normally performed by other modules on the platform. This would be the case when the added module acts as a supervisor for several modules. An additional possibility is that the added module performs different types of functions. For example, it may perform some error correction. Given that the HW module granularity can go down to a register, error correction codes (ECCs) are placed in this category. They are typically implemented in memory structures but also in buses, state machines, and arithmetic units. Figure 6 shows an example of single-bit correction with the Hamming code (Hamming 1950). Syndrome bits are created during the read operation. If a single error occurs, the syndrome identifies the erroneous bit. Pros include the flexibility to trade off area, power, performance overhead, and latency against the error protection by selecting a fitting functionality to be added. Cons include that this class generally requires system-specific solutions (although for ECC, reusable concepts are typically applied). Literature examples on the aforementioned concepts include algorithmic noise tolerance (ANT) (Hegde and Shanbhag 2001) on modules with reduced functionality, and Hamming (1950) and Dutt et al. (2014) on ECC.
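Single-bit correction with a Hamming code can be made concrete with a small Python model of the (7,4) code. The sketch assumes 1-based bit positions with parity bits at positions 1, 2, and 4; it is a SW illustration of the syndrome idea, not the circuit of Figure 6, and the function names are chosen for this example.

```python
from typing import List


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Positions are 1-based; parity bits sit at positions 1, 2, and 4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # checks data at positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # checks data at positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # checks data at positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(word: List[int]) -> int:
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            s ^= position
    return s


def correct_and_decode(word: List[int]) -> List[int]:
    """Flip the bit named by the syndrome (if any) and return the data bits."""
    fixed = list(word)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]]


stored = encode([1, 0, 1, 1])
stored[4] ^= 1                      # single-bit upset at position 5
error_position = syndrome(stored)   # the syndrome names position 5
recovered = correct_and_decode(stored)
print(error_position, recovered)    # 5 [1, 0, 1, 1]
```

The syndrome is computed on read, as in the text; a double-bit error would be miscorrected by this code, which is why extended variants add an overall parity bit.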
Spares 4. As already discussed, spare modules can be present to take over execution in case the primary module fails or to take over execution for part of the time, even if no failure is present. A reduced functionality spare module is able to continue execution at a reduced power and area overhead but also at a degraded performance (since only part of the functionality is available). An increased functionality spare module is able to continue execution in an environment where the primary module has been shown to be not good enough. By using its additional functionality, it keeps or improves the reliability target (at extra area and power cost). Similarly to the earlier category, pros include the flexibility to trade off area, power, latency, and performance against error protection by selecting the appropriate solution. Cons, generally in this class, include that system-specific solutions are required. In the literature, techniques that employ spares with reduced functionality have been identified. Examples can be found in Tomayko (1986).
Forward Execution: HW Modules Amount Fixed
This section discusses techniques that use the same number of modules on the platform as the original system (before reliability-related countermeasures are added). HW modifications (e.g., adding interconnects) may be required, but no additional HW module is added. A HW or SW controller is often needed to coordinate the actions. These techniques are split into techniques that reuse the existing HW modules (existing HW modules) and those that replace one (or more) module with an alternate to make the system more robust (alternate HW modules). The first category is further split into techniques that change the way of operation of the HW modules (HW modules operation mode) and techniques that leave the operation unaltered and change the way the workload is mapped on these HW modules (resource allocation). Changes to the operation of the HW modules focus either on the functionality (functionality control) or on the operating conditions (operating conditions control). Functionality-oriented modifications focus either on the internals of a HW module, so that the intended module usage is exploited for reliability purposes (internal functionality reuse), or on the input-output behavior of the module and how it interacts with the other modules (I/O configuration modification). Figure 7 shows the proposed subtree and its division into subsections.
Internal Functionality Reuse 5.
Techniques belonging to this category are very system/application dependent. For example, communication or signal processing systems typically have blocks that perform channel or source coding. Channel decoders mitigate errors introduced by the channel and can potentially be reused to mitigate HW-induced errors. Pros include the lowest possible area and power overhead due to the reuse. Cons include the lack of general applicability, latency, possible performance costs, and limited error protection. Literature examples that reuse the channel decoder include Khajeh et al. (2012) and Brehm et al. (2012).
I/O Configuration Modification.
This group of techniques reorganizes the interaction of a module with the other modules. This can potentially mean a different way of connecting or communicating (connectivity with neighboring HW modules) or even an isolation action (isolation capability), during which an erroneous module is bypassed from the system.
Connectivity with neighboring HW modules 6. Intermodule techniques can exploit inherent redundancy typically present in regularly structured systems, such as arrays of processing elements (PEs), to increase the masking and correction capability of the system. Today, high performance is achieved primarily by chip multiprocessors (CMPs). CMPs are composed of multiple cores located in a single die or on multiple dies in a single package. The types of cores may vary from simple, in-order processors to more complex, superscalar ones. They enable high performance through parallel computation. CMPs are used here as a driver, but the ideas can be applied to other regularly structured systems where reuse is possible. The availability of the cores can be exploited to create masking capability by, for example, running a process on three cores in parallel in a TMR structure. Alternatively, the HW itself can be built as reconfigurable so that the modules can be connected in a different way depending on runtime conditions. Typically, this last possibility is found in the form of a hybrid. For example, it is often found together with spare modules. Pros include the low area and power overhead (due to the reuse of existing modules but with additional cross links), the general applicability (for systems with inherent redundancy), and the potentially high error protection. Cons include the latency and the blockage of resources for reliability that could otherwise be used to improve performance. A literature example that employs a modified connection network in CMPs is found in Aggarwal et al. (2007).
Isolation capability 7. To prevent erroneous results from corrupting the system output, faulty components can be bypassed (through a switch) or powered off, provided that such an isolation capability has been added to the system. The system continues operating but at a degraded performance. These schemes exploit inherent redundancy in regularly structured systems, such as arrays of PEs, memories, and interconnection networks or even processors. Pros include low area and power overhead, and general applicability (for systems with inherent redundancy). Cons include latency, degraded performance, and limited error protection. Literature examples of the concept include Srinivasan et al. (2005) and Bower et al. (2005) on structures within processors, and Gupta et al. (2008) and Romanescu and Sorin (2008) on pipeline stages in CMPs.
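The trade-off between isolation and performance can be illustrated with a toy model of a PE array. The round-robin policy, the unit-time work items, and the names `schedule` and `makespan` are assumptions of this sketch; real schemes perform the bypass in HW.

```python
from typing import List, Sequence


def schedule(work_items: Sequence[int], pe_healthy: Sequence[bool]) -> List[List[int]]:
    """Assign work round-robin to healthy PEs; isolated PEs receive nothing."""
    active = [index for index, ok in enumerate(pe_healthy) if ok]
    if not active:
        raise RuntimeError("no healthy processing element left")
    queues: List[List[int]] = [[] for _ in pe_healthy]
    for k, item in enumerate(work_items):
        queues[active[k % len(active)]].append(item)
    return queues


def makespan(queues: List[List[int]]) -> int:
    """Completion time in steps, assuming one work item per PE per step."""
    return max(len(queue) for queue in queues)


items = list(range(12))
all_healthy = schedule(items, [True, True, True, True])
one_isolated = schedule(items, [True, False, True, True])

print(makespan(all_healthy))   # 3 steps with four PEs
print(makespan(one_isolated))  # 4 steps after PE 1 is bypassed
```

The system keeps producing results after the bypass, but the same workload takes longer, which is the degraded performance noted among the cons.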
Operating Conditions Control 8 •.
Operating conditions represent the interference caused to a digital system by its environment (Rodopoulos et al. 2015). This covers a broad range of effects such as radiation, temperature, and humidity but also electrical stimuli. This category includes all actions that influence the operating conditions of the digital system, beyond changing the system's functionality.
Typically, operating parameters such as the supply voltage and the clock frequency are controlled to manage the performance, power, and reliability trade-offs. Scaling the voltage beyond a critical limit can lead to excessive error rates. However, using conservative guard bands for the voltage setting can lead to significant power overhead. Pros include lack of area overhead and general applicability (assuming that knobs are present in the system for power management). Cons include the latency and limited error protection. Power and/or performance will typically be affected depending on the knob being used. Here, works that implement control algorithms that change operational parameters are classified, such as the examples of Karl et al. (2006) and Rosing et al. (2007).
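A control algorithm of this kind can be sketched as a simple feedback rule on the observed error rate. The sketch is not the method of the cited works; the function name and all numeric values (target rate, step size, voltage bounds) are assumptions chosen only to make the example concrete.

```python
def next_supply_voltage(
    voltage_mv: int,
    errors: int,
    cycles: int,
    *,
    target_rate: float = 1e-6,  # assumed tolerable errors per cycle
    step_mv: int = 10,          # assumed regulator step size
    v_min_mv: int = 700,        # assumed lower voltage bound
    v_max_mv: int = 1100,       # assumed upper voltage bound
) -> int:
    """Return the supply voltage (in mV) for the next control interval."""
    rate = errors / cycles
    if rate > target_rate:
        voltage_mv += step_mv   # too many errors: restore guard band
    elif rate < target_rate / 10:
        voltage_mv -= step_mv   # ample margin: save power
    # Otherwise the rate is inside the dead band and the voltage is kept.
    return max(v_min_mv, min(v_max_mv, voltage_mv))


print(next_supply_voltage(900, errors=50, cycles=1_000_000))  # 910
print(next_supply_voltage(900, errors=0, cycles=1_000_000))   # 890
```

The dead band between the two thresholds avoids oscillating between neighboring voltage levels; the reaction time of one control interval corresponds to the latency listed among the cons.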
Resource Allocation 9.
Here, the manner in which HW resources are assigned is modified without changing the way of operation of the HW modules. In the simplest case, a task is migrated to another module or swapped with another task. Pros include the limited area, power, and performance overhead due to the reuse of modules (with the exception of adding specialized interconnects), and the rather general applicability (for systems with inherent redundancy). Cons include latency during migration and limited error protection. Literature examples on HW-based task migration include Powell et al. (2009) and Venkataraman et al. (2015).
Alternate HW Modules 10 ✖.
This category includes schemes that replace an existing HW module with another, more robust implementation for the system context (without employing circuit or lower-layer techniques). Pros include the limited area, power, and performance overhead, as the new implementation will typically satisfy the system requirements while minimizing additional cost. Cons include that system-specific solutions are required (if existing at all), and typically only limited error protection will be possible. Literature examples in this category include Hussien et al. (2010, 2011) on alternate channel decoders.
Backward Execution: Additional HW Modules Provision
This section discusses techniques that increase the resilience of systems through rollback to an earlier point of execution and repetition of the execution. Just like in the forward execution category, the added modules can have either the same (same functionality) or different (different functionality) functionality. The corresponding categories and subsections are shown in Figure 8.
3.3.1 Same Functionality 11.
This category discusses techniques that provide additional HW modules with the same functionality as the original ones. The recovery is achieved by repeating the execution when an error is detected. In this category, the second module plays an active role in the recovery. For example, it can activate the execution repetition or provide necessary information to the first module so that the execution is repeated successfully. When the second module executes the same instruction sequence, nondeterministic execution is not a concern, as long as identical inputs can be provided to both modules. Pros include the potentially high error protection (at the expense of performance and latency). Moreover, the technique is generally applicable. Cons include the high area and power overhead. A literature example in this category is Pflanz and Vierhaus (2001).
Different Functionality 12.
Instead of adding modules with the same functionality, modules with different functionality can be added; the added modules play an active role in the recovery as in the previous category. The added modules can have reduced or increased functionality, as in the corresponding forward category and for similar reasons. Pros include the flexibility to trade off area, power, performance, and latency against error protection depending on the selected functionality. Cons include that the solutions are rather system specific. A literature example in this category is Austin (1999).
Backward Execution: HW Modules Amount Fixed
The majority of the techniques proposed in the literature that employ backward execution reuse the already existing HW modules, as the additional area overhead of the previous category is avoided. This can be achieved by techniques that retry the execution without explicit storage (retry without state storage) and techniques that retry by storing some (redundant) system information at intermediate execution points to be used for system recovery (retry with state storage). Checkpointing is a term that refers to the intermediate storing of the application's state (or of part of it), such as register and memory contents. Additional events may be registered as part of the state, which are called logs. The corresponding categories and subsections are shown in Figure 9.
Retry Without State Storage.
This category includes the schemes that move back the execution to an earlier point and repeat it upon error detection. The execution can be successfully repeated without explicitly storing the system state either because the state information is not really needed or because it is provided indirectly by executing another task, which produces the required information. In the degenerate case, a HW-driven restart/reboot procedure can be triggered to remedy transient errors. The techniques can be further distinguished into techniques that take place within the boundaries of a single module (i.e., intramodule) and techniques that operate across modules (i.e., intermodule), as shown in Figure 9.
Intramodule 13. This category comprises schemes that either exploit inherent features of processors to retry a task execution, such as instruction retry or cache refetch, or employ additional HW-based tasks.
For example, Ray et al. (2001) propose to use the preexisting instruction rewind mechanism present in superscalar machines for branch mispredictions to handle error recovery. After detecting an error (by duplicating the instruction during the decode stage and comparing the results before committing), the contents of the reorder buffer (ROB) are flushed and the instruction is reexecuted, similarly to what happens upon a branch misprediction event (Figure 10(a)). In case the results agree after cross checking, a single instruction retires and execution proceeds.
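The duplicate, compare, and retry flow can be modeled at a very high level in Python. The sketch abstracts away the pipeline and the ROB entirely: `execute_with_retry` and the fault injector `TransientAdder` are hypothetical names, and the single bit flip on one specific call is an assumed transient fault.

```python
from typing import Callable, Tuple


class TransientAdder:
    """Adder model that corrupts exactly one call (a transient fault)."""

    def __init__(self, faulty_call: int) -> None:
        self.calls = 0
        self.faulty_call = faulty_call

    def __call__(self, a: int, b: int) -> int:
        self.calls += 1
        result = a + b
        if self.calls == self.faulty_call:
            result ^= 1  # hypothetical single-bit upset in the result
        return result


def execute_with_retry(
    op: Callable[[int, int], int], a: int, b: int, max_retries: int = 3
) -> Tuple[int, int]:
    """Execute the operation twice and commit only if both results agree.

    Returns (committed result, number of retries that were needed).
    """
    for retries in range(max_retries + 1):
        first = op(a, b)
        second = op(a, b)
        if first == second:
            return first, retries
        # Mismatch: discard both results (the flush) and reexecute.
    raise RuntimeError("results still disagree after all retries")


adder = TransientAdder(faulty_call=2)
outcome = execute_with_retry(adder, 3, 4)
print(outcome)  # (7, 1)
```

Because no state is stored, the retry only helps when the fault has disappeared by the time of the reexecution; a permanent fault would exhaust the retries.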
HW-based tasks in the literature are implemented by simultaneous multithreading (SMT) to increase on-chip parallelism. SMT is a technique that allows multiple threads to issue multiple instructions each cycle on a superscalar processor (Tullsen et al. 1996). The threads can be separate from each other or coupled to each other. Separate threads could potentially be created to execute the same program in a TMR structure, assuming that care is taken so that the threads use identical shared resources. The literature focuses on employing HW-based threads in coupled execution mode. In this mode of operation, the threads communicate with each other, that is, one thread uses some knowledge from the other thread(s) to execute the program. Coupled execution has been used with processors to speed up execution, and the idea has been reused for fault tolerance (Sundaramoorthy et al. 2000). The concept is as follows. Two streams of the same program run in parallel but with a time lag (see Figure 10(b)). The first stream is a less accurate one, as it processes fewer instructions than a complete stream would. It bypasses certain computations and branch instructions as indicated by a HW monitor that has observed past instances. Thus, it can run faster than a complete stream. Its results are stored in a delay buffer. The second stream is an accurate one, as it executes all of the instructions. However, it receives information from the first stream through the delay buffer, which allows it to run faster. For example, it uses memory load values and thus can avoid memory latencies. When the second thread commits (writes its results to the registers), the results from both threads are compared. If they are not identical, the results of the second one are used to restore the system state.
Nondeterministic events such as traps and exceptions are handled with some minimal support from the OS. The first stream stalls until the delay buffer completely empties and the second stream is terminated. The first stream is serviced (by the OS) and execution resumes. Pros include the low area and power overhead, potentially high error protection (but only for transient errors), and rather general applicability. Cons include the latency, performance overhead, and the limitation to transient errors.
Intermodule 14. This category has similar properties to the preceding one but requires the cooperation of modules. Pros and cons are similar to the previous category, but here permanent errors can also be handled and synchronization issues have to be addressed. In the literature, mainly examples that include tasks with time lag have been identified: Sundaramoorthy et al. (2000) and Gomaa et al. (2003).
Retry With State Storage
This group of techniques employs the storage of a complete or partial error-free state and the rollback to that state upon detection of an error. Afterward, the execution is repeated to acquire error-free results, assuming that the error was transient. The techniques are also distinguished into intramodule and intermodule. In the latter category, issues that have to do with the state synchronization among several modules have to be addressed. Checkpointing/rollback refers to a widespread concept, according to which the state of a process is proactively stored at certain intervals during execution so that the correct state is restored in case of an error. The majority of prior work realizes SW-based checkpointing/rollback schemes. However, groups both in industry and academia have provided fully HW-based implementations. Typically, when these schemes address nondeterminism, this is done by synchronizing the checkpoints with the external events (e.g., interrupts). Namely, when an external event takes place, a checkpoint is forced.
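The checkpointing/rollback concept can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not any of the surveyed implementations: the state is a plain dictionary, a checkpoint is a deep copy taken every `interval` steps, and error detection is modeled by a hypothetical set `detected_errors` of step indices at which a transient error is flagged once.

```python
import copy


def run_with_checkpoints(steps, state, interval, detected_errors=()):
    """Apply `steps` to `state`, rolling back to the last checkpoint on error.

    Returns (final_state, number_of_rollbacks).
    """
    pending = set(detected_errors)          # each error is transient: seen once
    saved_index, saved_state = 0, copy.deepcopy(state)
    rollbacks = 0
    index = 0
    while index < len(steps):
        if index % interval == 0:
            # Proactively store the state at fixed intervals.
            saved_index, saved_state = index, copy.deepcopy(state)
        steps[index](state)
        if index in pending:
            # Error detected: restore the stored state and repeat from there.
            pending.discard(index)
            index, state = saved_index, copy.deepcopy(saved_state)
            rollbacks += 1
            continue
        index += 1
    return state, rollbacks


def increment(state):
    state["x"] += 1


final_state, rollbacks = run_with_checkpoints(
    [increment] * 5, {"x": 0}, interval=2, detected_errors={3}
)
```

A smaller `interval` shortens the reexecuted span after an error but increases the number of stored copies, which mirrors the latency and storage trade-off discussed for this group.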
Intramodule 15. This category includes schemes that store the whole state or a subpart of the state of a module to restart the execution from that stored point if an error occurs. The storage can take place in the main memory, hard disk, register file, or memory buffers and is often complemented by another error resilience technique, such as ECC, to be more robust. A broad range of checkpointing techniques exists, from techniques that store checkpoints very rarely (every thousands up to billions of cycles) assuming low error rates to techniques that perform checkpointing very often (every few cycles) assuming high error rates. Pros include the high error protection (for transient errors only) and the general applicability. Cons include latency (depending on the checkpointing granularity), performance overhead (depending also on whether checkpointing is overlapped with normal execution), and the limitation to transient errors. Area and power overhead is medium. Literature examples include Ahmed et al. (1990).
Intermodule 16. Such schemes are typically found in multicore architectures. Here, on top of the external nondeterministic events, such as interrupts, internal events, such as the accesses to the shared memory, also have to be taken care of. These checkpointing schemes can be characterized as global and local. In the global schemes, common checkpoints are created among all modules and upon detection all modules have to roll back to an earlier state (even when many of them are error free). Figure 11 illustrates the concept. A challenge with this approach is the scalability as the number of cores increases. Local checkpointing schemes, in contrast, allow such actions to be made by a subset of the modules performing only local synchronization and information storage. A taxonomy of HW-based checkpointing schemes for CMPs can be found in Prvulovic et al. (2002). Pros and cons are similar to the previous category but with extra synchronization costs.
Global schemes induce more overhead during checkpointing but have a simpler recovery compared to local schemes. Literature examples include Agarwal et al. (2011) on local schemes and Sorin et al. (2002) on global schemes.
Overall Platform HW Classification
The subtrees presented in the previous sections are combined to form the overall classification tree for platform HW techniques, as shown in Figure 12 . Starting from the top-level split of Figure 3 , 
PLATFORM SW
In this section, techniques that extend the platform SW capabilities for reliability purposes are presented. The proposed classification is built around the notion of tasks in a similar way that the earlier section was built around the notion of HW modules. The complete classification scheme is shown in Figure 22 in Section 4.5. As in the platform HW section, the first split in the platform SW techniques is between (forward) and (backward) techniques. Forward techniques are then further split into techniques that require additional tasks (additional tasks provision) and techniques that cope with the existing number of tasks (tasks amount fixed). Backward techniques are split into techniques that require the reexecution of tasks without storing state information (retry without state storage) and techniques that do require such intermediate state storage (retry with state storage). Since reexecution of tasks is a prerequisite for a technique to be characterized as a backward technique, backward techniques always include additional tasks (a reexecution can be seen as providing an additional identical task in time). These four classes are discussed in the following sections, as shown in Figure 13. Main criteria for further categorization into classes include whether modifications are required in existing functionalities, existing task implementations, the resource allocation, the interaction with neighboring tasks, execution mode (of additional tasks), and cooperation among HW modules. Here, end nodes are accompanied by an ordinal number.
Forward Execution: Additional Tasks Provision
This section discusses techniques that increase the resilience of systems through providing additional tasks using the platform's SW without moving the execution to an earlier point. The added tasks may have either the (same functionality) or (different functionality). The structure of this subtree and the corresponding subsections are illustrated in Figure 14 . These techniques can be differentiated according to the granularity of the replicated task (instruction, thread, process) and/or according to the abstraction layer in the SW stack (see Section 2.2.1) where the replication takes place.
4.1.1 Same Functionality . As in the corresponding platform HW category, this category can be further split into tasks that are in parallel execution mode and tasks that act as spares.
Parallel execution 1. The parallel execution implies that the HW resources are available, as in the case of a multicore architecture. If additional HW resources are provided but the tasks run at the higher SW stack, then this category is a hybrid with the (additional HW modules provision/same functionality) category. Here, similar concepts can be applied as in the corresponding platform HW category, but differences exist as well. Parallel tasks can run in lockstep. They may be configured in an execution mode, according to which awareness of the faulty task is not required, since the error is masked, as in an NMR or TMR structure combined with voting. The alternative is that the faulty module is identified (some explicit detection/diagnosis scheme is present) and then only the outputs of the other running task(s) are considered valid. The concept of using multiple tasks in parallel and applying majority voting at their outputs has been presented in the literature for a long time. Execution among the redundant processes/threads must be deterministic. The literature has focused mostly on employing such schemes on multicore architectures. Thus, the first source of nondeterminism that has to be tackled is the shared memory accesses.
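The masking configuration (NMR or TMR combined with voting) can be sketched as follows. This is a minimal sequential illustration that assumes deterministic replicas; the names `majority_vote`, `run_nmr`, and `fault_injector` are hypothetical, and a real scheme would run the replicas on separate HW resources.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def run_nmr(task, inputs, replicas=3, fault_injector=None):
    """Run `replicas` copies of a deterministic task and vote on the outputs.

    `fault_injector(replica_index, output)` is a hypothetical hook that
    lets a test corrupt the output of selected replicas.
    """
    outputs = []
    for replica in range(replicas):
        output = task(*inputs)
        if fault_injector is not None:
            output = fault_injector(replica, output)
        outputs.append(output)
    return majority_vote(outputs)


# One corrupted replica out of three is masked by the voter.
masked = run_nmr(
    lambda a, b: a * b,
    (6, 7),
    fault_injector=lambda replica, out: out + 1 if replica == 1 else out,
)
```

The voter does not need to know which replica failed, which is the sense in which fault awareness is not required in this mode.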
When the parallel execution is combined with fault awareness, majority voting is not necessary. Two tasks are typically sufficient to have a robust execution. In the presence of nondeterministic events, such as I/O operations, the literature deals with this category typically in the following manner. One of the two tasks (primary) executes I/O operations, and the other task (backup) gets informed about the results. If the primary task is declared faulty, then the backup takes the role of the primary and continues operation, including I/O operations.
Pros include the high error protection (for the task that is protected) and the lack of latency (except if there is a lack of empty slots during scheduling), storage, and power overhead (assuming that the HW resources would be used anyway). Depending on the SW stack level implementation, different degrees of transparency can be achieved. Cons include the blockage of resources for replicating functions, leading indirectly to performance overhead. Literature examples include Avizienis (1985) and Fiala et al. (2012). [...] (Figure 15) generates the replica threads and performs the output checks.
Spares 2. As discussed in Section 3.1 in the HW category, a spare can have a dual role: to take over execution when the primary module fails or, potentially, to take over execution for part of the time so that the execution alternates between the primary and the spare, depending on the objectives. Having a task with the same functionality as a spare could make sense, for example, if upon failure of the primary task the spare task is loaded from a different instruction memory location (which may be considered more robust). The execution would continue forward (past errors would not be corrected), but the loading action could prevent future error manifestations due to corrupted instruction memory. Pros would include the limited storage, area, performance overhead, and latency, the high error protection (only for instruction memory errors), and the general applicability. Cons would include the limitation to instruction memory errors. Literature schemes that use alternate tasks as spares in a forward mode have not been found.
Different Functionality .
In this case, the added task(s) deliver a different functionality than the one that should be made more robust. This category can also be distinguished into tasks in parallel execution mode and tasks that act as spares.
Parallel execution 3. In principle, similar types of techniques can be applied as in the corresponding platform HW category. The added task may perform a subset or a superset of the functions of the original task, but because of some complementary technique (i.e., it is a hybrid), it is assumed to be more robust. The complementary technique may be that the PE on which the added task runs is configured to be more robust, or the added task performs some different function, such as error correction. It must be noted that parallel execution in this context does not necessarily imply that the additional task runs at the exact same moment as the original one, but it is active in parallel with the original task during the system lifetime. For example, it may be executed more sporadically (e.g., periodically). Pros include the flexibility to balance among storage, power, performance, latency, and error protection through appropriately selecting the added function. Cons include the need for system-specific solutions and the blockage of resources. A literature example is Shirvani et al. (2000).
Spares 4 . Tasks that have a different functionality can also be potentially used as spares. For example, the backup task may have some inherent error correction coding, which makes it more appropriate than the original task for given external conditions. Pros and cons are similar to the previous category; however, in this case, the added task runs instead of the original task, leading to partially different costs depending on the exact implementation. Literature schemes that use spare tasks with different functionality in a forward execution mode have not been found.
Forward Execution: Tasks Amount Fixed
This section discusses techniques that do not provide additional tasks in the system to make it more reliable. Initially, these techniques can be distinguished between techniques that make use of the existing tasks on the platform (existing tasks) and techniques that replace the implementation of existing tasks with an alternate implementation (alternate tasks). The techniques based on the existing tasks can be split into techniques that manipulate the functionality of the tasks (functionality control) and techniques that rearrange the allocation of the tasks into the HW resources (resource allocation). Figure 16 shows the proposed subtree and its division into subsections. 
Functionality Control . Techniques that are focused around the functionality of tasks can either operate within the task boundaries by reusing the task functionality (internal functionality reuse) or operate outside the task boundaries by rearranging its interaction with the other tasks (I/O configuration modification). The latter can be further split into techniques that reorganize the execution sequence (scheduling-ordering) and techniques that isolate results from corrupted tasks (isolation capability).
Internal functionality reuse 5. In this category, some knowledge of the inherent task functionality is exploited in a useful way to increase resilience. By configuring some parameters through the platform SW, a more resilient execution is achieved. Although a literature example in the present context has not been found, such a scheme could be similar to the following example. Pant et al. (2012) consider several parameters for the H.264 encoding: subpixel motion estimation, size of motion estimation search window, DCT window size, and run-length encoding mechanism. They show that adapting those parameters can utilize the application algorithm's quality or performance trade-offs to achieve error-free operation in the presence of permanent manufacturing variations. Pros include the low storage and power overhead. Cons include the need for very system-specific solutions and the limited error protection. Performance and latency may be affected depending on the exact implementation.
I/O configuration modification/scheduling-ordering 6. These schemes reorganize the application or instruction profile so that the reordered execution is more robust. Typically, additional information is used regarding the vulnerability of certain instructions or operands (and thus registers). By rescheduling the flow, the interval in which the most vulnerable operations and operands are used is minimized. Pros include the very limited storage, power overhead, and rather general applicability. Cons include the need for additional information to guide the rescheduling and the limited error protection. Performance and latency may be affected. Literature examples at the instruction level include Yan and Zhang (2005) and Rehman et al. (2012).
I/O configuration modification/isolation capability 7. This class includes techniques that isolate tasks so that errors do not propagate to subsequent tasks or the output. To be more accurate, the focus is on the results of the tasks. This requires some application knowledge to ensure that the impact of the discarded computations is not destructive. Pros include the lack of storage, power overhead, and latency. Cons include the need for system-specific solutions, the low error protection (through isolation), and the potential performance degradation. A hybrid example is de Kruijf et al. (2010).
Resource Allocation 8 .
This category includes techniques that change the assignment of tasks and data to HW components in a way that the most reliable match between task and HW module is found. This can be in the context of a single processor or a multiprocessor system. To find the best match, information regarding the vulnerability (or robustness) of the tasks and of the HW modules has to be identified first. The category also includes the case where although the task was initially running in one core, it is migrated to another compatible core (task migration). Pros include the very limited storage, power overhead, and rather general applicability. Cons include the need for additional information to guide the reallocation and the limited error protection. Performance and latency may be affected. Literature examples include Yan and Zhang (2005) on register allocation, and Rahimi et al. (2013) and Chakravorty et al. (2006) on task-processor mapping.
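A vulnerability-aware task-to-core assignment can be sketched as follows. The sketch assumes, purely for illustration, that each task has a scalar vulnerability score, each core a scalar failure rate, and at most one task is placed per core; the function name and the scores are hypothetical and are not taken from the cited works. Under the objective of minimizing the sum of (vulnerability × failure rate), pairing the most vulnerable task with the most reliable core is optimal by the rearrangement inequality.

```python
def map_tasks_to_cores(task_vulnerability, core_failure_rate):
    """Assign each task to a distinct core, most vulnerable task first.

    task_vulnerability: dict task -> vulnerability score (higher = worse)
    core_failure_rate:  dict core -> failure rate (lower = more reliable)
    """
    if len(task_vulnerability) > len(core_failure_rate):
        raise ValueError("more tasks than cores in this one-task-per-core sketch")
    tasks = sorted(task_vulnerability, key=task_vulnerability.get, reverse=True)
    cores = sorted(core_failure_rate, key=core_failure_rate.get)
    return dict(zip(tasks, cores))


mapping = map_tasks_to_cores(
    {"fft": 0.9, "ui": 0.1, "log": 0.5},
    {"c0": 1e-6, "c1": 1e-4, "c2": 1e-5},
)
```

The sketch makes visible the main cost named above: the vulnerability and failure-rate figures have to be obtained before any reallocation can be decided.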
Alternate Tasks 9.
Here, an existing task is modified and replaced by an alternate one with a more robust implementation without providing additional tasks and without implementing a different algorithm. Typically, this is driven by information regarding vulnerable parts of the HW and is fed into the compiler, which performs the alterations. Pros include the lack of storage and latency. Cons include the limited error protection and rather system-specific applicability. Depending on the applied scheme, power and performance may be affected. Literature examples include altered code that bypasses faulty HW (Meixner and Sorin 2008) and altered code that reduces the amount of critical instructions (Rehman et al. 2011).
Backward Execution: Retry Without State Storage
This section discusses techniques that increase the resilience of systems by retrying the execution using the SW stack without the need for storing intermediate state. Such schemes move back the execution to a point for which information has been stored by the system itself due to its normal functionality (without extra reliability-related storage overhead). In the degenerate case, a system can be restarted (e.g., the OS can reboot in the presence of an error). For a long time, such retry mechanisms have been integrated in correcting transient disk and memory read errors (Siewiorek and Swarz 1982). However, in recent works, existing system features are exploited, allowing reexecution without additional storage actions. A task can be reexecuted without requiring extra storage either because there is another task running in parallel that keeps the state updated but potentially with a delay (parallel execution) or because the task is started from the beginning (sequential execution), so intermediate state information is not needed. Figure 17 shows the corresponding subtree.
Parallel Execution 10 .
This category assumes that additional HW resources are available. When HW resources are explicitly added for the implementation, the technique is a hybrid with the (additional HW modules provision/same functionality) category. The so-called primary/backup technique assumes that two tasks are executing in parallel, with one of them lagging behind the other. However, only the primary processor handles communication with external devices. The backup will take up this role once the primary processor fails. As the execution between the copies should be deterministic, extra care has to be taken for the handling of nondeterministic events. Typically, these are handled by the primary module, and the result is passed afterward to the backup module. Implementation of the primary/backup technique can take place at several layers of the SW stack. Pros include the high error protection at minimum storage and rather general applicability. Cons include the power, performance overhead, latency, and blockage of resources. A literature example in this category is Bressoud and Schneider (1996). The replicas are implemented at the virtual machine level and are running on different physical processors. The primary task handles I/O communication, and the backup task lags a few instructions behind the primary. After a specific number of instructions have been executed, called epoch (Figure 18), the hypervisor communicates the interrupts that the primary received and accompanying data to the backup.
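The epoch-based primary/backup idea can be illustrated with a heavily simplified sketch. It is not the hypervisor protocol of Bressoud and Schneider (1996): the state is a single integer, `step` is a hypothetical deterministic state-update function, interrupts are assumed to be applied at the end of each epoch, and a primary failure is modeled by the hypothetical parameter `primary_fails_in`, after which the backup takes over I/O.

```python
def run_primary_backup(epochs, step, primary_fails_in=None):
    """Replay epochs on a primary and a lagging backup replica.

    epochs: list of (instructions, interrupts) tuples, one per epoch
    step:   deterministic function (state, event) -> new state
    Returns (state of the replica that owns I/O, name of that replica).
    """
    primary = backup = 0
    io_owner = "primary"
    for index, (instructions, interrupts) in enumerate(epochs):
        if index == primary_fails_in:
            io_owner = "backup"  # failover: the backup is promoted
        # Interrupts are applied after the instructions of the epoch,
        # so both replicas observe them at the same point.
        events = list(instructions) + list(interrupts)
        if io_owner == "primary":
            for event in events:
                primary = step(primary, event)
        # The backup replays the epoch once the interrupt log is available.
        for event in events:
            backup = step(backup, event)
    return (primary if io_owner == "primary" else backup), io_owner


def mix(state, event):
    return (state * 31 + event) % 1_000_003


epochs = [([1, 2], [10]), ([3, 4], []), ([5], [20])]
fault_free = run_primary_backup(epochs, mix)
with_failover = run_primary_backup(epochs, mix, primary_fails_in=1)
```

Because both replicas apply the same events in the same order, the backup reaches the same state as a fault-free primary, which is why determinism between the copies is emphasized above.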
Sequential Execution .
This group includes techniques that restart the execution of the task from the beginning. As is typical in the current framework, tasks at different granularities can exist, such as instruction, thread, and process. The focus here, however, is on the notion of a task as used in a broad category of techniques covered in this section, called fault-tolerant scheduling techniques. These techniques are primarily employed for real-time systems. Such a system processes a set T of tasks. A task can be a unit of work, such as a granule of computation, a unit of data transmission or a file transfer (Liu et al. 1994), or in general a thread or a process that has to be ready by a certain deadline. Each task has a release time (and a deadline). If the execution finishes before the deadline for all of the tasks, then the application is successfully executed. Resilience is achieved by reexecuting a given task when it fails. The goal of fault-tolerant scheduling is to guarantee that the total execution time of the task set, including possible delays from reexecution due to (transient or permanent) faults, meets the system (hard) deadline. If this is not possible, the task set is rejected.
The tasks may be periodic (with equal or varying periods), aperiodic, or sporadic (Sprunt et al. 1989). Periodic tasks are released every Pi seconds (task period), whereas the release time in aperiodic tasks varies. Sporadic tasks are aperiodic tasks that have hard deadlines. Each release of a task is called an instance. Sporadic tasks may have an upper limit on their rate of arrival. For sporadic tasks, the system has to create the schedule as the tasks arrive and not offline.
The task can be reexecuted either on a single processor so that transient errors are removed or on a different processor so that permanent errors are avoided. Therefore, this category is further split into intramodule and intermodule techniques. In intramodule techniques, the literature does not generally address tasks that are amenable to nondeterministic events. In intermodule techniques, this issue is partially addressed, especially concerning the shared memory accesses.
Intramodule 11. To deal with transient faults, a task is reexecuted from the start on the same (single) processor, either as the same task or as a different (e.g., lighter) version of the task. To obtain a fault-tolerant schedule, extra slack is inserted in the schedule, enough to allow the task reexecution upon fault occurrence. This extra slack is also referred to, symbolically, as a backup task. Pros include the limited storage, medium power overhead, rather general applicability, and potentially high error protection (only transient errors). Cons include the performance overhead, latency, and limitation to transient errors. Literature examples include Reis et al. (2007) and Rehman et al. (2013) on executing additional instructions, Pandya and Malek (1998) and Ghosh et al. (1998) on fault-tolerant scheduling of periodic tasks, and Liberato et al. (2000) on aperiodic tasks.
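The role of the inserted slack can be made concrete with a deliberately simple acceptance test. The sketch below is an illustration under strong assumptions that are not taken from the cited works: a single processor, tasks executed back to back, one common deadline, full reexecution of a failed task, and at most `max_faults` transient faults. In the worst case every fault hits the longest task at the end of its execution, so the reserved slack is `max_faults` times the largest worst-case execution time (WCET).

```python
def accept_task_set(wcets, deadline, max_faults=1):
    """Return True if the task set meets the deadline even when up to
    `max_faults` transient faults each force a full task reexecution.

    wcets:    worst-case execution times of the tasks (same time unit)
    deadline: common deadline of the whole task set
    """
    if not wcets:
        return True
    slack = max_faults * max(wcets)   # the symbolic "backup task"
    return sum(wcets) + slack <= deadline


accepted = accept_task_set([2, 3, 4], deadline=13)
rejected = accept_task_set([2, 3, 4], deadline=12)
```

A task set that fails this test would be rejected, in line with the acceptance rule stated for fault-tolerant scheduling; practical schemes use far more refined analyses for periodic and aperiodic tasks.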
Intermodule 12. The following group of techniques is applied on a multiprocessor system and addresses not only transient but also permanent errors. A different processor is used to reassign the task execution. Typically, the alternate processors have private memory. Each task has a backup copy. In general, the backup tasks may be passive or active. Active tasks are scheduled and execute regardless of whether an error occurs. The passive copies of the tasks are assigned and scheduled on different processors and will execute only if an error occurs. The copies may be preloaded in the memory of each processor (before execution time). Alternatively, only the required copies are loaded at runtime in a demand-driven way. The tasks may be periodic (with equal or varying periods) or sporadic. In these techniques, two policies need to be derived: one for the task allocation to each processor and one for the scheduling of the tasks within each processor. Pros and cons are similar to the previous category. However, in this case, permanent errors can potentially also be addressed at the expense of resource blockage and extra synchronization. Moreover, in task scheduling schemes, active tasks cause more latency and power overhead but are simpler to schedule. Literature examples include Krishna and Shin (1986) and Bertossi et al. (1999) on periodic tasks and Ghosh et al. (1994) on aperiodic tasks. Figure 19 shows an example of scheduling four primary tasks (P1, P2, P3, P4) and their backups (B1, B2, B3, B4) on three processors, as indicated in Ghosh et al. (1994). When the original tasks complete execution, their backups are deallocated and the space can be used for scheduling other tasks.
Backward Execution: Retry With State Storage
The other group of backward techniques includes the techniques that retry the execution by storing the state of the system at intermediate points. The concept of proactively storing some part of the system state (and potentially additional information) to be able to restore the state and reexecute in the presence of errors was presented in Section 3.4.2. Here, checkpointing/rollback implementations at the mapping layer and the lower layers of the SW stack are discussed. These can be differentiated between techniques that operate within a single HW module (intramodule) and techniques that operate across HW modules (intermodule). The proposed subtree is shown in Figure 20.
Different checkpointing schemes can be implemented for systems that do not have to deal with nondeterministic events and those that do. In applications of a deterministic nature, more design time knowledge can be exploited since the execution is more predictable due to the lack of nondeterministic events. For example, design-time knowledge of the control and dataflow graph (CDFG) can be exploited for optimizing the placement of the checkpoints. Rather than saving checkpoints at fixed intervals, checkpoints can be stored in a customized way so that the amount of stored data is minimized.
Using the CDFG to decide on the placement of checkpoints is not practical for nondeterministic applications, because the execution flow is decided at runtime in a nondeterministic way.
For such applications, typically extra effort has to be invested to save relevant information. For example, the interprocess dependencies are often recorded so that the execution can be accurately repeated during the recovery phase.
Intramodule 13. The techniques can be further characterized according to when the checkpoint placement (location and/or frequency) is decided. If the decision is made at design time, the techniques belong to the offline category. Techniques that are applied during behavioral system synthesis, incorporate the checkpoint placement at compilation, or are even combined with a design-time scheduling algorithm are offline techniques. When the decision is made at runtime, during program execution, they belong to the online category. These techniques are typically combined with a scheduling algorithm.
Fig. 21. The insertion of a checkpoint in the CDFG turns the number of required HW registers to four instead of two at control step 3 (adapted from Blough et al. (1997)).
The techniques at the behavioral system synthesis level typically use the application CDFG and identify the optimal number and locations of checkpoints under some optimization constraints. Such optimization constraints may include the amount of expected rollbacks, the additional execution time due to the rollback, and the additional HW resources, such as registers needed to store the lifetime-extended variables (Blough et al. 1997 ). An example that illustrates the usage of extra registers can be seen in Figure 21 . Without the recovery facility, the HW registers used for storing intermediate variables c, d can be reused for variables e, f. Therefore, at control step 3, only two registers are needed for the storage of the variables e, f. However, when a recovery point is added at control step 2, two additional registers are required for storing temporary variables c, d.
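The register count in this example can be reproduced with a small lifetime calculation. The sketch is an assumption-laden illustration and not the synthesis algorithm of Blough et al. (1997): variable lifetimes are given as inclusive control-step intervals, and every variable that is live at a recovery point is assumed to be kept until the step before the next recovery point (or until the last step). The function name and the lifetime encoding are hypothetical.

```python
def registers_per_step(lifetimes, n_steps, checkpoints=()):
    """Count the registers needed at each control step 1..n_steps.

    lifetimes:   dict variable -> (first_step, last_step), inclusive
    checkpoints: control steps at which a recovery point is inserted
    """
    extended = dict(lifetimes)
    bounds = sorted(checkpoints) + [n_steps + 1]
    for checkpoint, next_bound in zip(bounds, bounds[1:]):
        for name, (first, last) in lifetimes.items():
            if first <= checkpoint <= last:
                # Keep the variable alive so the state can be restored.
                extended[name] = (first, max(last, next_bound - 1))
    return [
        sum(first <= step <= last for first, last in extended.values())
        for step in range(1, n_steps + 1)
    ]


# c, d are live at control step 2; e, f are live at control step 3.
lifetimes = {"c": (2, 2), "d": (2, 2), "e": (3, 3), "f": (3, 3)}
without_recovery = registers_per_step(lifetimes, n_steps=3)
with_recovery = registers_per_step(lifetimes, n_steps=3, checkpoints=(2,))
```

With these lifetimes, two registers suffice at control step 3 without the recovery point, and four are needed once c and d must survive past control step 2.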
Techniques at the compiler level employ the compiler to identify the optimal checkpoint locations. For example, the compiler can identify variables that are dead. These variables do not need to be included in the checkpoint. For nondeterministic applications, such compiler results can be used only as assistance/guidance, since the actual decisions on the checkpointing have to be made at runtime due to the need to handle nondeterministic events.
Beyond the earlier discussed types of systems, intramodule schemes may address applications that are amenable to numerous nondeterministic events: uncertain functions (e.g., human input functions), interrupts, system calls, and I/O operations due to communication with external devices. Schemes that incorporate the impact of such events in their rollback techniques store information beyond the state of the participating processes, such as interactions with external devices. Such additional pieces of information are typically called logs. Sometimes logs can be sufficient for the accurate reexecution of the process without storing the process state. In the literature, schemes have been developed that are tailored to address one specific type of nondeterministic event or more of them together. Here, a few examples are provided and the reader is referred to specialized surveys on the topic. Such concepts are elaborated in Elnozahy et al. (2002), Sancho et al. (2005), Chen et al. (2015), and Egwutuoha et al. (2013). These surveys address both single-threaded and multithreaded/multiprocess applications.
Several techniques have been developed that do not explicitly bring the handling of nondeterministic events to the forefront. Some of them do not take care of them at all, and some of them address them partially. They focus on performing checkpointing in a way that is, to a large extent, transparent to the user. These techniques can be differentiated depending on the abstraction level.
Typically, they are implemented either at the kernel level or the user level (see Section 2.2 and supplementary material). For kernel-level implementations, either the OS source code is available for modification or the user installs developed packages. However, the packages are only available for specific OSs. This implies that OS updates will require modifications of the packages. However, compared to user-level schemes, no extra compilation or linking has to be performed by the user and the application is fully unaware of the checkpointing. Checkpointing at the user level utilizes runtime libraries that are linked to the application program. Minimal changes in the source code may be required. Schemes implemented at the user level improve portability and offer the possibility to the user to identify program points at which state data are essential for restarting the execution.
Pros include the high error protection and general applicability. Cons include the potentially high storage and power overhead, the potentially very high latency, and the performance overhead (depending also on whether checkpointing is overlapped with normal execution). These costs vary depending on the exact implementation and selected granularity. Literature examples include Chandy and Ramamoorthy (1972) and Orailoglu and Karri (1994) on CDFG checkpoint placement, and Ramkumar and Strumpen (1997) on employing the compiler for the checkpointing. Examples that address nondeterministic applications include Hendriks (2002) and Duell (2005) at the kernel level, and Plank et al. (1994) and Slye and Elnozahy (1996) at the user level.
Intermodule 14 .
Although multithreaded applications are also amenable to nondeterministic events that go beyond the process and thread concurrency issues, the focus of the related literature is on handling these process/thread dependencies so that rollback takes place effectively. Nevertheless, system-specific strategies have been developed that deal with events coming from the external environment, especially events due to communication with external devices.
Online multiprocessor checkpointing can be broadly characterized as local and global. Local schemes require that only a single process or a subset of the processes that have interacted save independently a checkpoint and no global coordination takes place. This means that the rest of the processes do not have to perform any action at that moment. Typically, in such schemes, when a process fails, all processors have to coordinate to create a consistent system-wide state. The interthread data dependencies of the processors that have communicated have to be recorded so that the interthread communication is correctly reconstructed at recovery time. Compared to global schemes, local schemes reduce the amount of data to be stored during checkpointing but typically require a more complicated recovery algorithm. Moreover, local schemes can potentially cause one process to roll back after the other until the system returns to a consistent state, potentially even back to the beginning. This type of rollback propagation is also called the domino effect. Global schemes, in contrast, require that all processes take actions to take a single, global, system-wide checkpoint at distinct times. One disadvantage is that they become less scalable as the number of processes increases. Pros and cons are similar to the previous category with the extra overhead required to handle interprocess and interthread dependencies.
As in the corresponding HW case, local schemes require less storage during checkpointing but typically need a more complicated recovery compared to global schemes. Literature examples include and Elnozahy (1994) on local schemes, and Duell (2005) and Batchu et al. (2004) on global schemes S .
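The domino effect of local schemes can be made concrete with a small Python sketch. It is a simplified model under stated assumptions, not an algorithm from the cited works: time stamps are global, every process has an initial checkpoint at time 0, and only orphan messages (received in a kept state but sent in an undone one) force further rollback; in-transit messages are ignored. The function name and data layout are hypothetical.

```python
from bisect import bisect_left

def recovery_line(checkpoints, messages, failed):
    """Compute how far each process must roll back after `failed` crashes.

    checkpoints: dict process -> sorted list of checkpoint times (starts with 0).
    messages:    list of (sender, send_time, receiver, recv_time).
    Returns dict process -> restored checkpoint time, or None if the
    process does not have to roll back.
    """
    keep = {p: float("inf") for p in checkpoints}   # inf means "no rollback"
    keep[failed] = checkpoints[failed][-1]          # latest own checkpoint
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            # Orphan message: its send was undone but its receipt is still kept.
            if t_send > keep[sender] and t_recv <= keep[receiver]:
                times = checkpoints[receiver]
                # Latest checkpoint taken strictly before the receipt.
                keep[receiver] = times[bisect_left(times, t_recv) - 1]
                changed = True
    return {p: (None if t == float("inf") else t) for p, t in keep.items()}

msgs = [("A", 1, "B", 2), ("B", 4, "A", 5), ("A", 7, "B", 8), ("B", 10, "A", 11)]

# Local scheme: uncoordinated checkpoints interleaved with the messages.
domino = recovery_line({"A": [0, 6, 12], "B": [0, 3, 9]}, msgs, failed="B")

# Global scheme: both processes checkpoint together at time 9.
aligned = recovery_line({"A": [0, 9], "B": [0, 9]}, msgs, failed="B")
```

With the uncoordinated checkpoints, the failure of B propagates back and forth until both processes restart from time 0, whereas with the coordinated checkpoint both processes restore the state saved at time 9.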
Overall Mapping and Platform SW Classification
By combining the subtrees of the previous sections, the overall mapping and platform SW classification tree is built, as shown in Figure 22. Starting from the top-level split of Figure 13, the intermediate nodes (colored with pale yellow) are followed when necessary to reach the final classes (colored with darker yellow and numbered).
USAGE OF THE CLASSIFICATION FRAMEWORK
Identifying the primitive components (corresponding to a primitive category) and their position in the framework first allows handling of the complexity of the sometimes highly sophisticated mitigation schemes. A "divide and conquer" view of the publication enables the reader to delve into the most relevant implementation details (when that is necessary) in a much more controlled way.
Mapping of Hybrid Schemes
In reality, the resiliency and mitigation approaches found in research works rarely belong to a single leaf of the previous, or indeed any, classification. The majority of the published work consists of hybrid combinations of the leaves. Positioning these works in the classification framework presented earlier allows one to obtain a better understanding of the techniques. In this section, the previous classification is applied to selected works, ranging from older to more recent literature, to illustrate how works can be classified by using the appropriate combination of leaves. For the mapping, a capital letter indicates the major group to which the technique belongs (H for platform HW and S for mapping and platform SW), followed by the number of the leaf from the corresponding figure (Figure 12 for platform HW, and Figure 22 for mapping and platform SW). The selection of these works is based on several criteria, the main ones being that they include hybrid schemes, that they cover a broad range, and that combinations from many different leaves are present to provide a better illustration.
First, a cross-layer co-exploration approach for reliability/energy optimization for video applications over wireless channels is presented in Khajeh et al. (2012) .
Platform HW. Reusing the Viterbi (by increasing the trace back depth) and Turbo (by increasing the number of iterations) channel decoders is discussed (because they are primarily used by WCDMA systems) as a complementary approach to reinforce the correcting capabilities (H.5).
Aggressive voltage scaling (AVS) is applied on sections of the WCDMA modem. The authors focus on the embedded memory organization for which a specific analysis of the errors is performed to characterize the read and write failures statistically, resulting in an analytic model. Based on this model, power savings can be maximized while maintaining a macro-level quality metric such as the peak signal to noise ratio (PSNR) of the image sequence (H.8).
Mapping and platform SW. The difference in the significance of groups of bits is used to perform selective protection by proper encoding at the middleware layer (S.3). In particular, UDP-Lite packetization is (re)designed in such a way that bits that are very sensitive to errors are better protected.
Second, the ERSA architecture (Cho et al. 2012) is a multicore architecture with the characteristic that some are super reliable cores (SRCs), whereas the majority are relaxed reliability cores (RRCs). The SRCs execute operations that are less resilient to errors.
Platform HW. The RRCs can be restarted by using a watchdog timer (H.13).
Mapping and platform SW. A runtime scheduler reassigns a task that has failed on a particular RRC to another RRC (S.8). Moreover, the SRC can terminate the execution of an RRC task and reboot the RRC (S.12). Computations are discarded when excessively large fluctuations are present (S.7). This requires knowledge of the application's algorithm, so it is a hybrid itself.
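How these three primitives interact can be sketched in a few lines of Python. This is only an illustrative model, not the ERSA implementation: the function `supervise`, the use of `TimeoutError` as a stand-in for an expired watchdog timer, and the `plausible` check are all assumptions introduced here.

```python
def supervise(task, relaxed_cores, plausible):
    """Supervisor loop, as a reliable core might run it (hypothetical model).

    relaxed_cores: list of (name, execute) pairs; execute(task) returns a
                   result or raises TimeoutError (watchdog expiry stand-in).
    plausible:     application-specific check for excessive fluctuations.
    Returns (result_or_None, log).
    """
    log = []
    for name, execute in relaxed_cores:
        try:
            result = execute(task)
        except TimeoutError:
            # Core is restarted and the task is reassigned to the next core.
            log.append((name, "restarted"))
            continue
        if not plausible(result):
            # The computation is dropped rather than repeated.
            log.append((name, "discarded"))
            return None, log
        log.append((name, "accepted"))
        return result, log
    log.append((None, "no core left"))
    return None, log

def hung_core(task):
    raise TimeoutError("watchdog expired")

def healthy_core(task):
    return 2 * task

cores = [("rrc0", hung_core), ("rrc1", healthy_core)]
accepted = supervise(21, cores, plausible=lambda r: abs(r) < 1000)
dropped = supervise(21, cores, plausible=lambda r: abs(r) < 10)
```

The plausibility bound has to come from knowledge of the application, which is why the discard step is itself a hybrid of platform SW and algorithmic knowledge.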
Third, the JPL-STAR was a fault-tolerant computer designed to cope with many kinds of HW faults (Avizienis et al. 1971) .
Platform HW. TMR with voting is applied for the test and repair processor (TARP). This processor monitors the operation of the main computer and implements the recovery (H.1). One or more unpowered spares are provided for each functional unit and the TARP itself (H.2). All machine words (data and instructions) are encoded using error-detecting codes (H.3).
Mapping and platform SW. Checkpointing and rollback of programs to a previous state is also used when an error is detected (S.13). The system also uses a restart (named cold start) procedure in the TARP and the resident executive (S.11).
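As a reminder of what TMR with voting computes, the following Python sketch shows a bitwise majority voter over three replica outputs. The JPL-STAR voter is a HW structure; this SW model, including the function names and the injected bit flip, is purely an assumption for illustration.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three integer outputs: a bit is set if at least two agree."""
    return (a & b) | (a & c) | (b & c)

def tmr_execute(replicas, x):
    """Run three replicas on the same input and vote on the result.

    Returns (voted_output, indices_of_replicas_that_disagree_with_the_vote).
    """
    outputs = [replica(x) for replica in replicas]
    voted = tmr_vote(*outputs)
    faulty = [i for i, out in enumerate(outputs) if out != voted]
    return voted, faulty

def good(x):
    return x + 1

def flipped(x):
    return (x + 1) ^ 0b100   # hypothetical single-bit upset in the result

voted, faulty = tmr_execute([good, flipped, good], 5)
```

The list of disagreeing replicas is the kind of information that a repair mechanism, such as switching in a spare, could act on.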
Comparison of Closely Related Schemes
In Figure 23 , a visual, color-coded illustration is provided for the scheme combinations that comprise each of the three discussed works.
Academic works at the same abstraction level. Hybrid mitigation mechanisms are increasingly common in recent research. The work by Li et al. (2013a) provides a hybrid error mitigation mechanism, called DHASER, that exploits several mapping and platform SW approaches. In particular, this scheme adopts the following mitigation approaches:
- First, a task mapping approach (S.8) is employed by analyzing the impact of a single-event upset on the overall correctness of a running task, without correcting this error. The outcome is a generated parameter per task (based on the masking capability of the task). Thus, the tasks can be appropriately allocated to the PEs with the required level of resiliency for a reliable operation. For example, highly vulnerable tasks are executed on cores that are more robust and less vulnerable ones on cheaper and more power-efficient cores without recovery functionalities.
- Second, a mechanism is applied to select the appropriate error protection mechanism for each PE. The choices are between reexecuting instructions inserted by the compiler (S.11) or a hybrid SW/HW checkpointing scheme (H.15, S.13). This selection is based on the task expected to be mapped by the previous technique and the characteristics of the targeted PE.
This classification also allows one to make a clear distinction with the platform SW part of the ERSA approach (Cho et al. 2012) discussed earlier (see the first two columns of Table 1). That approach is intended to deal with the same problem formulation, namely reliable runtime mapping of tasks on a heterogeneous multicore platform. ERSA formulates this as distributing tasks to SRCs and RRCs. It combines this also with algorithmic layer resilience, but this part will not be discussed here. To achieve this reliable task mapping, both approaches use partly similar and partly different techniques, and the systematic classification provides effective insight into this. The (S.8) techniques of both ERSA and DHASER are used in a largely similar way, but ERSA uses an instance of the (S.7, S.12, H.13) option on top of this, whereas DHASER uses instances of the (S.11, S.13, H.15) leaves in combination with it (see earlier). To illustrate this point further, in the third column, dTune has been added, which performs reliable task mapping (S.8) by combining reliable code versions (S.9) with HW-based redundant multithreading (H.9). Finally, in Table 2, two schemes under the so-called dark silicon constraint, ASER and Hayat (Gnad et al. 2015), are briefly compared without providing a detailed explanation. In the supplementary material, a comparison of two industrial RAS schemes is elaborated from Intel (2011) and IBM (Mitchell et al. 2009) S .
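The comparison above amounts to set operations on leaf labels, which the following Python sketch makes explicit. The leaf sets are those stated in the text for DHASER, ERSA, and dTune; the helper names `by_group` and `compare` are introduced here only for illustration.

```python
# Leaf labels as listed in the text (H = platform HW, S = mapping and platform SW).
SCHEMES = {
    "DHASER": {"S.8", "S.11", "S.13", "H.15"},
    "ERSA":   {"S.8", "S.7", "S.12", "H.13"},
    "dTune":  {"S.8", "S.9", "H.9"},
}

def by_group(leaves):
    """Split leaf labels into the two major groups, ordered by leaf number."""
    groups = {"H": [], "S": []}
    for leaf in sorted(leaves, key=lambda label: int(label.split(".")[1])):
        groups[leaf[0]].append(leaf)
    return groups

def compare(a, b):
    """Return (shared leaves, leaves only in a, leaves only in b)."""
    shared = SCHEMES[a] & SCHEMES[b]
    return shared, SCHEMES[a] - shared, SCHEMES[b] - shared

common_to_all = set.intersection(*SCHEMES.values())
ersa_vs_dhaser = compare("ERSA", "DHASER")
ersa_groups = by_group(SCHEMES["ERSA"])
```

Here `common_to_all` contains only S.8, the reliable task mapping leaf, while the remaining leaves of each scheme show where the approaches diverge.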
DISCUSSION AND FUTURE CHALLENGES
The proposed classification was illustrated through a representative list of schemes to better absorb the related ideas and support the validity of the tree. Although the focus of the current work is not to present a comprehensive list of state-of-the-art schemes, some observations become clear regarding the evolution of techniques.
Observations derived from the proposed framework and literature. The literature on fault tolerance and resilience techniques has evolved in accordance with the trends in computer architecture and (platform) SW design development. Due to the power density issues that came along with technology scaling, there was a shift toward multicore designs and in general parallel processing. The inherent regularity and abundance of such designs offered opportunities for fault-tolerant techniques. The SW had to evolve as well to make use of these new designs (especially through multithreading). Another evolution has to do with adding more custom PEs (like GPUs) on the platform, together with general-purpose components, to accelerate part of the functionality. Fault-tolerant techniques have also exploited more ad hoc features of certain platforms, such as features that enable out-of-order execution on superscalar platforms.
Commonalities and differences between HW- and SW-based schemes can be observed. Traditional fault tolerance was based on a few basic types of techniques: HW replication, reexecution starting from a previously saved state (typically at the OS level), and HW-based error coding schemes. Although these techniques have not been abandoned, over time there has been a clear boost in platform SW and mapping approaches. This does not reduce complexity for the designer but allows the cost to be reduced. Employing mapping and platform SW techniques even creates new flavors of the aforementioned techniques, still resulting in a large set of instantiated techniques. For example, TMR can be implemented at the MPI library (Fiala et al. 2012), error coding can be implemented through an SW task (Shirvani et al. 2000), and checkpointing can be implemented at almost all layers of the SW stack (see Section 4.4). But this is not the only commonality. Common concepts can be identified behind the techniques both in the forward category (with the exception of the operating conditions control for the SW) and the backward category (with the exception of having the number of tasks fixed in SW since there is always some repetition of execution). Nevertheless, differences also exist, the most prominent being that mapping and SW provide a lot of flexibility due to the remapping possibilities of a given task sequence onto the "fixed" HW. This leads to several techniques that are not possible in the HW-based approaches: rearranging the instruction profile, fault-tolerant task scheduling, and fault-tolerant mapping on multi/many-core architectures S .
Trends and new directions. Applications themselves have been evolving. Beyond the advance in parallelizable applications, enabled by the architecture changes, there has been an explosion in the types of embedded applications, covering many different aspects of daily life. Fault-tolerant design has reached even lifestyle application areas such as multimedia (Andreopoulos 2013). Networked applications further expanded the deliverable functionality possibilities. There is a clear tendency to exploit more of the application knowledge to minimize the cost, especially when error tolerance is possible, such as in approximate computing (Sampson et al. 2015). Other examples of emerging error-tolerant application domains are recognition, mining, and synthesis (RMS) (Dubey 2005), as well as artificial neural networks (ANNs) (Temam 2012). It is important to note that as applications and systems continue to evolve, new combinations of requirements that have to be satisfied are created. For example, enterprise distributed real-time and embedded (DRE) systems combine, among others, the requirement for high availability with real-time response, resource constraints, and certain quality of service (QoS) requirements, such as low latency (Tambe 2010; He and Da Xu 2014).
Adaptivity is another notable characteristic of more recently developed techniques. Techniques incorporate elements that allow them to be differentiated during runtime depending on the changing conditions. Knobs that allow fine-grain control enable cost reduction by satisfying the minimum necessary requirements. This is strongly enabled by the evolution and availability of monitor and sensor systems (Chandra 2014) . The system behavior can be adapted at runtime whenever significant environmental changes take place or according to varying error rates.
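A runtime knob of this kind can be sketched as follows in Python. The example retunes a checkpoint interval from a monitored error count using Young's well-known first-order approximation, interval = sqrt(2 x checkpoint cost x MTBF). This formula is brought in here only as an illustration and is not prescribed by the surveyed works; it is reasonable only when the interval is much shorter than the MTBF. The class and parameter names are hypothetical.

```python
import math

def checkpoint_interval(checkpoint_cost_s, error_rate_per_s):
    """Young's first-order approximation of the checkpoint interval, in seconds."""
    if error_rate_per_s <= 0:
        return math.inf                    # no errors observed: no checkpoint needed
    mtbf_s = 1.0 / error_rate_per_s        # mean time between failures
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

class IntervalKnob:
    """Runtime knob that retunes the interval from monitored error counts."""

    def __init__(self, checkpoint_cost_s):
        self.cost_s = checkpoint_cost_s
        self.errors = 0
        self.elapsed_s = 0.0

    def observe(self, errors, seconds):
        """Feed in a monitor reading: errors seen over an observation window."""
        self.errors += errors
        self.elapsed_s += seconds

    def interval(self):
        rate = self.errors / self.elapsed_s if self.elapsed_s else 0.0
        return checkpoint_interval(self.cost_s, rate)

knob = IntervalKnob(checkpoint_cost_s=2.0)
knob.observe(errors=1, seconds=100.0)
calm = knob.interval()                     # about 20 s at 0.01 errors/s
knob.observe(errors=3, seconds=100.0)
noisy = knob.interval()                    # shorter once the error rate doubles
```

As the observed error rate rises, the knob shortens the interval, so that checkpointing cost is paid only to the extent that the current conditions require.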
As indicated by several authors over the past years (DeHon et al. 2010; Reddi et al. 2012), synergistic reliability approaches that combine several techniques across the same layer or cross layer can lead to near-optimal solutions. This is especially so as errors can be masked as they propagate through the different HW and SW layers (including the application itself). Therefore, by properly propagating information among the different layers and providing a suitable degree of adaptivity (with design-time and runtime knobs), the most cost-effective solutions can be achieved. This has brought on challenges as well, as knowledge and expertise from different domains have to be combined.
Further technology trends like 3D integration, incorporating heterogeneous technologies on a single platform, and dark silicon pose new challenges and opportunities for fault-tolerance techniques. In 3D designs, the outer dies offer a shield against radiation particles. Thus, for example, part of the cache can be mapped in the inner dies without ECC protection to reduce energy and latency (Sun et al. 2011). discuss the challenges coming from the dark silicon era. For a given (maximum) thermal design power (TDP), not all transistors can be simultaneously powered on at full performance if the chip is to operate below the thermally safe temperature. For example, for a particular TDP mode (mode 5), the authors observed some natural trade-offs between transient fault rates and lifetime reliability (through aging). TDP mode 5 corresponds to a mode where cores are operated at near-threshold voltages. In this mode, 3× to 30× higher soft error rates have been observed. However, since the cores can be operated at reduced temperatures in this mode, they are exposed to reduced aging. This information can be provided to the runtime manager of the system, which can then make appropriate dynamic decisions depending on the system quality targets.
Moreover, as many systems today employ commodity-off-the-shelf (COTS) HW and/or SW components, the case of building a system's HW and SW from the ground up becomes rare. This leads to partial blackbox-based design, and the resulting lack of internal design knowledge poses a further challenge for deriving appropriate reliability-driven approaches.
The aforementioned challenges require designers to come up with innovative solutions to ensure reliable digital system operation. This makes a global view on the domain of reliability improvement techniques all the more necessary. This survey is a contribution in that direction.
RELATED WORK ON CLASSIFICATION SCHEMES
Over the past decades, both academia and industry have invested effort to describe fault-tolerance and mitigation techniques in a structured way. Here, some representative examples are discussed that are complementary to this work.
In earlier decades, works such as Randell et al. (1978) and Siewiorek and Swarz (1982) have described the principles of reliable system design, including terminology, metrics, and models. In addition, they have explained in detail the resilience features of highly reliable and highly available systems, such as commercial computers, spacecraft, and avionics systems. What is most relevant to the current work is that these works include a categorization of the broad literature at that time in the domain of reliability mitigation, including (micro)architectural and SW layers.
Rivers et al. (2011) provide an overview of the current state-of-the-art practices for error tolerance in server-class microprocessors. A basic discussion is performed regarding the abstraction layers and the different forms of redundancy (information, space, time), based on which error tolerance is achieved. The aim of the work is to give a review of current schemes and discuss approaches of promise for the future, so it does not present an elaborate classification scheme.
Gizopoulos et al. (2011) present a taxonomy of error recovery and repair techniques for multicore processor architectures. In this work, a basic split of mitigation techniques is made between error recovery and error repair techniques. Error recovery is further split into forward error recovery (FER), which includes redundancy, such as TMR, and backward error recovery (BER), which includes rolling back to a previously saved correct state of the system. They consider that error repair techniques basically include reconfiguration and graceful performance degradation.
Abdallah et al. (2012) present a survey on designs of stochastic HW with relaxed guard bands to achieve reliable operation and satisfactory performance. Such designs address applications that are inherently error tolerant.
Mittal and Vetter (2015) present a survey of techniques that have improved resilience as well as reliability metrics over the past 10 to 15 years. They use some rough categories for their presentation (redundancy based, compiler based, etc.), and they classify the presented schemes based on a few criteria, such as the processor component that is addressed, key approach/feature of the technique, and evaluation platform. The survey in Saha (2006) lists several SW-level techniques to counteract HW-induced errors but also SW bugs.
Many of the aforementioned survey works follow a different approach. They present selected works in detail or select techniques addressing specific types of systems or application domains. The proposed approach is closer to the one followed in Siewiorek and Swarz (1982) , with a similar motivation: organizing the categories by the type of techniques that are applied allows the universality of techniques to be manifested. This gives more opportunities to the designer to identify a technique that is potentially fitting for his problem and to customize it according to the specific features of a given system and application. To this end, the presentation of pros and cons for the classes can more actively assist. For example, consider that the designer needs a generally applicable solution for a medical application in which a strong requirement for reliability is posed. Based on the property of high error protection and general applicability, he would have to select more costly techniques, such as TMR or checkpointing/rollback. It is important to note that, in general, detection costs should also be taken into account when considering a mitigation scheme. In many cases, detection is strongly correlated with the mitigation, and the combined outcome
