State-of-the-art reliability techniques and mechanisms deploy full-scale redundancy, such as double or triple modular redundancy (DMR, TMR), at different layers of the computing stack to detect and/or correct such transient faults. However, techniques relying on full-scale redundancy incur significant area, performance, and/or power overheads, which might not always be feasible or practical due to system constraints such as deadlines and the available power budget for the full chip (or a processor core). In this work, we propose a novel design methodology to generate and explore the architectural-space of heterogeneous reliability modes for out-of-order superscalar multi-core processors. These heterogeneous modes enable varying reliability and power/area trade-offs, from which an optimal configuration can be chosen at run time to meet the reliability requirements of a given system while reducing the corresponding power overheads (or to solve the inverse problem, i.e., maximizing the reliability under a given power constraint). Our experimental results show that a Pareto-optimal heterogeneous reliability mode reduces the core vulnerability by 87%, on average, across multiple application workloads, with area and power overheads of 10% and 43%, respectively. To further enhance the design space of heterogeneous reliability modes, we investigate the effectiveness of combining different processor state compression techniques, namely Distributed Multi-Threaded Checkpointing (DMTCP), Hash-Based Incremental Checkpointing (HBICT), and GNU zip, such that the correct processor state can be recovered once a fault is detected. Using a unique combination of these state compression techniques, we reduce the checkpoint sizes by a factor of ∼6×.
I. INTRODUCTION
Aggressive transistor scaling has led to an increased susceptibility towards several reliability problems, such as soft errors, at the hardware layer [1]. Soft errors are transient faults in the hardware that cause bit-flips in the micro-architecture, which may propagate to the application output and corrupt its state, or may terminate the application's execution [2], [3]. The rate of occurrence of these soft errors is expected to increase with each new generation of microprocessors being released, due to aggressive shrinking of transistor feature sizes and imperfections in the fabrication process [4], [5] (see Section II).
Plenty of research works focusing on techniques like full-scale redundancy and checkpointing have been proposed for the prevention, detection, and/or mitigation of soft errors across the computing stack, i.e., at the hardware and software layers [6], [7]. Reliability at the hardware layer is ensured through redundancy of execution paths and/or hardening of pipeline components, i.e., full-scale Double or Triple Modular Redundancy (DMR, TMR). Software-layer techniques realize full-scale spatial/temporal redundancy by executing multiple redundant instructions or threads of an application, thereby ensuring a reliable output [8]-[10]. However, these full-scale redundancy techniques incur significant performance and energy overheads (in the case of temporal redundancy) and area/power/energy overheads (in the case of spatial redundancy).
Therefore, we propose to investigate the individual properties and requirements of an application workload to determine the component-level vulnerabilities of an out-of-order superscalar processor at design time, i.e., enabling reliability provisions at a much finer granularity. Based on this analysis, we develop a wide range of heterogeneous reliability and checkpointing modes that enable efficient control over the achieved reliability and the incurred overhead, especially when considering the diverse resilience properties of different applications executing at different run-time instances. Our previous work [11] provides an initial proof-of-concept and preliminary results for the feasibility of reliability-heterogeneous cores. In this work, we significantly extend this concept and provide a systematic methodology to integrate such reliability-heterogeneous modes on a chip, along with other types of reliability mechanisms like checkpointing and state compression, to expand the space of reliability vs. overhead design trade-offs.
In a nutshell, we make the following novel contributions:
(1) Component-Level Vulnerability Analysis: We leverage the Architectural Vulnerability Factor (AVF) metric to perform a comprehensive vulnerability analysis of the different pipeline components of single-core and quad-core out-of-order superscalar processors when executing diverse application workloads.
(2) A Methodology for Architectural-Space Generation and Exploration: We propose a novel methodology that (a) analyzes the architectural vulnerability of an out-of-order superscalar microprocessor; (b) generates a wide range of heterogeneous reliability modes, such that each mode deploys distinct reliability measures in different pipeline components; and (c) enables reliability-power trade-offs that can be used to optimize the applications' reliability under a given power constraint, or, vice versa, to minimize the power consumption under given reliability constraints.
(3) A Run-Time System: We evaluate the run-time benefits of our heterogeneous reliability modes by executing various application workload mixes on our heterogeneous multi-core processor. We propose and evaluate two task mapping heuristics, namely, Vulnerability-Constrained Power Minimization and Power-Constrained Vulnerability Minimization.
(4) Efficient State Compression: To further enhance processor reliability and enlarge the design space, we analyze and investigate combinations of state-of-the-art compression techniques to effectively reduce the storage requirements of checkpointing data.
(5) Evaluation & Discussion: We evaluate the effectiveness of our heterogeneous reliability modes under diverse application workloads using a version of the cycle-accurate simulator gem5 that we modified to offer the required functionality.
Fig. 1 illustrates an overview of our contributions in a design-flow for developing heterogeneous multi-core processors.
Paper Organization: Section II presents the preliminaries and background information required to understand our proposed contributions. We discuss the system models in Section III. Section IV presents our methodology for the generation and exploration of the architectural-space of heterogeneous reliability modes, including results that illustrate the benefits of the proposed approaches. Section V presents the related work on state-of-the-art reliability techniques and heterogeneous reliability approaches, followed by the conclusion in Section VI.
II. PRELIMINARIES AND BACKGROUND
A. SOFT ERRORS
In the era of nanometer technology nodes, reliability threats like manufacturing-induced process variation, device aging, and transient faults are increasingly challenging the functional correctness and safety-critical aspects of the systems in which these electronic devices are deployed [1]. The electrical disturbances that disrupt the normal operation of a circuit are called Single Event Effects (SEEs). An SEE can be caused by the passage of a single ion through a circuit node. These disturbances can be either destructive or non-destructive. An example of a non-destructive SEE is the Single-Event Upset (SEU). These upsets can be single-bit or multi-bit, depending upon various factors like the particle's energy, the transistor's dimensions and electrical properties, the operating scenario (e.g., the altitude at which the device is used), etc. SEUs manifest as transient faults (i.e., soft errors), which have emerged as a serious threat to the reliability of digital systems. Soft errors are generated at the hardware layer due to four key factors, namely: (1) Alpha Particles, which are positively charged composite particles emitted during radioactive decay; these particles travel through the semiconductor device, thereby disturbing the electron distribution of the transistor [12]. (2) Cosmic Rays, which are a flux of energetic neutrons that are constantly emitted by the solar system [13], [14]. (3) Thermal Neutrons, which are neutrons that have attained thermal equilibrium after dissipating all their kinetic energy [15]. (4) Internal Factors, such as random noise, signal integrity issues, crosstalk, and electromagnetic interference [3].
FIGURE 2. The three phases of the soft-error phenomenon (adapted from [2]).
Soft errors cause temporary bit-flips either in the control or data path of a micro-architecture, or in the on-chip memory cells. These bit-flips may propagate to the application output (incorrect output), or may cause the application to crash (incorrect instruction execution), hang (enter an unresponsive state), or terminate its execution [2]. Fig. 2 illustrates the soft-error phenomenon, which can be broken into three phases.
(1) First, in the ion-track formation phase (phase-I), a high-energy particle (such as the cosmic rays discussed earlier) strikes the transistor and generates multiple electron-hole pairs, which in turn increase the concentration of carriers along the ion's path. (2) In phase-II (current pulse generation), the charge collected at the depletion region forms a ''temporary'' channel that funnels current from source to drain, which can toggle the transistor state for tens of picoseconds. This can result in a bit-flip in (i) a memory cell, which may remain latched to the incorrect value unless it is overwritten by another value; or (ii) a logic gate, whose erroneous output can potentially propagate to the final output of the circuit, thereby corrupting it. (3) In the ion diffusion phase (phase-III), over tens to hundreds of picoseconds, the charges diffuse into the depletion layer, thereby disintegrating the temporary channel.
B. INCREASING SOFT ERROR RATES
In earlier-generation technology nodes, transistor dimensions were large enough that a temporary channel could not funnel the current from source to drain. With shrinking transistor dimensions, however, the rate of soft error occurrences is increasing with each new generation of processors released into the market, as they are fabricated using continuously smaller technology nodes [4], [5] (see Fig. 3). This is a major threat to the current world infrastructure, which heavily relies on electronics for all activities, such as work, communication, transportation, socializing, and the internet. Even the day-to-day devices and services that people use, e.g., wearable devices such as smart-watches and fitness trackers, mobile computing platforms such as mobile phones and laptops, and on-demand cloud services offered by large-scale data centers, heavily rely on the reliability of electronic devices. This becomes even more crucial for safety-critical application domains like aerospace, automotive, healthcare, Industry 4.0, smart grids, and smart homes.
C. PROCESSOR HARDENING
Reliability at the hardware layer is typically ensured by full-scale redundancy, which instantiates multiple copies of a hardware unit with the same set of inputs and compares their outputs with each other to detect errors (in DMR) or to correct them using a voter circuit (in TMR); we refer to this as hardware hardening. Besides these hardware redundancy measures, techniques like software-level redundancy, application checkpointing and rollback, and shadow latches (see Section V for an overview of the related work) can also be used to detect and mitigate soft errors. An overview of these hardware-level redundancy techniques is presented in Fig. 4. DMR and TMR incur significant area and power overheads caused by the redundant hardware units and the additional circuitry used to detect or correct errors. However, since the redundant hardware components execute in parallel, the throughput of the system is not affected, apart from a minimal gate-level increase in delay caused by the voter circuit. Typically, to ensure very high reliability, the entire processor pipeline is hardened (full-scale hardening), i.e., all pipeline components are instantiated thrice with the same set of inputs and a voter circuit elects the majority output, as illustrated by Gaisler's completely hardened LEON3-FT microprocessor, which deploys redundancy in the register file and cache memory [16]. Fig. 4 also illustrates the gate-level implementation of the voter circuit, and how, in the case of soft errors, the majority output is elected and generated as the final output. Note that this makes the voter circuit a potential single point of failure, which is mitigated by triplicating the voter circuit as well, as deployed, for example, in the Saturn Launch Vehicle Digital Computer [17], [18]. In this work, without loss of generality, we advocate enabling fine-grained reliability at the component level, which facilitates the instantiation of different hardening modes for different processor cores, thereby providing a wide range of reliability-power trade-offs. As a proof of concept, we showcase component-level TMR with a single majority voter circuit; however, any other reliability mechanism can be deployed as a knob at the component level.
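To make the voting logic concrete, the 2-out-of-3 majority function of Fig. 4 can be sketched in a few lines of Python. This is an illustrative software model only; the function name and word width are ours, not part of our hardware toolchain.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three redundant outputs.

    Each argument is the output word of one hardened replica; a bit of
    the result is 1 iff at least two replicas agree on 1 for that bit,
    mirroring the gate-level AND-OR voter network.
    """
    return (a & b) | (b & c) | (a & c)

# Example: one replica suffers a single-bit upset in bit 2,
# but the majority output still matches the fault-free value.
fault_free = 0b1010
upset = fault_free ^ 0b0100  # bit-flip injected in one replica
assert tmr_vote(fault_free, upset, fault_free) == fault_free
```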
D. OUT-OF-ORDER SUPERSCALAR PROCESSORS
Besides transistor scaling, architectural innovations such as deep pipelining, instruction-level parallelism, out-of-order execution, speculative execution, and branch prediction have tremendously increased the computing capabilities of microprocessors. Almost all current-generation microprocessors are designed with such functionalities to ensure high system performance. For example, superscalar processors exploit an application's instruction-level parallelism to execute multiple instructions in parallel during the same clock-cycle on multiple different execution units [20]. Out-of-order processors execute instructions out of order, as opposed to the typical sequential execution, by exploiting the inter-dependency, or the lack thereof, of program instructions and the data processed by them [21]. This allows for executing ''independent'' instructions in clock-cycles that would otherwise be lost to pipeline stalls caused by control- or data-flow dependencies. Fig. 5 illustrates the control- and data-path of the ALPHA 21264 out-of-order superscalar microprocessor [19], which is widely used in the architecture research community.

FIGURE 5. ALPHA 21264 out-of-order superscalar processor architecture (adapted from [19]).
The Alpha 21264, or Alpha 7, is a four-issue superscalar processor architecture with a seven-stage pipeline, capable of executing up to six instructions per cycle (four integer and two floating-point) while sustaining four instructions per cycle. During a program's execution, the processor can accommodate up to 80 in-flight instructions, which are tracked using the processor's re-order buffer (ROB). The Alpha 7 processor also includes two cache levels, i.e., the primary and secondary caches. The processor uses a modified Harvard architecture that implements separate primary instruction (I-cache) and data (D-cache) caches, typically of 64KB each. The D-cache is dual-ported to allow simultaneous reads and writes on both the rising and falling edges of the clock. This feature reduces the area and power overheads associated with duplicating the cache, as was done in the Alpha 21164 microprocessor. The secondary cache, or B-cache, is usually a direct-mapped cache that is located off-chip and shared by all processor cores. Typically, the L2-cache has a maximum capacity of 16MB and is constructed using synchronous static random access memory (SSRAM), which is accessed using a dedicated 128-bit high-bandwidth bus [22]. Branch prediction in this microprocessor is implemented using a hybrid two-level branch prediction algorithm called tournament prediction, with a minimum branch misprediction penalty of 7 clock-cycles [22]. The processor was built using 15.2 million transistors, roughly 40% of which were occupied by the core processing unit, with the rest consumed by the caches and branch history tables [23].
III. SYSTEM MODEL
A. ARCHITECTURE MODEL
To cater to different application workloads with varying reliability requirements, we envision a reliability-heterogeneous multi-core processor (HMC):

HMC = {PC_1, PC_2, ..., PC_M},

where PC_j denotes the j-th processor core, with j ∈ {1, 2, ..., M}, for a total of M processor cores in the HMC. Each processor core consists of L architectural components, denoted as:

PC_j = {C_(j,1), C_(j,2), ..., C_(j,L)},

where C_(j,k) denotes the k-th component of the j-th processor core. Each architectural component (like the re-order buffer, register file, instruction queue, etc.) of each processor core can be hardened using mechanisms like TMR, DMR, checkpointing and rollback, error-correcting codes, or Razor latches. We denote the i-th reliability technique applied to component C_(j,k) as RT_i(C_(j,k)). Without loss of generality, in this work we explore the applicability of TMR for designing the heterogeneous reliability modes. This leads to i ∈ {0, 1}, where RT(C_(j,k)) = 0 denotes an unprotected component without any hardening and RT(C_(j,k)) = 1 denotes a component hardened by triple modular redundancy, thereby enabling heterogeneous hardening.
The area of each processor core is denoted as A(PC_j), which is the sum of the areas of all its components, including the overhead of hardening selected components. Note that, due to the total power constraint of the system, only a subset of the different heterogeneous reliability modes can be activated at run time while considering the application's reliability requirements. An overview of the symbols used in this work and their denotations is presented in Table 1.
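For illustration, this architecture model can be captured in a few lines of Python; the component names, per-component areas, and the TMR overhead factor below are illustrative placeholders, not measured values from our designs.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    area: float              # base area of the unprotected component
    hardened: bool = False   # RT(C_(j,k)): False = unprotected, True = TMR

    def effective_area(self, tmr_factor: float = 3.2) -> float:
        # TMR triplicates the component and adds a voter circuit; the
        # 3.2x factor is an illustrative assumption, not a measurement.
        return self.area * tmr_factor if self.hardened else self.area

@dataclass
class ProcessorCore:
    components: list         # the L architectural components C_(j,1..L)

    def area(self) -> float:
        # A(PC_j): sum of all component areas, including the overhead
        # of hardening the selected components.
        return sum(c.effective_area() for c in self.components)

# A core configured in a mode that hardens only the re-order buffer:
core = ProcessorCore([
    Component("ROB", 0.8, hardened=True),
    Component("RegisterFile", 1.1),
    Component("IntALU", 0.6),
])
print(f"A(PC_j) = {core.area():.2f}")  # 0.8*3.2 + 1.1 + 0.6 = 4.26
```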
B. APPLICATION MODEL
The applications are modeled as a set of task graphs {T, E} containing the task and dependency information for all application workloads. T is denoted as T = {T_1, T_2, ..., T_Z} for a set of Z tasks, and E = {E_xy | T_x, T_y ∈ T} is the set of task dependencies, where E_xy denotes a dependency between tasks T_x and T_y. For a given processor core PC_j, each task T_q has the following execution properties:
• P(T_q, PC_j), which denotes the peak power consumption,
• L(T_q, PC_j), which denotes the average performance in terms of execution time, and
• FPVF(T_q, PC_j), which denotes the full-processor vulnerability factor.
C. RELIABILITY MODEL
The Architectural Vulnerability Factor (AVF) of a hardware component is defined as the probability that a fault in that component propagates to the final output and results in an execution error [24]. We compute the AVF of a component C_(j,k) as the fraction of its bits that are vulnerable in each cycle (VulnerableBits) over the total number of output bits (TotalBits) generated by C_(j,k), averaged over a duration of N cycles. The AVF of a component C_(j,k) is '0' if the component is hardened or produces no architecturally incorrect bits [24]. Note that all bits of a branch predictor are always architecturally correct, therefore a branch predictor's AVF is always '0'. Similarly, all bits of the program counter (PC) are always vulnerable, therefore the AVF of the PC is always '100%' [24]. The AVF is estimated using the following equation:

AVF(C_(j,k)) = ( Σ_{n=1}^{N} VulnerableBits_n ) / ( TotalBits × N )
To study the impact of component hardening on the full processor, we extend the AVF to define the Full-Processor Vulnerability Factor (FPVF) for a given application workload. We define FPVF as the ratio of the total number of vulnerable bits (VulnerableBits) in the processor pipeline, weighted by the duration for which they are vulnerable (VulnerableTime), to the total number of bits in the processor pipeline (TotalBits) over the total duration of application execution (TotalTime). It is computed using the following equation:

FPVF = ( Σ_{∀ C_(j,k)} VulnerableBits_(j,k) × VulnerableTime_(j,k) ) / ( TotalBits × TotalTime )
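As a minimal illustration of how these two metrics are evaluated from simulation traces, the sketch below assumes per-cycle vulnerable-bit counts have already been extracted from the simulator; all numbers are toy values.

```python
def avf(vulnerable_bits_per_cycle, total_bits):
    """AVF of one component C_(j,k): average fraction of its output
    bits that are vulnerable per cycle, over N simulated cycles."""
    n_cycles = len(vulnerable_bits_per_cycle)
    return sum(vulnerable_bits_per_cycle) / (total_bits * n_cycles)

def fpvf(components, total_time):
    """FPVF: vulnerable bit-time over total bit-time of the pipeline.

    `components` maps a component name to a tuple
    (vulnerable_bits, vulnerable_time, total_bits); a hardened
    component contributes zero vulnerable bits by definition.
    """
    vulnerable = sum(vb * vt for vb, vt, _ in components.values())
    capacity = sum(tb for _, _, tb in components.values()) * total_time
    return vulnerable / capacity

# Toy example: two components over an execution of 1000 cycles.
print(avf([10, 20, 30], total_bits=64))  # -> 0.3125
print(fpvf({"ROB": (512, 400, 4096),
            "IQ":  (128, 900, 2048)}, total_time=1000))
```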
FIGURE 6. Overview of our architecture-space generation and exploration methodology for hardening out-of-order superscalar heterogeneous multi-core processors.
IV. HETEROGENEOUS RELIABILITY MODES OF OUT-OF-ORDER SUPERSCALAR CORES
A. METHODOLOGY OVERVIEW
Fig. 6 presents an overview of our methodology for designing and exploring heterogeneous reliability modes for out-of-order superscalar multi-core processors. Our methodology targets two approaches for designing heterogeneous reliability modes: (1) redundancy, and (2) checkpointing. To ensure reliable execution at the hardware layer, we propose hardening the processor's highly vulnerable pipeline components. These pipeline components are selected based on initial fault-injection experiments, or on the AVF values that are estimated from the number of vulnerable bits and the vulnerable time of each component (see the model description in Section III). Furthermore, we ensure reliability by investigating state compression techniques that can reduce the size of checkpoint data. Before presenting our fault-injection and vulnerability analyses, we first describe our experimental setup.
B. EXPERIMENTAL SETUP
To evaluate the vulnerability, power, and area requirements of the proposed heterogeneous reliability modes, we have modified well-established open-source tools, namely the cycle-accurate system simulator gem5 [25] and HP's power and area estimation tool McPAT [26]. Our extensions to these toolchains provide the following functionality: (1) estimating the vulnerability of all pipeline components by determining their AVFs [24]; (2) supporting heterogeneous reliability modes by hardening key pipeline components using component-level redundancy [11], rather than permanent full-scale pipeline triplication; and (3) checkpointing processor states using mechanisms like Distributed Multi-Threaded Checkpointing (DMTCP) [27], [28] and the Hash-Based Incremental Checkpointing Tool (HBICT) [29], [30]. Due to its high customization capability, we use the Alpha 21264 four-issue out-of-order superscalar core [19] as our target platform, running the Linux 2.6 kernel available with the default installation of gem5 as the operating system. Furthermore, we extend the concept of AVF towards the FPVF metric (see Section III) to evaluate the impact of component hardening on each reliability mode for a given application workload. To account for a wide range of applications, we evaluate the proposed heterogeneous reliability modes using the MiBench application benchmark suite. Fig. 7 presents an overview of our experimental setup.
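As a small illustration of the vulnerability-estimation flow, the sketch below parses a gem5 statistics dump. The counter names are hypothetical placeholders for the AVF counters added by our extensions; actual names depend on the gem5 version and configuration.

```python
import re

# Extract per-component counters from a gem5 stats dump
# (m5out/stats.txt is gem5's default output location).
STAT_LINE = re.compile(r"^(\S+)\s+([-\d.eE]+)")

def load_stats(path="m5out/stats.txt"):
    stats = {}
    with open(path) as f:
        for line in f:
            m = STAT_LINE.match(line)
            if m:
                try:
                    stats[m.group(1)] = float(m.group(2))
                except ValueError:
                    pass  # skip non-numeric entries
    return stats

stats = load_stats()
# Hypothetical counters emitted by our modified gem5:
rob_avf = (stats.get("system.cpu.rob.vulnerable_bit_cycles", 0.0)
           / max(stats.get("system.cpu.rob.total_bit_cycles", 1.0), 1.0))
print(f"ROB AVF ~ {rob_avf:.4f}")
```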
C. VULNERABILITY ANALYSIS
We evaluate the vulnerability of the components of an out-of-order superscalar Alpha 21264 core [19] for the Bit-counts, SHA, Dijkstra, and Patricia application workloads [31], analyzing the key pipeline components of the core. The results of our vulnerability analyses are presented in Figs. 8 and 9.
From the results obtained, we make the following key observations:
• The AVFs of the different pipeline components vary across application workloads.
• We have identified three key pipeline components (Integer ALU, Store Queue, and Re-order Buffer) that are more vulnerable during the execution of SHA when compared to Bit-counts.
• Similarly, the re-order buffer is 27% and 46% less vulnerable to soft errors during the execution of Patricia and Bit-counts, respectively, when compared to workloads like SHA and Dijkstra.
• Similar differences in component AVFs can be observed when different multi-threaded application workloads from the PARSEC benchmark suite are executed on a multi-core processor, as shown in Fig. 9. These components have different AVFs because of the types of instructions being executed and the application-specific properties (compute- or memory-intensity, instruction-level parallelism, cache hit/miss rate, etc.). For example, components like the re-order buffer and the store queue are more vulnerable in SHA because of its higher level of instruction-level parallelism and larger number of store instructions.
Based on this analysis, we can infer that hardening certain components of the pipeline increases the reliability of a core more than hardening others. Therefore, we generate a wide range of reliability-heterogeneous Alpha cores and explore this architectural space in terms of reliability, power, and area, to select a configuration that increases the reliability of application executions while reducing the area and power overheads.
D. FAULT INJECTION
Fault injection techniques are typically used to study, analyze, and evaluate the behavior of a system susceptible to faults [32]-[34]. The fault model for the ALPHA core components is based on single- and multi-bit transient faults. The soft error rate for each component is defined as the product of the raw error rate and the component's AVF; the soft error rates of the processor's pipeline components have been derived from the works presented in [35], [36]. To account for a component's spatial vulnerability, the number of faults injected in a pipeline component (N_FI) is proportional to its on-chip area. We define P_flip as the probability that a high-energy particle strike leads to a change in the logic state of a pipeline component. Furthermore, to facilitate fast simulation, the faults are injected in the region of interest, i.e., the components, registers, and cache lines used by the application. The application output is classified into three major categories, namely, (1) correct output, (2) incorrect output, and (3) program failures (N_error), which comprise multiple scenarios such as unaligned instruction, unmapped address, and segmentation fault. The error rate P_error, i.e., the probability that a transient fault in the component leads to an error in the application execution, is defined as follows:

P_error = P_flip × ( N_error / N_FI )
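The area-proportional fault budgeting and the error-rate computation can be sketched as follows. The component areas and counts are illustrative values, and the P_error expression follows the symbol definitions above.

```python
def plan_injections(component_areas, fault_budget):
    """Distribute the fault-injection budget across pipeline
    components in proportion to their on-chip area (N_FI)."""
    total_area = sum(component_areas.values())
    return {c: round(fault_budget * a / total_area)
            for c, a in component_areas.items()}

def error_rate(n_error, n_fi, p_flip):
    # P_error = P_flip * (N_error / N_FI): fraction of injected strikes
    # that flip state and lead to a program failure.
    return p_flip * n_error / n_fi

# Illustrative areas (not measured values) and fault counts:
areas = {"L2": 6.0, "IntALU": 0.6, "IQ": 0.9}
print(plan_injections(areas, fault_budget=1000))  # {'L2': 800, ...}
print(error_rate(n_error=120, n_fi=1000, p_flip=0.8))  # -> 0.096
```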
An overview of the methodology used to inject and analyze faults in various pipeline components is presented in Fig. 10. Based on the vulnerability and fault models presented in Section III and the configuration of the target processor, including its pipeline components, we generate a list of fault files, which is provided as input to the fault injection engine. This engine inserts faults/bit-flips into the target processor platform during the application's execution using the cycle-accurate simulator gem5. Though 1-bit and 2-bit faults are the most common, we evaluate our techniques under multiple fault cases to study the efficacy of the proposed contributions. For instance, 4-bit multi-bit upsets (MBUs) are indeed rare and may only occur when a very high-energy particle strikes a nanoscale transistor at high altitude. Nevertheless, we included this aggressive case in our fault injection experiments to identify the criticality of pipeline components in extreme scenarios, i.e., to find the components that are highly vulnerable to soft errors, and to observe the error rates and types when injecting single- vs. multi-bit faults and whether similar fault trends are observed. The architectural parameters of the Alpha processor and of the fault injection experiments are listed in Tables 2 and 3. We study the output obtained from these simulations, which contains a list of correct and erroneous outputs. These outputs are then compared against the golden execution to estimate the type of error and the frequency of error occurrences for the various pipeline components. A subset of the results obtained from this experiment is illustrated in Fig. 11.
The results in Fig. 11 depict the error rates of three pipeline components, namely, the Level-II cache, the integer arithmetic logic unit, and the instruction queue. Faults injected in the L2-cache lead to four major types of error, besides correct output; the remaining types are classified into the ''others'' category. The four major error categories are: (1) incorrect output, (2) unaligned instruction, (3) unknown instruction, and (4) out of memory. The label A marks the applications with a higher percentage of correct output when compared to the others. On average, the Bit-counts and SHA applications produce a correct output more than 80% of the time, whereas Dijkstra and Patricia, on average, produce a correct output less than 70% and 60% of the time, respectively. The changes in L2-cache vulnerability can be attributed to two factors, i.e., the amount of data being accessed and the number of load/store instructions in the application. The reduced vulnerability of the L2-cache for the applications marked by label A can be attributed to the smaller amount of data accessed from the L2-cache and the lower number of load/store instructions in Bit-counts and SHA. This directly corresponds to a higher number of L1-cache hits, thereby reducing the criticality of the data present in the L2-cache and reducing its architectural vulnerability. Therefore, the probability of a soft error in the L2-cache leading to an execution error is higher for applications with relatively more load and store instructions and more data accessed from the L2-cache. Similarly, label B marks the percentage of fault injection experiments that lead to an unmapped address: due to the higher number of load and store instructions in Dijkstra and Patricia, the large number of unmapped addresses can be attributed to the corruption of bits during address generation. Likewise, due to their compute-intensive nature, a higher number of incorrect outputs are generated by faults injected in the ALU during the execution of applications like Bit-counts and SHA. Faults injected in the instruction queue cause three major types of error, namely, (1) unknown instruction, (2) invalid instruction, and (3) segmentation fault.
E. HETEROGENEOUS RELIABILITY MODES FOR ALPHA CORES
As discussed in Section IV-C, the AVF of the pipeline components varies across application workloads. Hence, we propose to harden a combination of the key pipeline components in out-of-order superscalar processors, instead of employing full-scale TMR across the complete pipeline, to increase core reliability while reducing the area and power overheads of full-scale TMR. This generates a design space of multiple heterogeneous reliability modes (RMs), nine of which are illustrated in this work, in addition to the unprotected core. Table 4 presents our list of nine proposed heterogeneous RMs and the components that are hardened in each mode using TMR. Hardened components have three instances with the same inputs and a voter circuit at the output to determine the majority. An overview of the proposed heterogeneous reliability modes for the Alpha 7 processor is presented in Fig. 12.
We evaluate the vulnerability of our heterogeneous reliability modes by executing applications from the MiBench application benchmark to estimate the FPVF for each scenario. We also evaluate the area and power overheads incurred by each reliability mode. These results are illustrated in Fig. 13 .
From the results obtained, we make the following key observations:
• Different heterogeneous reliability modes can reduce the full-processor vulnerability to different extents depending upon the properties of the executing application. For example, reliability modes like RM2, RM6, and RM9 reduce the processor vulnerability of SHA by more than 50%, but not of Dijkstra, even though they have similar vulnerabilities in all other reliability modes.
• Hardening specific components in the pipeline can significantly reduce the overall processor vulnerability. For example, hardening key components like the rename map and the re-order buffer (ROB) effectively reduces the FPVF for all applications, as shown by the heterogeneous reliability modes RM4, RM7, and RM8. However, utilizing these hardening modes incurs significant area and power overheads.
• Certain heterogeneous reliability modes are very effective in reducing the FPVF by a large margin for a comparatively small area/power overhead. For example, RM2 and RM6 reduce the FPVF by more than 50% at area and power overheads below 75% when executing SHA.
• Hardening all pipeline components except the most vulnerable component of the system introduces very high overheads without significantly reducing the system's vulnerability. This is illustrated by reliability mode RM9, in which the ROB is not hardened. This mode has area and power overheads close to ∼200% with insignificant reductions in FPVF when compared to RM4, which significantly reduces the FPVF at comparatively lower overheads.

Using the data gathered from the simulation of our designs, we perform a design space exploration that trades off FPVF, area, and power overheads to extract the Pareto-optimal designs that best suit the target application. The pseudocode of the Pareto-frontier extraction algorithm is presented in Algorithm 1, and the corresponding results are illustrated in Fig. 14. The x-axis denotes the FPVF, whereas the y- and z-axes denote the power and area overheads, respectively. The design labeled U in all applications is the unprotected core, which is highly vulnerable to soft errors; as it does not deploy any redundancy measures, it has zero area and power overhead and hence lies on the Pareto-front. The Pareto-optimal reliability modes for the applications are presented in Table 5. RM4 is Pareto-optimal for all applications except SHA, whose register file is highly vulnerable to soft errors during execution and needs to be hardened to reduce its vulnerability. The reliability mode RM7 is Pareto-optimal for all four applications and reduces the FPVF by 87% on average, with average area and power overheads of 10% and 43%, respectively.
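For clarity, the dominance filter at the heart of Algorithm 1 can be re-expressed compactly as follows. This is our sketch of the idea rather than the paper's exact listing, and the objective values below are toy numbers (only RM7's 43%/10% overheads are taken from the results above).

```python
def pareto_front(designs):
    """Return the non-dominated (FPVF, power, area) design points.

    A point is dominated if another point is no worse in all three
    objectives (lower is better for each) and differs in at least one.
    """
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b

    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]

# (FPVF, power overhead, area overhead) per mode -- toy values only
modes = {
    "U":   (0.40, 0.00, 0.00),
    "RM4": (0.08, 0.55, 0.35),
    "RM7": (0.05, 0.43, 0.10),
    "RM9": (0.07, 2.00, 1.95),
}
front = pareto_front(list(modes.values()))
print([name for name, point in modes.items() if point in front])
# -> ['U', 'RM7']: here RM4 and RM9 are dominated by RM7
```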
A super-set of the Pareto-optimal reliability modes for all these applications can be selected to design a heterogeneous multi-core processor. We can build the chip by selecting the reliability modes from this super-set such that the form-factor and cost constraints are adhered to. At run time, the required reliability modes can be switched on/off depending upon the power constraints of the system.
Overhead Analysis: The design-time methodology for architecture-space exploration is, fundamentally, a heuristic and identifies a design-time configuration of the microprocessor very quickly, typically within minutes. The run-time system is also a very simple heuristic and therefore requires only a few hundred cycles to reach a run-time solution, where the exact time depends upon the number of cores in the system, the number of protected components, the types of reliability modes, and the number of executing applications.
The simulation time of each experiment (different from the simulated cycles of the processor in gem5) is on the order of several tens of minutes, and since we execute numerous fault injection campaigns, the overall testing time spans multiple weeks. Note that the computations and simulations performed inside gem5 also depend on the computational resources of the host platform, the number of tasks simultaneously executing on the host, and the resources dedicated to the simulation environment.
F. RUN-TIME SYSTEM
Although this work focuses mostly on the design-time aspects of achieving heterogeneous reliability in out-of-order superscalar processors and on studying their reliability vs. power/area trade-offs, in this sub-section we present a brief overview of a run-time system for our proposed heterogeneous multi-core processor, which aims at selecting the set of Pareto-optimal modes for the cores such that the vulnerability of their respective applications is minimized while satisfying their power constraints. For evaluation, we consider a 10-core processor that is composed of all 10 heterogeneous reliability modes discussed in sub-section IV-E. We illustrate the benefits of our reliability modes by executing 5 application workload mixes, the compositions of which are presented in Table 6, on the 10-core heterogeneous processor to evaluate the power overheads and FPVF of the multi-core system for each workload mix. The task-to-core mapping can be done using one of the following heuristics: (1) Vulnerability-Constrained Power Minimization: In this technique (see Fig. 15), we impose a vulnerability constraint on each task in the mix, i.e., each task is only mapped, sequentially, to a core that can execute the task under the imposed vulnerability constraint. If a suitable core (one that satisfies the vulnerability constraint) is not available, the task is not scheduled immediately. The goal of this approach is to minimize the power overhead of the complete processor.
(2) Power-Constrained Vulnerability Minimization:
This approach (Fig. 16) imposes a constraint on the maximum power overhead of the whole processor, i.e., the task-to-core mapping is stalled when the power constraint (a 100% overhead for each task in the mix, in our experiments) would be exceeded; a greedy sketch of this policy is shown at the end of this sub-section. The goal of this task mapping policy is to minimize the FPVF. The results of this evaluation are presented in Fig. 17, from which we make the following key observations:
• The proposed reliability modes can be deployed in a heterogeneous multi-core processor to reduce the power overheads of the executing application workloads, based on the application's workload requirement.
• The proposed reliability modes can either be used to minimize the power overhead or the full-processor vulnerability factor as illustrated by the two task mapping policies.
Although 100% task mapping is not always achieved, unlike in the unprotected or fully protected cases, this can be resolved by carefully selecting the reliability modes to be deployed in the HMC in view of the potential application workloads, and/or by using a task mapping algorithm that can more efficiently schedule the tasks onto the processor cores.
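As referenced above, a greedy sketch of the Power-Constrained Vulnerability Minimization policy is given below. The per-task, per-core (FPVF, power) profiles are illustrative stand-ins for the profiled FPVF(T_q, PC_j) and P(T_q, PC_j) values, and the actual run-time system may differ in its scheduling details.

```python
def map_tasks_power_constrained(tasks, cores, power_budget):
    """Greedy sketch: each task goes to the free core with the lowest
    FPVF for that task, provided the summed power overhead stays
    within the budget; otherwise the task is stalled."""
    mapping, used_power, free = {}, 0.0, set(cores)
    for name, profile in tasks.items():  # profile: core -> (fpvf, power)
        feasible = [c for c in free
                    if used_power + profile[c][1] <= power_budget]
        if not feasible:
            continue  # stall: no core satisfies the power constraint
        best = min(feasible, key=lambda c: profile[c][0])
        mapping[name] = best
        used_power += profile[best][1]
        free.discard(best)
    return mapping

# Toy per-core (FPVF, power overhead) profiles for two tasks:
tasks = {
    "SHA":      {"RM7_core": (0.05, 0.43), "U_core": (0.40, 0.00)},
    "Dijkstra": {"RM7_core": (0.06, 0.43), "U_core": (0.35, 0.00)},
}
print(map_tasks_power_constrained(tasks, ["RM7_core", "U_core"], 0.5))
# -> {'SHA': 'RM7_core', 'Dijkstra': 'U_core'}
```

The vulnerability-constrained variant is symmetric: it filters cores by a per-task FPVF bound and then picks the feasible core with the lowest power.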
G. STATE COMPRESSION TECHNIQUES
Checkpointing and rollback is an effective way of guaranteeing reliability at the software layer by providing both spatial and temporal redundancy. A checkpoint is a snapshot of the processor state at an instant in time; checkpoints allow the system to roll back to a previous safe state and re-execute instructions in case a failure is detected. Fig. 18 presents an overview of the methodology that we use for checkpointing and state compression. Checkpoints are typically inserted intermittently into the target application for periodic state retention and, if required, rollback to an earlier processor state in case of faulty execution. Typically, the collected processor state information is stored in the main memory or in off-chip non-volatile memory, which can still be used for a rollback in case of power-off. In our case, to reduce the size of the checkpointing data, we introduce an additional state compression stage that utilizes state-of-the-art compression techniques to generate a wide range of compressed checkpoint variants. The optimal compressed variant can be selected based on the system's resource constraints and the available on-/off-chip memory. In case a fault is detected in the current processor state during the application's execution, the previous safe state is decompressed and restored to ensure the correct execution of the application.
The standard checkpointing mechanism deployed by gem5 comes with certain caveats: it does not preserve the cache and pipeline states in a checkpoint, because of which frequent restoration from such checkpoints would result in a performance loss if deployed in real-world systems. Therefore, we explore techniques like DMTCP [27], [28], which checkpoints the Linux process and thereby stores the processor state as well as the data present in the cache hierarchy. The back-end checkpointing mechanism of DMTCP is accessible to the programmer via numerous APIs, which can be used in conjunction with the front-end gem5 pseudo-instructions for checkpoint creation/recovery. Since these software-based checkpoints are often large, the checkpoint is compressed using gzip and HBICT to save memory. HBICT [29], [30] provides DMTCP support for delta-compression (storing changes relative to the previous checkpoint), whose output is further compressed using gzip (a combination of the lossless data compression algorithms LZ77 and Huffman coding).
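The final gzip stage of this pipeline can be illustrated with Python's standard gzip module. The checkpoint file name below is a hypothetical placeholder; the DMTCP (and optionally HBICT) stages are assumed to have produced it already.

```python
import gzip
import os
import shutil

# "ckpt_rank0.dmtcp" is a hypothetical checkpoint image name.
src = "ckpt_rank0.dmtcp"
dst = src + ".gz"

# Stream-compress the checkpoint image without loading it into memory.
with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
    shutil.copyfileobj(fin, fout)

factor = os.path.getsize(src) / os.path.getsize(dst)
print(f"checkpoint compressed by {factor:.1f}x")
```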
We investigate the effectiveness of these techniques, in all possible combinations and applied one after the other, on applications from the MiBench benchmark suite simulated on the ALPHA core using gem5. The results of this experiment are presented in Fig. 19. From these results, we make the following key observations:
• The combination of DMTCP and gzip is highly successful, reducing the checkpoint size by ∼6×.
• The combination of DMTCP, HBICT, and gzip reduces the checkpoint size by ∼5.7×. HBICT, which utilizes delta-compression, requires all previous checkpoints for an efficient rollback. Since the base file size of HBICT+DMTCP is 1.03× larger than the file size of DMTCP alone, the effectiveness of the combined state compression technique (DMTCP+HBICT+gzip), with respect to DMTCP, is slightly reduced.
V. RELATED WORK
Reliability is a major research challenge that is being tackled by the community at large via global initiatives like the NSF's Variability Expedition and the DFG's SPP 1500 Priority Program. Research works from academia and industry alike have addressed the challenges associated with technology scaling across the layers of the computing stack.
A. MITIGATION STRATEGIES
The work in [38] presents the Razor approach, which dynamically detects and corrects timing errors by monitoring the error rate at run time to tune the circuit's supply voltage. The adaptive approach presented in [39] enables per-core dual modular redundancy (DMR) by means of DVFS to offer a stable soft error rate (SER). An OS-level dynamic reliability management system for heterogeneous architectures, which achieves an optimal trade-off between reliability (lifetime) and power/performance efficiency, is presented in [40]. A software-level technique is presented in [9], which detects errors by duplicating instructions at compile time, using different variables and registers for the new instructions. A software-controlled fault-tolerance scheme is proposed in [41] that allows programmers and designers to trade off performance and reliability based on the system's requirements. Luo et al. [42] quantify the tolerance of applications to memory errors and propose several new hardware/software heterogeneous-reliability memory systems to reduce their vulnerabilities and data-center costs. A hardware-software co-design approach for soft error mitigation in embedded systems has been proposed in [43], which includes a generic software hardening environment that is used to generate ''hardened'' code variants, and an FPGA-based hardening infrastructure called FTUnshades, which is used to assess the reliability of the complete hardware-software stack of the embedded system.
B. RELIABILITY MODELING
The work in [44] demonstrates the concept of the Program Vulnerability Factor, which captures the architecture-level fault-masking properties of the underlying program while exhibiting the workload-driven changes in the AVF of all architectural components. Li et al. [45] analyze the correlation between the soft error rate and the energy consumption behavior of on-chip data caches. This involves analyzing (1) the effect of leakage energy optimizations on soft errors, and (2) the energy overheads of protecting on-chip memories against soft errors. A software-level technique proposed in [46] introduces transient fault tolerance in a multi-core system by exploiting process-level redundancy (PLR) to create multiple application threads and compare them to ensure the correct execution of the application. A software-level approach to enable self-adaptive reliability for multi-/many-core systems is proposed in [47], which activates redundancy measures based on the application's dependability requirements. A simultaneous and redundantly threaded (SRT) processor is presented in [48], which provides transient fault tolerance with significantly higher performance; redundant copies of the program threads are executed simultaneously on the SRT to ensure accurate application execution. Kriebel et al. [49] analyze the reliability issues of on-chip memory systems and propose a reliability-aware reconfigurable last-level cache architecture that adapts the cache parameters of concurrently executing multi-threaded workloads at run time to minimize their vulnerabilities. A soft error-aware cache architectural space-exploration methodology is presented in [50], covering varying application workloads and cache parameters for the complete cache hierarchy. An adaptive soft-error resilience (ASER) approach is presented in [51], which proposes and manages reliability-heterogeneous dark-silicon many-core processors (darkRHPs); the proposed darkRHPs deploy redundancy at the architecture level, i.e., hardening the full processor pipeline of an in-order LEON3 processor and/or its caches. The work in [52] presents an approach that exploits the on-chip dark silicon to synergistically mitigate the reliability and variability challenges associated with transistor technology scaling. An overview of different heterogeneous fault-tolerance schemes for both the hardware and software layers is presented in [11], which also provides an initial proof-of-concept of this work.
This work, on the other hand, focuses on generating and exploring a wide range of heterogeneous reliability modes using two key approaches, i.e., (1) Redundancy, by hardening different combinations of the pipeline components for an out-of-order superscalar processor, and (2) Checkpointing, by reducing the size of the checkpoint data using efficient compression techniques.
VI. CONCLUSION
In this work, we presented a novel architectural-space generation and exploration methodology that is used to develop a wide range of heterogeneous reliability modes for out-of-order superscalar processors. By analyzing the architectural vulnerability of key pipeline components, we have observed that the pipeline components have varying architectural vulnerability factors for different application workloads. Based on this observation, we propose to harden the pipeline components in multiple different combinations, with varying levels of reliability, to cater to the application's requirements while minimizing the power/area overhead. We have also extended the AVF metric to define the Full-Processor Vulnerability Factor (FPVF), which can be used to estimate the vulnerability of the processor as a whole for a given application workload, instead of analyzing the vulnerability of each component separately. The Pareto-optimal reliability mode RM7 reduces the FPVF by 87% on average, with area and power overheads of 10% and 43%, respectively. We have also illustrated the benefits of our proposed approach at run time by evaluating two simple task-mapping strategies, which can be used to minimize either the power or the processor vulnerability based on the system's constraints. To further enhance our design space for heterogeneous reliability, we also investigated effective state-compression techniques that reduce the data size of a checkpoint by ∼6×. Our studies illustrate that, in power-constrained scenarios, enabling reliability at a fine granularity and deploying reliability-heterogeneous superscalar out-of-order processors bear significant potential for real-world systems, especially when considering the diverse vulnerability profiles of different applications, which can further vary depending upon their input workloads.