Abstract: Hardware errors are no longer exceptions in modern cloud data centers. Although virtualization provides software failure isolation among different virtual machines (VMs), the virtualization infrastructure, including the hypervisor and privileged VMs, remains vulnerable to hardware errors. Worse, such errors are unlikely to be contained by the virtualization boundary and may lead to loss of work in multiple guest VMs due to unexpected and/or mishandled failures. To understand the reliability implications of hardware errors in virtualized systems, we develop a simulation-based framework that enables a comprehensive fault injection study on the hypervisor with a wide range of configurations. Our analysis shows that, in current systems, many hardware errors can propagate through various paths for an extended time before an observed failure (e.g., a whole-system crash). We further discuss the challenges of designing error tolerance techniques for the hypervisor.
INTRODUCTION
In the era of warehouse-scale computing, a typical data center consists of hundreds of thousands of servers with multi-core processors, multi-level caches, and a large amount of main memory. None of these components (logic circuits in the CPU, memory cells in caches and main memory) is immune to soft errors [1], [2]. Technology scaling already points to escalating soft error rates in future chips [3], which, combined with high server counts, would result in a shorter MTBF (mean time between failures) and require system-level techniques for fault prevention, detection, and recovery.
On the other hand, virtualization has been widely adopted in large data centers to increase overall resource utilization. In a typical environment, as shown in Fig. 1, a hypervisor provides an abstraction layer on top of hardware resources, working with a privileged VM (e.g., Dom0 in Xen) to serve the requests from guest VMs (e.g., DomU). This abstraction layer can consume a considerable amount of hardware resources (e.g., CPU cycles). Virtualized servers are soon expected to host hundreds of VMs, and dedicated I/O cores will be required for data-intensive VMs [4]. It is not uncommon to see a significant share (e.g., 50 percent) of CPU resources spent on the hypervisor and Dom0, which can be much higher than that of a traditional non-virtualized OS (e.g., 2-10 percent) [5].
Unfortunately, while the failure of a guest VM is unlikely to affect other VMs, a failure in the virtualization infrastructure (the hypervisor and privileged VMs) induced by a hardware error may propagate beyond the virtualization boundary and result in multiple VM failures or, worse, a whole-system failure. The goal of this work is to understand the characteristics of hardware errors in virtualized systems through a comprehensive set of fault injections. To this end, we design a simulation-based fault injection framework, as shown in Fig. 1, and perform over 46,000 injections to characterize the reliability of the hypervisor and the overall virtualized system against soft errors. Our fault injection tool is designed as a module of a full system simulator, requiring minimal modifications to the original system. It can inject soft errors into any function of the target hypervisor, removing the constraints of existing software tools [6], [7], [8].
The main contribution of our work is two-fold. First, our framework enables a comprehensive fault injection study on the hypervisor with a wide range of configurations, including para-virtualization, full virtualization, and VMs running on separate or shared cores. Our fault injection framework targets the most frequently used hypervisor functions. This approach greatly reduces the injection scope while yielding a crash rate about 19 percentage points higher than random fault injection, and reveals some critical cases that would otherwise have been missed. Second, our analysis has led to several key observations. Nearly half of the injected errors result in system-wide crashes that take down all VMs running on the machine. While some of the errors lead to quick crashes, others can propagate to the control and guest VMs, which makes them hard to detect and protect against. We have studied in depth a number of critical error propagation paths.
This paper is an extension of our prior work [9] . The new additions include: (1) an extended motivation section (Section 2) to discuss the vulnerability of the virtualization infrastructure to soft errors; (2) a new section (Section 5) to discuss the details of the experiments; (3) new results and analysis on the error propagation behaviors such as Section 6.6 and detailed error propagation path; and (4) a discussion in Section 7 to evaluate the effectiveness of the existing methods. We also discuss the potential techniques based on the results and analysis shown in this paper.
The rest of the paper is organized as follows. Section 2 presents the motivation and background. Section 3 discusses related work in fault injection and VM fault tolerance techniques. In Section 4, we discuss the framework in detail, including the hypervisor profiling method, fault injection, and the analysis method. Section 5 describes the experimental methodology. Section 6 presents the results and our observations and analysis on error propagation behaviors. We discuss the implications of the results for current fault detection techniques and potential directions for new fault tolerance techniques in Section 7, and conclude in Section 8.
MOTIVATION AND BACKGROUND
At a high level, a virtualized system has three major components. The first component is the hypervisor, such as Xen or KVM, which runs directly on the hardware and provides the key management functions, including CPU and memory virtualization, device I/O virtualization (with some functions provided by the control/driver domain), guest VM management, and the interfaces for communication between the hypervisor and VMs. The second component is the control VM (Dom0), which runs the device drivers and assists the guest VMs in performing I/O operations. The third component is the guest VMs (DomUs), in which the user applications run. Virtualization can be achieved with hardware support (hardware-assisted virtualization or full virtualization) or without hardware support (para-virtualization). Full virtualization provides an emulated hardware platform so that the guest OS runs as if it were on a native machine. Para-virtualization requires the guest VM to run a modified OS to allow access to physical resources in the host machine.
In virtualized systems, the hypervisor and the control VM may be frequently activated to serve requests from hardware and applications. In fact, they may consume a considerable amount of CPU resources, exposing themselves to CPU soft errors.
To demonstrate this problem, we conduct a number of experiments to measure the CPU utilization of the virtualization infrastructure (the hypervisor and the control VM). The experiments are conducted on a Dell R410 server with two six-core Intel Xeon processors. In the resulting plots, the y-axis shows the CPU usage on the multi-core processors while the benchmark is running; a value exceeding 100 percent means that more than one CPU is consumed. Although the applications also consume other system resources such as memory and I/O, our focus in this work is on soft errors in CPUs that may affect the virtualization infrastructure. As CPU utilization increases, so does the likelihood of the virtualized system being affected by CPU soft errors. Therefore, we measure the cumulative CPU usage to demonstrate the vulnerability of the virtualization software code.
The left side of Fig. 2 shows the CPU usage of Dom0 and the hypervisor when I/O-intensive benchmarks are running in two guest VMs. The CPU usage is normalized to the overall system CPU usage including all VMs. For all benchmarks, 40 to 50 percent of the CPU usage is spent on the hypervisor and Dom0, which is significantly higher than that of a traditional OS (2-10 percent) in non-virtualized environments [5]. Moreover, as the system hosts more VMs, the CPU usage becomes even higher. As shown on the right side of Fig. 2, when the number of VMs (running postmark [10]) increases from two to six, the CPU utilization of the virtualization infrastructure increases linearly to more than three times its initial value. When six VMs are running concurrently, the virtualization infrastructure occupies more than one physical core. This means that the chance of the virtualization infrastructure being affected by soft errors is not negligible. It is an important aspect of system reliability that should be taken into consideration.
It is expected that future systems will likely host hundreds of VMs on one host for high resource utilization, and use dedicated I/O cores [11] . In such a system, although the activities of the hypervisor on the guest-domain CPUs can be reduced, the overall workload from the virtualization infrastructure will remain high on these dedicated cores.
To summarize, the virtualized system may have significant hypervisor activities, making it vulnerable to soft errors in CPU. The reliability of the virtualization infrastructure is an important issue that should not be ignored. This observation motivates us to characterize the reliability of virtualization infrastructure against soft errors and analyze error propagation behaviors, providing insights for designing effective fault tolerance techniques.
RELATED WORK
Fault injection is commonly used to evaluate and understand system reliability. Two related projects [8], [12] also conduct fault injection in virtualized systems. In [8], virtual machines are leveraged as the fault injection infrastructure to evaluate system reliability. That work focuses on the system inside the virtual machines, but not on the virtualization infrastructure. In [12], a debugger-based framework is proposed to evaluate virtualization platforms (Xen and KVM) under a variety of faults. However, its soft error injection for the Xen hypervisor is limited to several processes in the control VM rather than the hypervisor itself, owing to the limitations of the debugger-based framework. In contrast, our fault injection framework is implemented on top of a full system simulator, which allows us to inject soft errors into any component of a running hypervisor. The simulation-based framework has better controllability and observability. This is the key to analyzing error propagation behaviors, such as the analysis in Section 6.7. Such analysis not only allows us to evaluate the system reliability but, more importantly, also provides insights for designing new fault tolerance techniques.
Fig. 1. Virtualization architecture on a multicore processor (on the left) and the proposed error injection framework (on the right). A soft error affecting the hypervisor instructions running on core 1 can be propagated on the same core (paths 1 and 2), to other hypervisor instructions on core 3 (path 3), to the control VM instructions on core 0 (path 4), or to one of the guest VMs on core 2 (path 5).
Fault injection is also used in non-virtualized systems to study application resiliency. In [13], a fault injection tool is designed to evaluate high performance computing (HPC) applications. The paper leverages QEMU to inject faults into instructions of HPC applications. The proposed tool allows fine-grained control over the fault injection parameters, such as the timing and error locations. Another fault injection tool for scientific applications is proposed in [14], which is built on PIN [6]. It specifically targets data objects of applications, and allows soft errors to be injected into any data object in a running application. In [15], a fault injection tool is designed to inject multiple independent faults into running applications and analyze their behaviors. The paper studies the characteristics of simultaneous faults and injects faults into device drivers in Windows operating systems. All these projects are software-based fault injection tools, targeting either applications or device drivers in non-virtualized environments. Our work is different because we focus on the hypervisor itself in a virtualized environment. The hypervisor is a lightweight piece of software with the highest privilege in the system. The common tools used for applications, such as PIN, are not suitable for the hypervisor. Therefore, we design a simulation-based framework specifically for this purpose.
FRAMEWORK
The fault injection framework contains four components: 1) a profiler is used to profile the hypervisor and identify the most frequently used functions (i.e., top functions); 2) an analyzer is used to analyze the top functions and generate an injection map containing injection candidates; 3) an injector is used to interact with the simulator and conduct fault injection experiments; and 4) a collector is used to collect the log and system states from a simulated serial port for in-depth error analysis. The architecture is shown in the right side of Fig. 1 . In this section, we describe each component in detail.
Hypervisor Profiling
The main purpose of the profiling phase is to identify the most frequently used functions in the hypervisor as injection candidates. Fault injection is conducted while the system is running. The fault injection space increases as the hypervisor is running. It is not practical to inject all possible faults. To manage the fault injection workload, we inject faults into the most frequently used functions in the hypervisor, as they are more likely affected by CPU soft errors than rarely used functions.
To this end, we profile the Xen 4.1.2 hypervisor using OProfile [16] and the UnixBench [17] benchmark suite. We measure the average percentage of CPU time spent in each hypervisor function. Fig. 3 shows the cumulative utilization of all hypervisor functions (CDF). Due to space limitations, we omit the function names and use numbers to represent them. A function with a smaller number has a higher utilization. The distribution of utilization shows a long tail. Ideally, we would like to inject errors into every function. For practical reasons, we choose the 69 most frequently used functions, which cover about 90 percent of total utilization, to strike a balance between coverage and experiment runtime.
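The selection rule described above (rank functions by CPU share, then keep the shortest prefix that reaches the coverage target) can be sketched as follows. This is a minimal illustration, not the tooling used in this work: the function names and percentages in `profile` are hypothetical placeholders rather than measured OProfile data, and `select_top_functions` is a name introduced here for illustration.

```python
def select_top_functions(cpu_share, coverage=0.90):
    """Return the most-used functions whose cumulative share reaches `coverage`.

    cpu_share maps a function name to its share of hypervisor CPU time.
    The result is (selected_names, achieved_coverage).
    """
    total = sum(cpu_share.values())
    ranked = sorted(cpu_share.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, share in ranked:
        if cumulative >= coverage * total:
            break  # target reached; the long tail is left out
        selected.append(name)
        cumulative += share
    return selected, cumulative / total


# Hypothetical profile (percent of hypervisor CPU time); not measured data.
profile = {"func_a": 50.0, "func_b": 30.0, "func_c": 15.0, "func_d": 5.0}
top, covered = select_top_functions(profile, coverage=0.90)
print(top, covered)
```

With a long-tailed distribution such as the one in Fig. 3, a rule of this kind keeps the injection candidate set small while still covering most of the time the hypervisor actually spends executing.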
We classify these functions into four subsystems according to their functionalities.
CPU management subsystem (CM) provides the interfaces for VMs to access physical CPU resources, including emulating architecture specific instructions and privileged instructions for virtualization, virtual CPU (VCPU) scheduling, and modifying control registers. Taking VCPU scheduling as an example, a (credit based) scheduler will schedule each VCPU to run on physical CPUs exclusively for a certain period of time, and swap the VCPUs out for other VCPUs. For example, function csched_schedule is one of the functions that are used in the credit scheduler. In total, 20 functions are identified in this category.
Memory management subsystem (MM) manages shared DRAM and ensures the isolation of each domain. The Xen hypervisor uses a pseudo-physical memory model to achieve this goal, similar to the concept of the virtual memory in modern operating systems. The hypervisor manages the physical memory, and the pseudo-physical memory is an abstraction layer that is provided to the guest VMs for emulating the physical memory. The applications running within the VMs will use the virtual memory provided by the hypervisor. For example, function page_get_owner_and_reference returns the domain which owns the memory page and the reference count that indicates the uses of this page frame by a domain. In total, 28 functions are identified in this category.
Hypercall and control management subsystem (HC) contains low-level functions that handle system calls or hypercalls. A hypercall page, essentially a memory page, is provided to each guest VM when it is started. When a system call is required in a guest VM, the guest directly calls an address within the hypercall page to initiate the hypercall. For example, function syscall_enter is the low-level handling routine provided by Xen to replace the original one in the Linux kernel. This function saves the current context and passes the hypercall arguments to the hypercall handler in the hypervisor. In total, 13 functions are identified in this category.
Domain management subsystem (DM) provides a set of functions to manage VM states. For example, the function update_vcpu_system_time is used to update the system time of a guest VM. In total, eight functions are identified in this category.
Analysis of Instruction Traces
In this analysis phase, we analyze the instructions of the most frequently used functions to generate an injection map. Because we are targeting architecture-level registers in CPU (more details in Section 5.2), this analysis phase identifies relevant registers in the instruction trace as injection candidates. Each entry in this injection map consists of the target register and the timestamp for injection. We set up a number of breakpoints on the targeted function addresses to trap the activation of these functions. The breakpoint handler is responsible for collecting instruction traces.
Examples of entries in the injection map are shown in Table 1. Intel x86 instructions may contain two source registers, one source register, or no source register. The selection of target registers varies depending on the instruction type, as explained below.
First, for instructions containing two registers, one is a source register (src register) and the other is both a source and a destination register (src-dst register). If a fault is injected into the src-dst register, the original fault is overwritten by the new faulty output value right after the execution of this instruction. In this case, only the src-dst register carries a faulty value after the execution of the current instruction. On the other hand, if a fault is injected into the src register, both registers may contain faulty values after the execution is done. The latter case gives the injected error a larger chance to propagate, so we inject into the src register. Second, for instructions containing one register, this register is selected as the injection candidate. Third, instructions without registers are skipped unless they are branch instructions (with a branch target address). If an error occurs in the address of the branch target, an incorrect instruction will be loaded. Errors in those addresses are thus equivalent to errors in rip, which holds the address of the next instruction. Therefore, we inject the errors into rip before it is used to load the next instruction.
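The three selection rules above can be summarized in a short sketch. The `Instr` record and `pick_target` function are hypothetical names introduced here for illustration; an actual analyzer would derive the operand roles from the disassembled instruction trace, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    """Simplified view of one traced instruction (illustrative only)."""
    mnemonic: str
    src: Optional[str] = None      # register that is only read
    src_dst: Optional[str] = None  # register that is read and then written
    is_branch: bool = False


def pick_target(instr: Instr) -> Optional[Tuple[str, str]]:
    """Return (register, when) for the injection, or None to skip."""
    if instr.src and instr.src_dst:
        # Rule 1: two registers. Inject into the pure source register so
        # that both registers may carry a faulty value afterwards.
        return (instr.src, "before")
    single = instr.src or instr.src_dst
    if single:
        # Rule 2: exactly one register. It is the candidate.
        return (single, "before")
    if instr.is_branch:
        # Rule 3: no register, but a branch target. Model the fault as a
        # flip in rip once the branch has executed.
        return ("rip", "after")
    return None  # no register and not a branch: skipped


print(pick_target(Instr("add", src="rbx", src_dst="rax")))
print(pick_target(Instr("jmp", is_branch=True)))
```

The `when` field mirrors the timing distinction between register operands (inject before the instruction executes) and branch outcomes (inject after the branch executes).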
In this way, target registers are selected to construct an injection map. Each entry in the injection map contains a candidate register and its timestamp. For each candidate, there are 64 possible injections (one per bit). Injecting errors into every bit would take a prohibitive amount of experiment time. Hence, we select five bits that are evenly distributed among the 64 bits for injection, including the most and least significant bits. A similar bit selection strategy has been used in previous work [18].
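As an illustration of the bit selection, the sketch below picks five evenly spaced positions in a 64-bit register and flips one of them. The exact positions used in the experiments are not listed in the text, so the integer-division spacing here is an assumption; `evenly_spaced_bits` and `flip_bit` are names introduced for this example.

```python
def evenly_spaced_bits(width=64, count=5):
    """Pick `count` bit positions spread evenly over a `width`-bit register,
    always including the least (0) and most (width - 1) significant bits."""
    return [i * (width - 1) // (count - 1) for i in range(count)]


def flip_bit(value, bit, width=64):
    """Flip a single bit of `value`, keeping the result within `width` bits."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


bits = evenly_spaced_bits()
print(bits)                      # candidate positions under this spacing rule
print(hex(flip_bit(0x1000, 0)))  # least significant bit flipped
```

Restricting each candidate to five bits reduces the per-candidate injection count from 64 to 5 while still exercising both low-order and high-order corruption.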
Profiling Based Fault Injections
Fault injections are conducted based on the injection map. In each fault injection run, only one fault is injected. For the first two cases in Table 1, the injection is done right before the instruction is executed. Faults injected in this way are guaranteed to be activated when the instruction is executed. For the third case in Table 1, the injection is done after the branch instruction is executed. At that time, the rip register holds the address of the next instruction. The injected fault is activated when the system continues to execute the next instruction. After the fault is injected, we allow the system to continue and observe the system behaviors.
To this end, we monitor hardware exceptions after the injection and also collect the serial console outputs that are generated by the hypervisor to analyze the root causes of the system crashes.
EXPERIMENTAL METHODOLOGY
Simulation Infrastructure
The main goal of our framework is to steer the fault injection experiments towards the faults that would occur in the hypervisor in order to analyze error propagation behaviors. Ideally, one would like to implement the entire fault injection framework in an unmodified computer system, which is a challenging undertaking due to the lack of repeatability and determinism. On the other hand, a simulation environment provides better controllability over the fault injection experiments and better observability for analyzing the system states and outcomes. Therefore, we develop our fault injection framework as modules that run in Simics [19], a widely used full system simulator. In our framework, the instructions are emulated by Simics. Since Simics is a full system simulator, we use the term simulation rather than emulation throughout the paper.
The simulated system is configured with four 64-bit processors (each comparable to an AMD Hammer 64-bit processor), 2 GB of memory, and a 60 GB disk. The system runs Xen 4.1.2 and Debian 6 with Linux kernel 2.6.32. The architecture of the virtualized environment is similar to Fig. 1. We use one Dom0 running on one VCPU and two para-virtualized DomUs, each of which is assigned one VCPU, 512 MB of memory, and a 10 GB virtual disk. These three virtual CPUs are pinned to three separate physical processors. This way, the activities of each domain are limited to its own physical processor. We set up the VMs in this way to observe error propagation between physical CPUs. We select a wide range of benchmarks from PARSEC [20], SPEC2006 [21], and Postmark [10] to exercise I/O (e.g., postmark, freqmine [20], x264 [20]), CPU (e.g., canneal [20], bzip2 [21]), and memory (e.g., mcf [21]) resources. The goal is to generate hypervisor activities and to evaluate the generic hypervisor behaviors under soft errors. These benchmarks are selected to stress the CPU, memory, and I/O subsystems in the hypervisor, so that we obtain high coverage of all types of hypervisor activities. When conducting fault injections, the same benchmark runs in the two DomUs in parallel.
Fault Model
We choose single-bit flip soft errors in architecture-level CPU registers as our fault model, including the general purpose registers and the instruction pointer. Soft errors may occur in CPU registers, functional units, caches, and memory. This work focuses on CPU soft errors, and we plan to explore memory errors as part of our future work. The injection space in hardware is very large. However, not all errors are visible, due to the masking effect. For example, an error injected into a destination register right before its value is updated does not affect the correctness of the architectural state or the results. Errors in dirty cache lines and unallocated memory can be masked as well. Although this masking effect is part of the fault characteristics, it makes the injections inefficient because masked errors do not generate useful results for error analysis. Our fault model automatically filters out those masked errors. Statistical fault injection has been used for evaluating system reliability [22], [23]. In this paper, instead of statistically evaluating the reliability of the virtualized system, we focus on studying error propagation behaviors. Therefore, we choose top function injection to focus on the cases that eventually lead to system failures.
Faults may also occur in CPU components that are not visible to software, such as the re-order buffer (ROB) and branch predictors. Our model covers the non-masked faults among them, as they occur in these components and propagate to architecture-level registers (visible to software) as single-bit flip errors. The single-bit flip is the most common fault model for CPU soft errors, because the chance of multiple errors occurring at exactly the same time is very low. We also find that a single CPU error is already disruptive, and most cases result in a system crash. Multiple-bit errors are likely to cause crashes more quickly and in an even higher percentage of cases.
Injection Process
The analyzer first generates an injection map whose entries contain the targeted CPU, the register, the timestamp, and an injection type indicating whether the injection targets a branch outcome. The injector repeatedly reads entries from the injection map and carries out the injections. In each injection, the injector first loads the initial state from the checkpoint, and then continues execution until the specified timestamp is reached. If the injection type is branch outcome, the fault is injected into the rip register on the specified CPU after the current branch instruction has finished. Otherwise, the fault is injected into the targeted register on the specified CPU before the current instruction is executed. After the injection is done, we allow the system to continue in order to observe error behaviors. A longer observation window provides a higher probability of capturing long-latency faults, but it also increases the total simulation time. As it has been shown that errors with extremely long latency are very rare [24], we use a two-billion-cycle window to achieve a good balance between simulation accuracy and time. To observe the error behaviors, we collect the serial console outputs generated by the hypervisor and the Dom0 kernel. This way, we can analyze the root causes of the system crashes. Since our target system is the hypervisor and VMs, we focus our analysis on those cases that result in VM failures or host system failures.
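The per-entry injection procedure can be outlined as below. This is only a control-flow sketch under stated assumptions: `ToySimulator` is a hypothetical stand-in that tracks time and register values without executing any instructions, and none of its methods correspond to the actual Simics interface. `InjectionEntry` and `run_injection` are likewise illustrative names.

```python
from dataclasses import dataclass

OBSERVATION_WINDOW = 2_000_000_000  # cycles, the window size stated in the text


@dataclass
class InjectionEntry:
    cpu: int              # physical CPU to inject on
    register: str         # target register, e.g. "rax" or "rip"
    timestamp: int        # simulated time at which to inject
    branch_outcome: bool  # True: inject into rip after the branch finishes
    bit: int              # bit position to flip


class ToySimulator:
    """Hypothetical stand-in for a full system simulator (not the Simics API)."""

    def __init__(self, num_cpus=4):
        self.num_cpus = num_cpus
        self.restore_checkpoint()

    def restore_checkpoint(self):
        self.time = 0
        self.regs = [{"rax": 0, "rip": 0x1000} for _ in range(self.num_cpus)]

    def run_until(self, timestamp):
        self.time = timestamp

    def step(self):
        self.time += 1

    def run_for(self, cycles):
        self.time += cycles


def run_injection(sim, entry, window=OBSERVATION_WINDOW):
    sim.restore_checkpoint()          # every run starts from the same state
    sim.run_until(entry.timestamp)
    if entry.branch_outcome:
        sim.step()                    # let the branch finish so rip is updated
    regs = sim.regs[entry.cpu]
    regs[entry.register] ^= 1 << entry.bit  # single-bit flip
    sim.run_for(window)               # observation window after the injection
    return regs[entry.register]


sim = ToySimulator()
entry = InjectionEntry(cpu=1, register="rip", timestamp=100,
                       branch_outcome=True, bit=0)
print(hex(run_injection(sim, entry, window=10)))
```

Restoring the checkpoint before every run is what makes each injection independent and repeatable; in a real setup the observation step would also capture the serial console output for later analysis.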
RESULTS AND ANALYSIS
In our experiments, we conduct a total of 25,000 injections across six benchmarks with para-virtualization, targeting the most frequently used functions (top function injection). As a comparison, we conduct an additional 12,000 random fault injections, in which we randomly choose a region during the execution of the benchmarks and select source registers or instruction pointers of the instructions running on the Dom0 CPU as the injection candidates. To test the impact of virtual machine configurations on the fault injection results, we also conduct 3,000 fault injections where VMs are located on a shared core, and 6,000 injections with full virtualization. In this section, we first discuss the overall results of fault injections with different configurations, and then analyze five aspects of error behaviors in detail: crash type, fault location, crash latency, failure location, and symptoms. We highlight our observations when discussing error behaviors.
Overall Results of Fault Injections
Fig. 4 shows the overall results of fault injections with different configurations. Compared with random injection, top function injection raises the share of crash cases available for analysis by about 19 percentage points. We also find that top function injection allows us to discover more error behaviors, such as all-VM crashes, that do not appear in random injection. Obtaining more interesting results for analysis with limited resources is very important for a fault injection framework. In addition, we find that it is difficult to use random injection to identify active CPU regions. We only observe interesting results from three benchmarks (postmark, freqmine, and x264). For the other three benchmarks, random injections fall into regions where the Dom0 CPU is idle. Because these benchmarks are more CPU-intensive, they generate less hypervisor and Dom0 activity than the other benchmarks, which makes it difficult to conduct fault injections. Top function injection uses breakpoints to track the most frequently used functions, so it can easily identify when those functions are active.
After examining the results of full virtualization, we do not find new types of cases beyond those found in the para-virtualization results. Thus, in the rest of this section, we mainly use the results from para-virtualization to study error propagation. Note that there are some differences in the percentage of crash cases between full virtualization and para-virtualization. We believe this is largely caused by the different injection regions, which are randomly selected.
Analyzing Crash Type
Analyzing the error propagation behaviors in terms of crash types is critical, as the detection and recovery mechanisms may vary depending on the crash type. In this paper, we classify the results into four types: 1) System crash, which represents a crash or hang in the host system. This is the worst case, since the hypervisor, the control VM, and all guest VMs are affected. 2) One-VM crash, where a failure leads to a crash in one DomU, but Dom0 and the other DomUs are not affected. 3) All-VM crash, where a failure causes all the DomUs to crash. In this case, the hypervisor and Dom0 are still running, and no system reboot is initiated. 4) Correct, where the injected fault does not result in a visible crash in Dom0 or the DomUs. We further investigate these cases in terms of application correctness and find that the exit status and results of the applications are correct. Therefore, we consider these outcomes to be correct.
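The four outcome types can be expressed as a small decision function. This is a sketch under assumptions: the inputs (whether the host is still alive and which DomUs crashed) are hypothetical summaries that would have to be extracted from the console logs, and the function assumes the two-DomU setup used in our experiments, where any crash that does not hit every DomU hits exactly one.

```python
def classify_outcome(host_alive, domu_crashed):
    """Map an observed run to one of the four outcome types.

    host_alive:   True if the hypervisor and Dom0 are still running.
    domu_crashed: list of booleans, one per DomU (True means it crashed).
    """
    if not host_alive:
        return "system crash"   # host crash or hang takes everything down
    if domu_crashed and all(domu_crashed):
        return "all-VM crash"   # every DomU is gone, host keeps running
    if any(domu_crashed):
        return "one-VM crash"   # with two DomUs, exactly one has crashed
    return "correct"            # no visible crash in Dom0 or the DomUs


print(classify_outcome(True, [True, False]))
print(classify_outcome(True, [True, True]))
```

Checking the host first reflects the severity ordering of the types: a host crash subsumes any guest failures observed in the same run.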
Clearly, one can see from Fig. 4 that our simulation-based injection framework is able to carry out the fault injection experiments effectively and produce more failure samples for analysis. Specifically, our top-function injections yield crash cases in 53.3 percent of runs on average, about 19 percentage points higher than random injection (34.1 percent). In total, 304 one-VM crash cases are discovered in the top-function injections, whereas only two such cases are found in random injections. Additionally, two all-VM crash cases are identified in the top-function injection results that are not discovered by random injection; as we show in Section 6.7, they are critical for understanding error propagation. Furthermore, our framework delivers more consistent results: the crash percentages differ by less than 5 percent among different benchmarks. Note that some randomness in error injection may be beneficial for discovering uncommon cases, but it requires a significantly larger sample size, which we intend to collect in the future.
The rightmost side of Fig. 4 shows the error injection results when running Xen in fully virtualized mode. Full virtualization leverages hardware virtualization support, and the hypervisor behaves differently from the para-virtualized environment in handling the transitions between Dom0/DomUs and the hypervisor. Specifically, the transitions are handled with the help of VMX (Intel) or SVM (AMD) instructions such as vmread and vmwrite. Therefore, instead of using top functions, we intentionally target these special instructions as well as normal instructions. As one can see from Fig. 4, the errors injected in full virtualization produce similar distributions of crash types across the various benchmarks.
Observation #1: Soft errors in the hypervisor may cause various types of failures. More than half of the errors can lead to system-wide crashes that affect all the VMs running on the shared host, and soft errors may even cause all-VM failures without triggering a system-wide reboot.
Analyzing Fault Location
Fault location is defined as the function where a fault is injected. Recall from the earlier discussion that we classify the Xen functions into four subsystems; here we aim to understand the reliability characteristics of each subsystem. In Fig. 5, the left side shows the percentages of all results (using the left y-axis), and the right side filters out the correct cases and shows the percentages of only the crash cases (using the right y-axis).
Observation #2: All Xen subsystems are vulnerable to soft errors with more than 40% of incorrect results. And the CM and HC subsystems have the highest percentages (over 58 percent) of crash cases, while the MM with relatively lower probability at 44.5 percent. Interestingly, the DM subsystem is accounted for more one-VM and all-VM crash cases. Nevertheless, this result shows that no one subsystem is less critical than others, thus a good fault tolerance mechanism for a hypervisor should provide a complete coverage over a wide range of kernel functions.
Analyzing Crash Latency
Crash latency is a way to measure the impact of errors. Should a fault not be detected quickly, it would have a higher probability of corrupting the hypervisor and VM states, which in turn could result in full system failure and costly recovery. Crash latency is calculated as the number of instructions executed between the injection of a fault and the capture of a fatal exception by the hypervisor. Because instructions run simultaneously on multiple cores, we count only the instructions on the core where the fault is injected. In addition, since several exceptions may be triggered before the system crashes, we use the first exception for the calculation (a conservative estimate of crash latency). Fig. 6 shows the cumulative distribution of the latency of system crash cases.
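The latency definition above can be sketched in a few lines. This is a minimal illustration, not the framework's implementation: the `TraceEvent` record, its `kind` labels, and the `crash_latency` helper are hypothetical names, and the sketch assumes the simulator emits a single time-ordered event stream covering all cores and that the first fatal exception on any core ends the measurement.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TraceEvent:
    core: int   # core on which the event occurred
    kind: str   # "insn", "inject", or "fatal_exception" (hypothetical labels)


def crash_latency(trace: Iterable[TraceEvent]) -> Optional[int]:
    """Count instructions on the injected core between the injection and
    the first fatal exception. Returns None if no crash is observed."""
    injected_core = None
    count = 0
    for ev in trace:
        if injected_core is None:
            # Nothing is counted until the fault has been injected.
            if ev.kind == "inject":
                injected_core = ev.core
            continue
        if ev.kind == "fatal_exception":
            # Only the first exception is used (conservative estimate).
            return count
        if ev.kind == "insn" and ev.core == injected_core:
            count += 1
    return None


# Made-up trace: fault injected on core 0, first fatal exception on core 1.
example_trace = [
    TraceEvent(0, "insn"),
    TraceEvent(0, "inject"),
    TraceEvent(0, "insn"),
    TraceEvent(1, "insn"),            # other core, not counted
    TraceEvent(0, "insn"),
    TraceEvent(1, "fatal_exception"),
    TraceEvent(0, "insn"),            # after the first exception, ignored
    TraceEvent(0, "fatal_exception"),
]
example_latency = crash_latency(example_trace)
```

Counting only the injected core keeps the metric independent of how many other cores happen to be busy, at the cost of understating the total work executed during propagation.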
Observation #3: Most crash cases have relatively short latency (≤ 100 instructions): 88 percent of system crash cases and 60 percent of one-VM crash cases. A well-designed low-cost checkpoint mechanism may be needed to protect the system in these cases. However, observation #4: a considerable number of crashes still have long latency (> 10,000 instructions), about 5 percent in both cases. On a multicore processor, the total number of instructions executed on all cores during error propagation can be significant, which in turn poses a challenge for long-term checkpoint management and error recovery.
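The two thresholds used in these observations are points on an empirical cumulative distribution. The sketch below shows how such shares could be computed from a list of measured latencies; the helper name `fraction_at_most` and the sample values are hypothetical and are not data from this study.

```python
from bisect import bisect_right
from typing import Sequence


def fraction_at_most(latencies: Sequence[int], threshold: int) -> float:
    """Empirical CDF value: share of crashes with latency <= threshold."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    return bisect_right(ordered, threshold) / len(ordered)


# Made-up latencies (in instructions), for illustration only.
sample = [3, 40, 100, 101, 20000]
short_share = fraction_at_most(sample, 100)            # latency <= 100
long_share = 1.0 - fraction_at_most(sample, 10_000)    # latency > 10,000
```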
Analysis of Failure Location
Failure location is the domain where a fault manifests as a fatal error symptom, such as a fatal exception or an infinite loop. When a failure occurs, the error handling routines in the hypervisor and Linux kernel may print out a debug message that we examine to pinpoint the failure location. Examining the crash location helps us evaluate the effectiveness of the built-in fault detection mechanisms in Xen and Linux. Recall from the discussion in Section 1 that we have identified five possible error propagation paths, leading to five failure locations: 1) Immediate (Imm); 2) Same hypervisor function; 3) Another hypervisor function; 4) Dom0 and 5) DomU, where the faults propagate to the Dom0 and DomU kernel respectively. In general, Imm has the shortest error propagation latency, and Dom0 and DomU failures usually have much longer latency. We analyze all the system crashes in Fig. 7. The left side of the figure shows the results grouped by the four Xen subsystems, and the right side by the benchmarks.
Observation #5: A fault leading to a system crash becomes visible quickly about half of the time, and propagates down the execution path the rest of the time. Specifically, in nearly 50 percent of the cases, a fault shows visible fatal symptoms immediately after the injection. In this case, since the failures are detected quickly, chances are that the system states are intact and system recovery is likely. Most (70 percent) of HC errors (in hypercall instructions serving system calls from guest VMs) fall into this category.
On the other hand, on average 51.2 percent of system crashes involve error propagation to the same or a different hypervisor function. More importantly, observation #6: error propagation is rampant in the MM subsystem (memory management related instructions), followed by the CM and DM subsystems (CPU/domain management). Although the percentage is smaller than in the other cases, we note that in 2.5 percent of the cases the failures happen in the Dom0. This result also explains the previous observation on long crash latency, and indicates the difficulty of system recovery in the event of corrupted states in either the hypervisor or Dom0. The in-depth analysis of those cases will be conducted shortly in Section 6.7. The distributions of failure locations among benchmarks are relatively consistent with small variance.
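The five failure locations can be expressed as a small decision procedure. The sketch below is only an illustration: the function name, the domain labels, and in particular the rule that a latency of zero instructions marks an Imm failure are assumptions made here, not definitions taken from the framework.

```python
from enum import Enum


class FailureLocation(Enum):
    IMM = "Imm"
    SAME_FUNCTION = "Same hypervisor function"
    OTHER_FUNCTION = "Another hypervisor function"
    DOM0 = "Dom0"
    DOMU = "DomU"


def classify_failure(failure_domain: str, injected_fn: str,
                     failing_fn: str, latency: int) -> FailureLocation:
    """Map one crash record to a failure location.

    failure_domain is "xen", "dom0", or "domu" (hypothetical labels taken
    from whichever component printed the crash message)."""
    if failure_domain == "dom0":
        return FailureLocation.DOM0
    if failure_domain == "domu":
        return FailureLocation.DOMU
    if failure_domain != "xen":
        raise ValueError(f"unknown domain: {failure_domain!r}")
    if latency == 0:
        # Assumption: the exception fires at the injected instruction itself.
        return FailureLocation.IMM
    if failing_fn == injected_fn:
        return FailureLocation.SAME_FUNCTION
    return FailureLocation.OTHER_FUNCTION
```

For example, a crash reported by Xen 39 instructions after an injection, in the function that received the fault, would be classified as `SAME_FUNCTION` under these assumptions.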
Analyzing the Symptoms
Symptoms are the visible abnormal behaviors, such as fatal exceptions and page faults. Here we use a list of well-defined symptoms to indicate the types of error outcomes, which are also important for designing the detection and recovery mechanisms. In Table 2, we have identified 15 symptoms in total: five are generated by Xen, and ten by the Linux kernel.
We analyze the symptoms for all system crash cases. We do not include the one-VM and all-VM crash, for the number of these cases is relatively smaller, and most of them do not have visible symptoms. Fig. 8 shows the percentages of the symptoms divided by failure locations. Observation #7: The majority (96 percent) of crashes that are captured by the symptoms are related to three exceptions, namely, general protection fault (Xen-GPF), fatal page fault (Xen-FPF), and triple fault exceptions (TF), which are usually associated with the faults in addresses, stacks and segments in the hypervisor. This fact points to a potential solution of using such symptoms for error detection, as suggested in [3] , [25] , [26] . However, as we will discuss later, significant modifications and extensions will be required for a symptom based approach to work in a hypervisor.
Another interesting observation is that there are only four types of symptoms in the Imm failure case, compared to nine types reported in the Dom0 case. Clearly, observation #8: as an error spreads, a large number of system modules will be affected, leading to a variety of possible symptoms on multiple layers. Once again, this calls for agile error detection mechanisms to capture errors early. This also necessitates the error tolerance techniques that are able to handle various error propagation scenarios. In the following section, we study the error propagation and propose several guidelines for designing effective fault tolerance mechanisms for hypervisors.
Analysis of Error Propagation
Understanding error propagation is very important for designing effective fault tolerance mechanisms. In the previous sections, we categorized faults by crash types, symptoms, latency, and failure locations.
In this section, we analyze crash cases and present eight representative cases with examples shown in Table 3. We explain each case in detail below.
Corruption in registers (example #1). In this example, the injected fault changes the register rbx to an invalid address. As a result, the fatal page fault exception is triggered when this address is used by the current instruction. Since the crash is captured right after the injection, only the register that carries the fault is corrupted. If carefully managed, one may be able to recover the states of the hypervisor, as well as those of the Dom0 and DomUs. All Imm failures belong to this category.
Corruption in local variable (#2). In this case, the fault propagates from a register to the address of the local variable in the hypervisor function. The general protection exception is triggered at the first time this variable is used. The propagation latency is 39 instructions, and no other states are affected.
Corruption in function stack (#3). Here the stack pointer is changed by the fault to a valid but incorrect address. Although this does not affect the data in the current function, the caller function's stack and data are corrupted. The exception is triggered right after the function return. In this case, the whole stack of the caller function is corrupted.
In the above three examples, the failures all happen within the hypervisor. In the following examples (with the exception of the last one, all-VM crash), although the errors originate in the hypervisor, the failures happen in either the Dom0 or a DomU.
Corruption in the hypercall (#4). Here the fault changes the hypervisor data, and subsequently alters the control path. As a result, the hypervisor exits to the Dom0 earlier than the correct instruction trace. Therefore, there are multiple data corruptions in the hypervisor data including the return values of the hypercall. But the hypervisor returns to the Dom0 in a short period of time (60 instructions), and the Dom0 states in the hypervisor stack are not corrupted. A crash will likely happen when the Dom0 uses the incorrect return values.
Corruption in the shared memory (#5). In this case, the hypervisor is serving a request from a DomU (the hypervisor request handler is running on the same physical CPU as the DomU). The fault changes the value of the string length, resulting in additional string mov operations. The repeated mov will change the stacks, resulting in corrupted values in shared memory data. During this process, the Dom0 initiates another request to the hypervisor to access shared memory data, and the fault in stack segments is reported by the Dom0 kernel. We note that in this case the fault propagates across both domains and CPUs. The error propagation is illustrated in Fig. 9(a) .
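To make the mechanism in case #5 concrete, the toy model below shows how a single bit flip in a copy length lets a repeated move run past its destination buffer into adjacent data. It is a sketch under stated assumptions: the memory layout, buffer sizes, bit position, and the helper names `flip_bit`, `rep_mov`, and `run_copy` are all hypothetical and do not reproduce the actual Xen code path.

```python
def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return value with one bit inverted, as a single-bit soft error would."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def rep_mov(memory: bytearray, src: int, dst: int, count: int) -> None:
    """Byte-wise forward copy, loosely modelled on a repeated string move."""
    for i in range(count):
        memory[dst + i] = memory[src + i]


def run_copy(count: int) -> bytes:
    """Copy `count` bytes into an 8-byte buffer at offset 32 and return the
    8 bytes that sit directly behind that buffer."""
    memory = bytearray(64)
    memory[0:32] = b"\xAA" * 32    # source string data
    memory[40:48] = b"\x11" * 8    # neighbouring data, e.g. shared with a domain
    rep_mov(memory, src=0, dst=32, count=count)
    return bytes(memory[40:48])


intended = 8                        # correct length: fits the buffer exactly
corrupted = flip_bit(intended, 2)   # flipping bit 2 turns 8 into 12

neighbour_ok = run_copy(intended)    # neighbouring data untouched
neighbour_bad = run_copy(corrupted)  # first 4 neighbouring bytes overwritten
```

The corrupted copy raises no exception by itself; the damage becomes visible only when another party later reads the overwritten data, which matches the cross-domain, cross-CPU propagation described above.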
Corruption in Dom0 states (#6). In this example, when the Dom0 issues a hypercall, the hypervisor will save the states of the Dom0 to its stack and restore them after the hypercall is served. In this case, the fault is injected into the instruction that operates on the stack containing the Dom0 states. Specifically, it alters the return address of the Dom0. Therefore, after the system returns to the Dom0 context (vmexit), the invalid instruction address will trigger an exception. We track the usage of the fault, and find that although this case has a long latency (over 200K instructions), the return address is not used for address or computation. Therefore the hypervisor state is correct and only Dom0 states are corrupted. The propagation is illustrated in Fig. 9(b).
Corruption in application (#7). Here the fault changes the branch outcome and corrupts the data in the hypercall. When the context is switched back to the application, the incorrect return address is loaded and triggers the failure in the DomU. The failure is isolated in the DomU and will cause this DomU to crash. This case is different from the previous case #6 in that the hypercall data are corrupted and the error propagates through the hypercall and DomU kernel to the application. But the crash is contained within the DomU.
[Table 3, fragments recovered from the extracted text. Case #5: The loop count (dcx) of the repeated mov instructions is altered, and an incorrect number of strings are moved to the extra segment. When the Dom0 CPU tries to access the segment, the failure occurs, as shown in Fig. 9. Case #8: The fault changes the control path and the CPU goes into an infinite loop. Remaining cells: All-VM Crash; Other Function; N/A; Unknown; unknown.]
All-VM crash (#8). There are two such cases where the fault causes the failure of both DomUs, while the Dom0 seems to be functioning. We examine the instruction trace and find that a fault alters the control flow in the hypercall, and the hypervisor eventually enters an infinite loop. Hardware support will be needed to address this issue.
To study the error propagation behaviors, we pin two DomUs and the Dom0 onto separate cores. This setup helps us identify inter-core and inter-domain error propagation behaviors such as case 5, and it is often used in real environments to provide better performance isolation. Note that it is also possible that the domains are not pinned onto separate cores. To evaluate the potential effect of the two configurations, we set up another set of fault injections by deliberately pinning all domains to one core. Here we conduct 3,000 injections using the same benchmarks in paravirtualized environments. The results show that, on average, 45 percent of the cases end in a system crash, which is very close to what we observe in the separate-core configuration. By further examining the results, we observe similar error propagation patterns, which is not unexpected given the hypervisor behaviors. When a shared core is used, multiple domains are multiplexed on the core in a time-sharing fashion, and each domain has its own time slice to run. In both cases (whether domains reside on separate or shared cores), when a fault is injected, the core is running the hypervisor code that handles the request from a domain or a device. From the perspective of our fault injection process, the two configurations have similar core states at a particular time instance. As a result, we observe similar error propagation in both cases.
In summary, we have identified several important error propagation behaviors: 1) an error may propagate from one CPU to another CPU through shared data structures in the hypervisor; 2) an error may propagate from the hypervisor to the Dom0/DomU kernel, and potentially to the applications; 3) an error may propagate to the Dom0/DomU through hypercall return values, shared memory, and stacks containing virtual machine control states.
Applicability of Experimental Setup
The experiments in this section are conducted using benchmarks rather than real applications. Virtualization environments are now commonly used in cloud computing data centers, which host various cloud applications. These applications may not have exactly the same behaviors as those benchmarks. However, we believe our results and analysis are still valid and can be used for understanding virtualization reliability. We explain the reasons below.
Our goal is to evaluate the generic hypervisor behaviors under soft errors rather than application behaviors. In fact, from the hypervisor point of view, application activities are abstracted to only certain well-defined hypervisor behaviors, such as CPU and memory management, through the interfaces provided by the hypervisors (e.g., hypercalls or VM Exit reasons). Therefore, while the hypervisor may host various applications, the hypervisor activities are constrained, and these activities are the focus of this paper. As long as the selected workloads can stress the hypervisor in various activities (i.e., CPU, memory and I/O), they are essentially equivalent for the purpose of our experiments.
That being said, it is possible that a data center hosts only one (or a few) types of applications, which may particularly stress a portion of hypervisor activities, e.g., CPU intensive only. However, this depends highly on the specific data center, and differs case by case. In this work, we choose to study generic scenarios rather than limiting our study to a specific scenario or application.
Without loss of generality, we choose a suite of commonly used benchmarks in this study, so that we can have a high coverage of all types of hypervisor activities. If there is a need to understand the hypervisor reliability under a particular scenario, our fault injection framework can be customized for that purpose. To demonstrate this, we add a new experiment of fault injections on the webserver application in FileBench [27]. We first profile the hypervisor utilization when the web server is running inside one DomU. The CPU utilization of the hypervisor and Dom0 is listed in Table 4. The web server is an I/O-intensive application, which is similar to the benchmarks used in Fig. 2. We can see that the profiling result of the web server also resembles the results in Fig. 2.
We also conduct about 860 fault injections when web server is running inside of DomU. Table 5 shows two types of injection outcomes: system crash (46 percent) and correct (54 percent). After examining the system crash cases, we find that crash location is either in the hypervisor or Dom0.
DISCUSSIONS
The hypervisor is at the heart of a virtualized system. An error resilient hypervisor will significantly improve the reliability of the entire system. Given the above observations and study, we believe that an error resilient hypervisor shall possess the following three properties: 1) Lightweight. As the hypervisor is in the critical path of many system requests, any performance degradation shall be avoided because it would negatively affect the performance of guest VMs and user applications; 2) Strong Inner-Hypervisor Protection, which is desired for dealing with the most severe (system-wide) crashes that lead to the failures of guest VMs. The error coverage shall be sufficiently broad to protect a wide range of hypervisor functions; and 3) Powerful Inter-Domain Error Detection, to minimize error propagation to the control and guest VMs.
In this section, we study existing fault detection and tolerance techniques and analyze their applicability to a hypervisor and the effectiveness in the virtualized environment. Table 6 summarizes the pros and cons of representative methods.
Generally speaking, fault detection techniques have been proposed at the microarchitecture level [29], [30], [32], compiler level [5], [33], software level [34], [35] and cross-layer [3], [25], [26]. With the exception of hardware-only approaches, most software based techniques are designed to detect errors in applications, rather than in hypervisors. OS level techniques such as [36], [37] have also been developed to deal with hardware failures such as device failures.
Hardware dual modular redundancy (DMR) is one of the most widely used techniques. The approaches in [29], [30] are based on the idea of using a spare hardware module for fault detection, as multicore and simultaneous multithreading (SMT) have become common in modern processors. Specifically, Redundant Multithreading (RMT), which leverages simultaneous multithreading, is proposed in [28], [29], where faults can be detected by comparing the outputs of two threads. In addition, [30] proposes a chip-level redundant threading (CRT) technique that leverages the multicore architecture to reduce the design complexity and performance overhead.
DMR can detect hardware faults with nearly 100 percent coverage and very short detection latency. If used in virtualized systems, DMR should be able to detect hypervisor crash cases, including Imm, Same Function and Other Function, so those cases are marked as detectable in the table. Also, with DMR, soft errors in hypervisor execution are unlikely to propagate to Dom0 or other DomUs, so those cases are listed as not applicable. However, due to high area and energy overheads, DMR is often not found in commodity processors. For cost savings, most cloud providers build their data centers using commercial off-the-shelf (COTS) components, in some cases one or two generations old.
Table 6 (partial; rows recovered from flattened text, with symbols kept as extracted and lost cells marked):
Method | Perf. overhead | (header lost) | Imm | Same Function | Other Function | Dom0 | DomU
(name lost) [29] | unreported | No | @ | @ | @ | N/A | N/A
CRT [30] | unreported | No | @ | @ | @ | N/A | N/A
DDFV [31] | 1.8% | No | * | * | * | * | *
Argus [32] | 4% | No | * | * | * | * | *
Compiler: SRMT [33] | 19% | N/A | * | * | * | ? | ?
Compiler: DAFT [5] | 38% | N/A | * | * | * | ? | ?
Hardware signature based methods have been proposed to check the correctness of various architectural states. For example, DDFV [31] proposes a dynamic control/data flow checking mechanism, where faults in the processor can be detected by comparing the dynamic control/data flow with the static signatures in the binary. However, the feasibility of building and verifying dynamic control/data flow during hypervisor execution deserves further investigation. The hypervisor handles various events and requests from hardware and guest VMs, such as exceptions, interrupts and system calls, whose signatures can be very complex. Argus [32] proposes a comprehensive hardware checking mechanism based on DDFV to ensure the correctness of control/data flows, computations and memory, with a small performance overhead (< 4 percent) and a reasonable area overhead (< 17 percent). It provides better coverage than DDFV, but suffers from the same problem. Moreover, Argus is designed for simple cores rather than speculative out-of-order cores, making it unsuitable for commodity servers used in data centers.
It is worth noting that although a large body of hardware based fault tolerance solutions has been proposed [29], [30], [32], sadly few commodity servers have adopted them, with exceptions such as IBM S/390 [38] and HP NonStop systems [39]. Thus, it is unlikely that hardware based solutions will appear in future data centers, because the prominence of cloud computing is largely based upon easy virtualization on cost-effective commodity servers. So, extensive modifications to the current system are likely required to detect the failure cases listed in the table. This once again highlights the need for an error resilient hypervisor.
Compiler based RMT such as SRMT [33] and DAFT [5] requires no hardware modification, but comes with high performance overhead (38 percent in DAFT and 19 percent in SRMT). Compiler based methods re-compile the source code of an application, leaving things such as libraries unprotected. Both SRMT and DAFT use a redundant thread to verify computations and data within the applications. The idea of using compiler to assist fault tolerance in the hypervisor is attractive, but extensive efforts will be required to handle low-level system functions such as interrupt/exception handling, atomic operations and multithreaded processing.
Software process-level redundancy has been proposed in [34], [35], which normally has lower performance overhead than compiler based approaches. However, their implementations rely on user-level process facilities, such as fork() [34] and ptrace [35], which are hard to adopt in OS kernels and hypervisors. Therefore, it is difficult for these techniques to detect the failure cases considered here.
Software symptom based detection monitors abnormal program behaviors for low-cost error detection [3], [25], [26], often leveraging symptoms that are already built into the OS or hardware, such as exceptions or branch mispredictions. This approach can have false positives and limited coverage. SWAT [26] uses pipeline flushing to recover from failures. Therefore, to successfully recover from faults, the detection latency should be less than 100 micro-opcodes (not x86 instructions) in current superscalar processors, which is shorter than the latency of most of the hypervisor crashes that we have observed. Further, the crashes involving DomU and Dom0 have even longer delays, and if other operating system events (e.g., context switches) happen before the crashes, the possibility of a successful recovery will be further reduced. Another symptom based method, Shoestring [3], develops a probabilistic approach for fault detection, in which instruction duplication is used to improve the fault detection coverage. However, Shoestring again requires compiler analysis of applications. Therefore, we consider these techniques suitable for Imm cases. For Same Function and Other Function cases, they will likely detect and recover from the errors. The Dom0 and DomU cases can still be detected but most likely will not be recovered.
Based on our results and analysis, we discuss two potential techniques that can be leveraged for designing fault tolerance for the hypervisor. The two techniques differ in cost and capability, and can be selectively used depending on the specific reliability requirements of the actual virtualized system. Currently we are designing a prototype based on the observations from this study.
Low-cost behavior based fault detection: Despite the significant extensions required, we believe that symptom based error detection methods can potentially be utilized to build an error resilient hypervisor. Such methods require no special hardware support (i.e., they can run on COTS processors). However, symptoms cannot always detect faults within acceptable latencies (e.g., in cases 4-7 in Table 3, faults have already propagated to Dom0 or DomU). For these cases, even if symptoms are eventually triggered, Dom0 or DomU may have already crashed. Therefore, new techniques are required to shorten the detection latency so that errors can be detected before Dom0 or DomU crashes. Our case studies show that control flow might be altered in those cases, which can be used as a sign for detection. This kind of control flow change is different from invalid control flow, where the instruction address of the branch output is not a legitimate branch target or not a valid instruction. In our case, the changes in control flow are valid but incorrect. That is, the instruction address of the branch output is a legitimate branch target, but not the correct one based on the branch condition. Previous work usually focuses on detecting invalid control flow (the former), but not incorrect control flow (e.g., cases 4, 5, and 7). Incorrect control flow is an early sign of errors in the hypervisor, which can be used for shortening detection latency. Our analysis allows us to discover these abnormal behaviors for error detection. In our follow-up work [40], we design a soft error detection framework, Xentry, based on our analysis. Xentry detects errors by monitoring these abnormal behaviors (including incorrect control flow) with very low performance overhead and short detection latency.
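The distinction between invalid and incorrect control flow can be stated as a simple check on one conditional branch. The sketch below is purely illustrative and is not how Xentry works: the function name, the set of legitimate targets, and the availability of an independently re-evaluated branch condition are all assumptions made for the example.

```python
from typing import Set


def classify_branch(taken_target: int, legitimate_targets: Set[int],
                    condition: bool, target_if_true: int,
                    target_if_false: int) -> str:
    """Classify the observed outcome of one conditional branch.

    Returns "invalid" if the target is not a legitimate branch target,
    "incorrect" if it is legitimate but disagrees with the branch
    condition, and "correct" otherwise."""
    if taken_target not in legitimate_targets:
        return "invalid"
    expected = target_if_true if condition else target_if_false
    return "correct" if taken_target == expected else "incorrect"


# Hypothetical branch with two legitimate successors.
targets = {0x1000, 0x1040}
outcome_invalid = classify_branch(0x1003, targets, True, 0x1000, 0x1040)
outcome_incorrect = classify_branch(0x1040, targets, True, 0x1000, 0x1040)
outcome_correct = classify_branch(0x1000, targets, True, 0x1000, 0x1040)
```

A checker that only knows the set of legitimate targets would accept the second outcome, which is why valid-but-incorrect control flow needs additional information about the branch condition.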
Redundancy based fault tolerance: Redundancy can potentially be leveraged to provide an integrated solution for both error detection and recovery. To detect soft errors in CPUs, the executions must be duplicated for error checking. To recover from errors, triple redundancy might be required to vote for the correct results. Compared with behavior based fault detection, redundancy requires more resources and is challenging to implement at the software level. However, it can provide stronger protection to the hypervisor. An example of this approach can be found in DualVisor [41].
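The difference between duplicated and triplicated execution can be illustrated with a short sketch. It is a generic example of result comparison and majority voting, not the DualVisor design; the function names `dmr_check` and `tmr_vote` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def dmr_check(first: Hashable, second: Hashable) -> bool:
    """Dual redundancy: a mismatch reveals an error, but it cannot tell
    which of the two copies is the faulty one."""
    return first == second


def tmr_vote(results: Sequence[Hashable]) -> Hashable:
    """Triple redundancy: return the value produced by at least two of
    the three copies, masking a single faulty copy."""
    if len(results) != 3:
        raise ValueError("expected exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one copy disagrees")
    return value


detected = not dmr_check(7, 9)   # duplication detects the mismatch
recovered = tmr_vote([7, 7, 9])  # voting also recovers the majority value
```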
CONCLUSION
In this paper, we propose an efficient fault injection framework to evaluate the reliability of the hypervisor. Utilizing a simulation based method, we have conducted over 46,000 injections on the most frequently used functions in the Xen hypervisor. Compared with random injections, our framework achieves 19 percent more crash samples for error analysis. Moreover, a number of interesting cases have been discovered through our experiments. We have conducted in-depth case studies on undetected cases to analyze the soft error propagation. Finally, we present a set of practical guidelines for designing an error resilient hypervisor.
H. Howie Huang received the PhD degree in computer science from the University of Virginia. He is an associate professor in the Department of Electrical and Computer Engineering, George Washington University. His research interests include the areas of computer systems and architecture, including cloud computing, big data, and high-performance computing. He received the NSF CAREER Award, NVIDIA Academic Partnership Award, and IBM Real Time Innovation Faculty Award.
