Providing application-aware reliability through OS/hypervisor-level techniques by Wang, Long
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
© 2010 Long Wang 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
 
 
 
 
PROVIDING APPLICATION-AWARE RELIABILITY THROUGH  
OS/HYPERVISOR-LEVEL TECHNIQUES 
 
 
 
 
 
 
 
BY 
 
LONG WANG 
 
 
 
 
 
 
 
DISSERTATION 
 
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Electrical and Computer Engineering 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 2010 
 
 
 
Urbana, Illinois 
 
 
      Doctoral Committee: 
 
           Professor Ravishankar K. Iyer, Chair 
           Associate Professor Steven S. Lumetta 
           Assistant Professor Madhusudan Parthasarathy 
           Assistant Professor Shobha Vasudevan 
 
ABSTRACT 
 
Operating systems and hypervisors enable the collection and extraction of rich information on 
application and system execution characteristics. This thesis describes a Reliability MicroKernel 
(RMK) architecture, which provides an infrastructure that enables the design and deployment of 
software modules for providing application-aware error detection and recovery.  
The purpose of the RMK is to provide an automatic approach for low-latency crash/hang 
detection and rapid recovery via checkpoint. We first demonstrate how the RMK works in a native 
system and then enhance the RMK to work in VMs. In a native system, the RMK is installed as a 
device driver, while in a virtualized system, the RMK is both installed as a device driver in VMs 
and deployed as a hypercall (which is like a system call) in a hypervisor. Our approach is 
transparent to applications and VMs, i.e., the kernel source code of a native system or of a VM 
does not need to be modified or recompiled.  
The implemented RMK modules include OS/application crash detection, system hang detection, 
and transparent checkpoint. Traditionally, an external hardware watchdog is used to force a 
system reboot whenever the watchdog is not reset within a predefined timeout interval. The 
detection latency might be significant because the timeout interval for resetting the watchdog 
timer is usually a matter of seconds to reduce false alarms. The approach in this thesis enables 
low-latency OS-hang detection (within hundreds of milliseconds or less) by measuring the count 
of instructions executed between two consecutive context switches and checking if the count 
exceeds a predefined threshold value. 
The RMK is enhanced to support virtualized environments. Specifically, we present the 
description, implementation, and experimental assessment of VM-µCheckpoint, a VM 
checkpointing framework to protect both the guest OS and applications against runtime errors. 
Compared with the existing VM checkpoint techniques, our VM-µCheckpoint has small overhead 
and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints 
per second) to reduce the recomputation necessary when recovering a VM from a failure. The key 
point of VM-µCheckpoint is that we do an incremental checkpoint by considering the whole 
memory of the protected VM as part of the checkpoint.  
The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 
processor and is also implemented in the Xen VMM. (The Xen hypervisor is recompiled for 
installing RMK, but the OS of a native system or a VM is not recompiled.)  
Error injection experiments show that our RMK detects all the crashes and system hangs, and 
VM-µCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental 
evaluation of the RMK using real-world applications shows that we achieve high coverage and 
low false-positive rates for error detection (e.g., no false positives for system hang detection) as 
well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in 
VM-µCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals).  
We also apply a formal method and analytical/probabilistic models to verify the capability of our 
system hang detection and to study the availability enhancement provided by the RMK.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
To Weili, Father, and Mother 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ACKNOWLEDGMENTS 
 
I am heartily thankful to my adviser, Prof. Ravishankar K. Iyer, whose guidance, supervision, and 
support were indispensable to the completion of this dissertation. Prof. Iyer’s insightful advice and 
suggestions have enabled me to develop a clear understanding of the dissertation topic. I also 
thank Prof. Zbigniew Kalbarczyk for his guidance during my PhD program. This dissertation 
would not have been possible without the frequent discussions I had with Prof. Kalbarczyk. 
I also had many discussions and chats with my officemates Karthik Pattabiraman and Weining Gu 
while I worked on this dissertation. Arun Iyengar also helped me with the VM checkpoint project 
when I started this work as an intern at IBM Watson Research Center. Here I thank Karthik, 
Weining, and Arun for their great help. 
My PhD committee members Prof. Steven S. Lumetta, Prof. Madhusudan Parthasarathy, and Prof. 
Shobha Vasudevan have contributed valuable comments and opinions to my PhD work. Moreover, 
as a member of the DEPEND group, I received a great deal of help from the members of this 
group. The weekly group meeting gave me many hints and inspirations during the past years. I am 
grateful to my committee and to all the members of the DEPEND group for their help.  
Now I offer my thanks to Weili, my dear wife, and to my parents. It would have been impossible 
for me to complete the seven years of my PhD program and this dissertation without their 
understanding, encouragement, and patience throughout the entire time.  
Lastly, I thank all of those who supported me in any way during my PhD program and with this 
dissertation.  
TABLE OF CONTENTS 
 
 
CHAPTER 1 INTRODUCTION
1.1. Low-Latency System Hang Detection
1.2. High-Frequency Checkpoint of VMs
1.3. Approach Evaluation
1.4. Thesis Organization

CHAPTER 2 RELIABILITY MICROKERNEL ARCHITECTURE
2.1. Introduction
2.2. RMK Framework
2.3. RMK Implementation on Linux and Windows

CHAPTER 3 RMK MODULES
3.1. System Hang Detection Module (SHD)
3.2. Application Hang Detection Module (AHD)
3.3. Transparent Application Checkpoint Module (TAC)
3.4. Experimental Evaluation

CHAPTER 4 FORMALIZING SYSTEM BEHAVIOR FOR EVALUATING SHD
4.1. Introduction
4.2. Related Work
4.3. System Hang Detector
4.4. System Abstraction and Modeling
4.5. Modeling System Behavior
4.6. Implementation
4.7. Experimental Results
4.8. Conclusions

CHAPTER 5 CHECKPOINTING VMS AGAINST TRANSIENT ERRORS
5.1. Introduction
5.2. VM-µCheckpoint Design
5.3. Implementation
5.4. Experimental and Measurement Results

CHAPTER 6 INTEGRATING VM CHECKPOINTING INTO RMK
6.1. RMK Deployment in Virtualized Environment
6.2. Detecting and Recovering from VM/Application Crashes
6.3. Detecting and Recovering from VM Hangs
6.4. Error Injection Experiments

CHAPTER 7 MODEL-BASED ANALYSIS OF VM CHECKPOINTING
7.1. Checkpoint Corruption Model
7.2. Availability Model

CHAPTER 8 RELATED WORK
8.1. Related Work on RMK and System Hang Detection
8.2. Related Work on VM Checkpoint

CHAPTER 9 CONCLUSIONS

APPENDIX A MODELING COORDINATED CHECKPOINTING FOR LARGE-SCALE SUPERCOMPUTERS
A.1. Introduction
A.2. Related Work
A.3. Target System
A.4. Overall Composition of the Model
A.5. Modeling Computing and Coordinated Checkpointing
A.6. Modeling Correlated Failures
A.7. Experiment Setup
A.8. Experiment Results
A.9. Conclusions

APPENDIX B CHECKPOINTING OF CONTROL STRUCTURES IN MAIN MEMORY DATABASE SYSTEMS
B.1. Introduction
B.2. Target System Overview
B.3. ARMOR High-Availability Infrastructure
B.4. ARMOR-Based Checkpointing
B.5. Checkpointing Algorithms
B.6. Performance Evaluation
B.7. Related Work
B.8. Conclusions

REFERENCES
 
 
 
CHAPTER 1 
INTRODUCTION 
 
Operating systems enable the collection and extraction of rich information on application 
execution characteristics, including counts of executed instructions, program counter traces, 
memory access patterns, and OS-generated signals. Similarly, hypervisors enable monitoring of 
virtual machine (VM) execution and collection of information on characteristics of the execution. 
Such information can be exploited to design highly efficient, application-aware, error detection 
and recovery mechanisms that are transparent to applications or VMs. 
This thesis describes the design, implementation, and demonstration of a Reliability MicroKernel 
(RMK) architecture. The RMK provides an infrastructure that enables the design and deployment 
of software modules for providing application-aware error detection and recovery.  
The purpose of the RMK is to provide an automatic approach for low-latency crash/hang 
detection and rapid recovery via checkpoint. While some of the existing operating systems may, 
by design, provide reliability support, e.g., IBM AIX [1] and High-Availability Linux [2], their 
emphasis is on achieving system reliability rather than exploiting execution characteristics of 
applications or VMs for reliability improvements.  
We first demonstrate how the RMK works in a native system and then enhance the RMK to work 
in VMs. In a native system, the RMK is installed as a device driver, while in a virtualized system, 
the RMK is both installed in VMs as a device driver and deployed as a hypercall (which is like a 
system call) in a hypervisor. Our approach is transparent to applications and VMs; i.e., the kernel 
source code of a native system or of a VM does not need to be modified or recompiled.  
The implemented RMK modules include OS/application crash detection, system hang detection, 
and transparent checkpoint. The architecture exploits processor-level features (debugging and 
monitoring facilities available in the current generations of processors), OS-exported interfaces, 
and hypervisor features (e.g., virtual CPUs and shadow paging) to define a set of basic services. 
These basic services are called RMK pins, which are analogous to hardware pins in providing 
clearly defined functionalities and inputs/outputs. The pins are employed to design specific error 
detection and recovery mechanisms, referred to as RMK modules.  
The attributes of the currently implemented architecture allow the following: 
• Design and deployment of application-aware reliability techniques;  
• On-demand configuration and customization of reliability techniques; and  
• Platform independence of the RMK architecture and modules.  
In addition to the RMK design and implementation as an automatic approach for combining error 
detection, checkpoint, and error recovery, this thesis makes the contributions summarized in the 
following two sections.  
1.1. Low-Latency System Hang Detection 
The operating system may suffer from hangs due to poorly written drivers (e.g., unreleased locks). 
Traditionally an external hardware watchdog is used to force a system reboot whenever the 
watchdog is not reset within a predefined timeout interval. In this case, however, one cannot 
determine whether the OS has crashed/hung, or the heartbeat process, designated to reset the 
watchdog periodically, has crashed/hung. Also, the detection latency might be significant because 
the timeout interval for resetting the watchdog timer is usually a matter of seconds (to reduce both 
false alarms and the time overhead of the heartbeat).  
The approach in this thesis enables low-latency OS-hang detection. In a system executing a set of 
same-priority tasks, the count of instructions executed between two consecutive context switches 
is bounded. The underlying fault model for a system hang is that an operating system in a 
hang state does not relinquish the processor and does not schedule any processes. Under this 
fault model, if the system hangs, the instruction count grows beyond that bound. We therefore 
detect system hangs by measuring this instruction count and checking whether it exceeds a 
predefined threshold value. Hardware counters in the current generation of processors are used to 
count executed instructions. In a native system, we configure the processor to raise a non-maskable 
interrupt (NMI) when the threshold value is exceeded; in a virtualized environment, we instrument 
the hypervisor to periodically check whether the threshold value is exceeded. As the time slice 
allocated by the OS to a process between two consecutive context switches is tens or hundreds of 
milliseconds (100 ms for typical processes in Linux 2.6, and 20/40/60 ms in Windows XP), a 
system hang can be detected within hundreds of milliseconds.  
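The detection rule described above can be illustrated with a minimal user-space sketch in C. It is not the RMK implementation: the names shd_state, shd_init, shd_on_context_switch, and shd_hang_suspected are hypothetical, and the hardware instruction counter is replaced by a plain integer passed in by the caller. The sketch only shows the comparison of the instruction count since the last context switch against a threshold, in the style of the periodic check used in the virtualized case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical detector state (illustrative names, not from the RMK). */
typedef struct {
    uint64_t threshold;       /* max instructions allowed between context switches */
    uint64_t count_at_switch; /* counter value sampled at the last context switch */
} shd_state;

/* Initialize the detector with a nonzero threshold and the current counter value. */
static void shd_init(shd_state *s, uint64_t threshold, uint64_t counter_now) {
    assert(s != NULL && threshold > 0);
    s->threshold = threshold;
    s->count_at_switch = counter_now;
}

/* Called on every context switch: restart the measurement window. */
static void shd_on_context_switch(shd_state *s, uint64_t counter_now) {
    s->count_at_switch = counter_now;
}

/* Periodic check: true if more than `threshold` instructions have executed
 * since the last context switch. Unsigned subtraction keeps the difference
 * correct even if the counter wraps around. */
static bool shd_hang_suspected(const shd_state *s, uint64_t counter_now) {
    return (counter_now - s->count_at_switch) > s->threshold;
}
```

In a native system the same comparison would be performed by the processor itself, which raises an NMI when the counter overflows the programmed threshold; the polling form shown here corresponds more closely to the hypervisor-side check.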
1.2. High-Frequency Checkpoint of VMs 
Virtual machines play an increasingly significant role in today’s computing environment for IT 
services (e.g., web services, virtual desktops, and databases). Checkpoint and rollback of VMs is 
essential to ensure continuous service availability. Virtual machine monitors (VMMs), such as 
VMware and Xen, provide mechanisms to (a) save a VM state by stopping the VM and dumping 
the execution state into persistent storage and (b) migrate the VM to a remote node (e.g., [3]). 
This thesis presents the description, implementation, and experimental assessment of 
VM-µCheckpoint, a VM checkpointing framework to protect both the guest OS and applications 
against runtime errors. Advantages of using VM-µCheckpoint include (i) small overhead 
compared with the VM replica-based failover approach, (ii) alleviation of checkpoint corruption 
due to error-detection latency by taking advantage of knowledge of error detection latency (we 
deal with non-fail-stop errors), (iii) high checkpointing frequency—tens of checkpoints per 
second—which reduces the size of each increment when taking a checkpoint and reduces the 
recomputation when recovering a VM from a failure, and (iv) rapid recovery—within one 
second—compared to the stop-and-dump approach provided by VMMs.  
The key point of VM-µCheckpoint is that we perform an incremental checkpoint by considering the 
whole memory of the protected VM as part of the checkpoint. Specifically, if a memory page is 
updated during a checkpoint interval, we preserve the original state of the page before any write is 
done; the memory pages not updated during the checkpoint interval are then considered to be part 
of the VM checkpoint in place, which significantly reduces checkpoint overhead. This incremental 
checkpoint (i) leverages the fact that when a VM fails, its whole memory is still accessible by the 
hypervisor for recovery purposes and (ii) is based on the observation that the chance of a transient 
error in a VM crashing all the VMs and the hypervisor on the same node is relatively small (one case out of 
thousands of experiments that inject errors into VMs). 1  
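The page-preservation idea in the preceding paragraph can be sketched in C as follows. This is an illustrative model only, not the VM-µCheckpoint code: the vm_state structure, the tiny page size, and the functions vm_init, checkpoint_take, vm_write, and checkpoint_restore are all hypothetical, and guest writes are routed through an explicit function because this section does not describe how the prototype traps the first write to a page.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_PAGES 4
#define PAGE_SIZE 8   /* tiny pages keep the sketch readable */

/* Hypothetical model of a protected VM's memory. */
typedef struct {
    unsigned char mem[NUM_PAGES][PAGE_SIZE];   /* live VM memory */
    unsigned char saved[NUM_PAGES][PAGE_SIZE]; /* pre-write copies of dirty pages */
    bool dirty[NUM_PAGES];                     /* written since the last checkpoint? */
} vm_state;

static void vm_init(vm_state *vm) {
    memset(vm, 0, sizeof *vm);
}

/* Start a new checkpoint interval. The current memory itself is the
 * checkpoint, so no page is copied here; only the dirty flags are cleared. */
static void checkpoint_take(vm_state *vm) {
    memset(vm->dirty, 0, sizeof vm->dirty);
}

/* Every write goes through this function. The first write to a page in an
 * interval preserves the original page before modifying it. */
static void vm_write(vm_state *vm, int page, int offset, unsigned char value) {
    assert(page >= 0 && page < NUM_PAGES && offset >= 0 && offset < PAGE_SIZE);
    if (!vm->dirty[page]) {
        memcpy(vm->saved[page], vm->mem[page], PAGE_SIZE);
        vm->dirty[page] = true;
    }
    vm->mem[page][offset] = value;
}

/* Roll back to the last checkpoint: only pages written during the current
 * interval are restored; untouched pages are already in checkpoint state. */
static void checkpoint_restore(vm_state *vm) {
    for (int p = 0; p < NUM_PAGES; p++) {
        if (vm->dirty[p]) {
            memcpy(vm->mem[p], vm->saved[p], PAGE_SIZE);
            vm->dirty[p] = false;
        }
    }
}
```

The cost of taking a checkpoint in this model is proportional to the number of pages written in the interval rather than to the size of memory, which is why short intervals keep each increment small.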
In addition to the high-frequency checkpointing of VM memory, we also have a design for 
disk-based checkpointing to deal with permanent hardware failures or physical machine failures. 
The disk-based checkpoint is taken at a much lower frequency, e.g., once every couple of hours.  
                                                        
1 My colleague, Weining Gu, conducted this error injection campaign for the IBM Power series. The result of the campaign has not been published yet. 
1.3. Approach Evaluation 
The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 
processor and is also implemented in the Xen VMM. (The Xen hypervisor is recompiled for 
installing RMK, but the OS of a native system or a VM is not recompiled.) 
The effectiveness and performance of the RMK are assessed by two means: experimental 
evaluation and model-based analysis. Error injection experiments show that our RMK detects all 
the crashes and system hangs, and VM-µCheckpoint successfully recovers VMs from all the 
crashes and most injected system hangs. Xen uses a split-driver model for I/O operations. 
Because I/O checkpointing is not implemented in the current VM-µCheckpoint prototype, there are 
cases in which the state shared between the split drivers in two VMs becomes inconsistent and 
causes the recovery to fail.  
Moreover, the experimental evaluation of the RMK using real-world applications shows that, by 
exploiting characteristics of application and system behavior, we are able to achieve high 
coverage and low false-positive rates for error detection (e.g., no false positives for system hang 
detection) as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% 
overhead in VM-µCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals).  
We also apply a formal method to verify the capability of our system hang detection in the RMK. 
The formal method identifies several corner scenarios that lead to a system hang and escape 
detection, but these scenarios are unlikely to occur in real systems (e.g., an interrupt coinciding with 
a context switch has a low probability). 
We construct an analytical model and a Markov model to study the availability enhancement 
provided by VM-µCheckpoint. The availability values computed from these models show that we 
achieve better results than existing migration-based VM checkpointing. For example, for an 
average job duration of 8 hours in a 99%-available VM (on top of a hypervisor with an MTTF of 
1.7 years), we achieve an availability of 99.7%, while migration-based VM checkpointing 
achieves 97.7%.  
1.4. Thesis Organization 
The thesis is organized into the following chapters. Chapter 2 describes the architecture of the 
RMK. Chapter 3 discusses a number of RMK modules (i.e., error detection or checkpoint 
techniques) on standalone systems. Chapter 4 presents a formal-method analysis to verify the 
system hang detection technique proposed in Chapter 3. Chapter 5 describes our VM 
checkpointing algorithm. Chapter 6 discusses how the RMK is enhanced for the virtualized 
environment and how the error detection techniques and VM checkpointing are integrated to 
enhance VM availability. Chapter 7 analyzes how VM checkpointing enhances VM 
availability by employing an analytical model and a Markov model. Chapter 8 lists the related 
work, and Chapter 9 concludes the thesis. 
I completed several projects prior to my thesis work. Appendix A presents stochastic models of 
coordinated checkpointing to study the behavior of large-scale supercomputers in the presence of 
errors. We found that resource utilization is severely limited if only coordinated 
checkpointing is applied in a large supercomputer (more than 50% of resources are spent on 
checkpointing and recovery). It is therefore crucial to enhance single-node reliability rather than 
just apply coordinated checkpointing. This finding motivates my thesis work on providing an 
automatic approach for low-latency error detection and rapid checkpointing/recovery.  
Appendix B presents my previous work on incremental checkpointing in a main-memory 
database application. In that work, system downtime was reduced by a factor of five by 
exploiting knowledge of application semantics. This result shows the need for monitoring 
application/system behavior and collecting information on application/system semantics, and 
it led to the RMK framework in this thesis.  
 
CHAPTER 2 
RELIABILITY MICROKERNEL ARCHITECTURE  
 
2.1. Introduction 
The Reliability MicroKernel (RMK) architecture, deployed as a device driver, provides an 
infrastructure that enables the design and deployment of software modules for providing 
application-aware reliability services. The implemented RMK modules include OS/application 
crash detection, system hang detection, and transparent checkpoint. The architecture exploits 
processor-level features (debugging and monitoring facilities available in the current generations 
of processors) and OS-exported interfaces to define a set of basic services. These basic services 
are called RMK pins, which are analogous to hardware pins in providing clearly defined 
functionalities and inputs/outputs. The pins are employed to design application-specific 
mechanisms, referred to as RMK modules, for runtime system monitoring and low-latency 
error detection.  
The attributes of the currently implemented architecture allow the following. 
• Design and deployment of application-aware reliability techniques. The RMK wraps (or 
abstracts out) the functionalities of the OS and the underlying hardware into RMK pins. 
Using the interfaces and services exported by the pins, designers of reliability modules can 
focus on developing application-specific techniques/algorithms without detailed knowledge 
of the specifics of the OS or the underlying hardware. Several techniques have been 
implemented to demonstrate how OS knowledge of application execution can be used to 
provide error detection and recovery. Table 1 briefly summarizes the techniques currently 
available in the RMK.  
Table 1: Reliability Mechanisms in the RMK 
Reliability Service | Application Execution Pattern | Technique Description 
Application Hang Detection | Number of instructions executed within a well-defined code block, e.g., a loop | Number of instructions executed within the code block is counted. If the count goes beyond a preset scope, a hang is flagged. 
Application Crash Recovery | OS signal delivered to application | Delivery of the terminating signal to a process is intercepted. If the signal is not handled, the process is recovered from its checkpoint. 
Transparent Application Checkpoint | Memory access patterns | Original pages written during the checkpoint interval are backed up. If the application fails while the system is still operational, the original pages are restored. 
System Hang Detection | Number of instructions executed between two consecutive context switches | The count of instructions executed between two consecutive context switches is a finite number. If the system hangs, it does not schedule processes, and the instruction count goes beyond the preset scope. 
• On-demand configuration and customization of reliability techniques. The RMK enables 
on-demand configuration of reliability support provided to applications. RMK pins and RMK 
modules can be installed or removed on demand.  
• Platform independence of the RMK architecture and modules. The architecture of the 
RMK is independent of the platform (hardware and operating system) on which the RMK runs. 
Moreover, the set of RMK modules implemented on one operating system (e.g., Linux) can be 
deployed onto other systems (e.g., Solaris, FreeBSD, or Windows systems) with minimal or no 
code changes, as long as the corresponding RMK pins (the platform-dependent components) are 
implemented and compiled for the target operating system. Currently, the RMK has been 
implemented on both Linux and Windows systems. 
Fault/error injection experiments conducted to evaluate the efficiency of the RMK 
implementation show that the OS-level mechanisms detect all application and system hangs (due 
to injected errors) with a very low false-positive rate (1 out of 2000 experiments for application 
hang detection, and no false positives for system hang detection).2 Additionally, the RMK-based 
mechanisms provide low-latency detection and transparent checkpointing with little impact on 
system performance (0.6% performance overhead for the application hang detection and 0.1% 
overhead for the transparent application checkpointing).  
2.2. RMK Framework 
The Reliability MicroKernel (RMK) framework is developed as a loadable kernel module (or a 
device driver) in the operating system. The RMK has a two-level hierarchy, shown in Figure 1.  
The lower level (RMK pins) interfaces with the system and hardware, while the upper level 
(RMK modules) hosts application-specific detection and recovery techniques. RMK pins 
encapsulate low-level services of the system and of the underlying hardware. Well-defined pin 
interfaces (available to users) can be selectively used to build RMK modules. Each RMK module 
implements a specific error detection or recovery mechanism, e.g., crash detection and recovery 
of applications; hang detection of applications; hang detection of the operating system; or 
transparent application checkpoint and recovery. The RMK core between the two levels manages 
the installation and de-installation of pins and modules; it also invokes pin functionalities on 
behalf of RMK modules. The RMK core provides a set of RMK APIs for this management 
purpose. 
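A minimal sketch in C may help picture the role of the RMK core as a registry that installs pins and invokes them on behalf of modules. It is a user-space illustration under stated assumptions, not the RMK API: only the name rmk_register_pin and the operation signature (int operation_ID, void* arg) follow the pseudocode of the P_SCHL pin; the return conventions, rmk_unregister_pin, rmk_invoke_pin, the table size, and the demo pin are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RMK_MAX_PINS 8 /* hypothetical table size */

/* Pin entry point, shaped like p_schl_oper(int operation_ID, void *arg). */
typedef int (*rmk_pin_oper)(int operation_id, void *arg);

static rmk_pin_oper rmk_pins[RMK_MAX_PINS]; /* NULL means the slot is empty */

/* Install a pin. Returns 0 on success, -1 if the ID is invalid or taken.
 * Passing a NULL handler is treated as a programming error. */
static int rmk_register_pin(int pin_id, rmk_pin_oper oper) {
    assert(oper != NULL);
    if (pin_id < 0 || pin_id >= RMK_MAX_PINS || rmk_pins[pin_id] != NULL)
        return -1;
    rmk_pins[pin_id] = oper;
    return 0;
}

/* Remove a pin. Returns 0 on success, -1 if no such pin is installed. */
static int rmk_unregister_pin(int pin_id) {
    if (pin_id < 0 || pin_id >= RMK_MAX_PINS || rmk_pins[pin_id] == NULL)
        return -1;
    rmk_pins[pin_id] = NULL;
    return 0;
}

/* The core invokes a pin operation on behalf of a module.
 * Returns the pin's result, or -1 if the pin is not installed. */
static int rmk_invoke_pin(int pin_id, int operation_id, void *arg) {
    if (pin_id < 0 || pin_id >= RMK_MAX_PINS || rmk_pins[pin_id] == NULL)
        return -1;
    return rmk_pins[pin_id](operation_id, arg);
}

/* Demo pin used only to exercise the registry (hypothetical). */
enum { P_DEMO = 0 };
enum { INTERCEPT = 1, RESTORE = 2 };
static int demo_intercepted;

static int demo_oper(int operation_id, void *arg) {
    (void)arg; /* unused in this demo */
    switch (operation_id) {
    case INTERCEPT: demo_intercepted = 1; return 0;
    case RESTORE:   demo_intercepted = 0; return 0;
    default:        return -1;
    }
}
```

Because modules reach a pin only through the core, a module written against pin IDs and operation IDs does not depend on how a given platform implements the pin.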
                                                        
2 The 100% coverage and the false positive rate of 1 out of 2000 are the observed results of the conducted experiments. 
[Figure 1 labels: Applications, OS, Hardware, RMK (modules, core, pins, pin interface)]
Figure 1: The RMK in the System 
The abstraction introduced by the two-level RMK structure enables: (i) the use of standard 
operating system functions to create the services essential to designing and implementing 
reliability mechanisms (encapsulated as RMK modules); (ii) the portability of RMK modules across 
operating systems, because pins implemented on different platforms export the same interfaces 
(no need to modify the module code); and (iii) transparent coordination between multiple modules 
and pins to avoid potential conflicts. For example, two RMK modules, system hang detection and 
application hang detection, both intercept process context switches to detect hangs. The RMK 
coordination mechanism handles scenarios such as the installation of application hang detection 
while system hang detection is already in place and allows the two modules to function without 
conflicts. 
2.2.1. System-Level RMK Interface 
The RMK interfaces with the operating system (or more generically with the runtime environment) 
via RMK pins. A pin uses a collection of OS functions to construct a specific service that is 
essential in providing reliability mechanisms. In this sense, a pin implementation is 
platform-specific.  But while the pin implementation on different platforms may differ, the pin 
functionality and exported interface are intended to be the same, regardless of the platform.3 For 
example, Figure 2 depicts the architecture (Figure 2a) and implementation (pseudocode in Figure 
2b) of the P_SCHL pin in Linux, which intercepts the process context switch, an essential 
service in enabling application and/or system hang detection.4 
[Diagram: the P_SCHL RMK pin sits between the OS (interrupt table, debug register) and the dispatcher in the RMK core; its pin interface carries the operations (INTERCEPT, RESTORE) and the event (EVT_CONTSWITCH).]
(a) architecture 
int p_schl_init(void) {
    …
    rmk_register_pin(P_SCHL, p_schl_oper);
    …
}

int p_schl_oper(int operation_ID, void* arg) {
    get the interrupt vector# from arg;
    switch (operation_ID) {
    case INTERCEPT:  // intercept context switches
        preserve the original debug exception handler;
        set the debug exception handler to my_DB_exp();
        set one debug register (DR3) to the scheduler function (schedule());
        enable DR3 to raise debug exception;
        break;
    case RESTORE:    // restore the original setting
        disable DR3 from raising debug exception;
        clear DR3;
        restore the original debug exception handler;
        break;
    }
    …
}

int my_DB_exp(void)
{
    rmk_raise_event(P_SCHL, EVT_CONTSWITCH);
    call original debug exception handler;
    // the original handler disables debug exception, so re-arm the breakpoint
    set DR3 to schedule();
    enable DR3 to raise debug exception;
}
(b) implementation code  
 
 
Figure 2: The Implementation of an Example Pin, P_SCHL in Linux 
In modern operating systems, context switches are transparent to applications; i.e., an application 
is not notified when a context switch occurs. A common method of intercepting context switches 
would be to instrument the scheduler. The P_SCHL pin, however, takes advantage of the debug 
exception mechanism to intercept context switches without instrumentation of the operating 
system source code.  
Generically, the interface exported by a pin consists of the following two sets. 
• A set of operations the pin can perform. In the case of P_SCHL, the operations are 
INTERCEPT and RESTORE (see Figure 2). 
• A set of events the pin can produce given a trigger condition (a process context switch in the 
case of the P_SCHL pin). In the case of P_SCHL, the event set includes EVT_CONTSWITCH 
(see Figure 2). 
3 It is assumed that the basic functionalities (e.g., scheduler) are available across the operating systems considered 
here. 
4 In the Windows operating system, the pin interface for P_SCHL is the same, but the implementation is slightly 
different. 
The INTERCEPT operation within the P_SCHL pin performs two primary tasks: (i) it writes the 
scheduler entry point (the address of the first instruction of schedule()) into one of the 
processor breakpoint registers (DR3 on the Pentium 4), thus forcing the processor to raise a debug 
exception whenever schedule() is invoked,5 and (ii) it modifies the system’s interrupt vector table 
so that the debug exception interrupt vector points to the custom exception handler, my_DB_exp(), 
which (in addition to invoking the default debug exception handler) generates an event, 
EVT_CONTSWITCH, to indicate the context switch.  
A small set of RMK API functions (use_pin(), release_pin(), issue_cmd(), subscr_event(), and 
unsubscr_event()) is implemented to enable transparent subscription to events and the invocation 
of pin operations by RMK modules. In our example, the OS and/or application hang detection 
module subscribes to the event exported by the P_SCHL pin. The subscription table (i.e., the 
mapping between an event and a module or modules) is maintained by the dispatcher (see Figure 
2a) and populated at the time of module initialization.  
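The event subscription and operation dispatch described above can be sketched as a small user-space simulation. This is only an illustration under stated assumptions: the class names, the argument lists of subscr_event(), unsubscr_event(), and issue_cmd(), and the string identifiers are hypothetical, since the thesis names the API functions but does not give their signatures here; use_pin() and release_pin() are omitted.

```python
from collections import defaultdict


class Dispatcher:
    """Toy stand-in for the RMK core dispatcher (user-space simulation only)."""

    def __init__(self):
        self.pins = {}                          # pin ID -> operation handler
        self.subscriptions = defaultdict(list)  # (pin ID, event) -> callbacks

    def register_pin(self, pin_id, oper):
        # Mirrors rmk_register_pin() from Figure 2b; the signature is assumed.
        self.pins[pin_id] = oper

    def subscr_event(self, pin_id, event, callback):
        self.subscriptions[(pin_id, event)].append(callback)

    def unsubscr_event(self, pin_id, event, callback):
        self.subscriptions[(pin_id, event)].remove(callback)

    def issue_cmd(self, pin_id, operation, arg=None):
        # Forward an operation request from a module to the pin.
        return self.pins[pin_id](operation, arg)

    def raise_event(self, pin_id, event):
        # Deliver the event to every subscribed module callback.
        for callback in list(self.subscriptions[(pin_id, event)]):
            callback(pin_id, event)


class SchedulerPin:
    """Simulated P_SCHL: no debug registers, just a flag."""

    def __init__(self, dispatcher):
        self.dispatcher = dispatcher
        self.intercepting = False

    def oper(self, operation, arg=None):
        if operation == "INTERCEPT":
            self.intercepting = True
        elif operation == "RESTORE":
            self.intercepting = False
        else:
            raise ValueError("unknown operation: %r" % (operation,))
        return 0

    def context_switch(self):
        # Stands in for the debug exception raised at the scheduler entry.
        if self.intercepting:
            self.dispatcher.raise_event("P_SCHL", "EVT_CONTSWITCH")


rmk = Dispatcher()
pin = SchedulerPin(rmk)
rmk.register_pin("P_SCHL", pin.oper)

shd_events, ahd_events = [], []
rmk.subscr_event("P_SCHL", "EVT_CONTSWITCH", lambda p, e: shd_events.append(e))
rmk.subscr_event("P_SCHL", "EVT_CONTSWITCH", lambda p, e: ahd_events.append(e))

rmk.issue_cmd("P_SCHL", "INTERCEPT")
pin.context_switch()   # both subscribers are notified
rmk.issue_cmd("P_SCHL", "RESTORE")
pin.context_switch()   # interception is off, nobody is notified
print(shd_events, ahd_events)
```

In this sketch, two modules subscribed to the same event each receive one notification, which mirrors the case of the system and application hang detection modules sharing the P_SCHL pin.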
Although RMK pins deal with system specifics, no kernel source patch or recompilation is needed 
for pin implementation or deployment. This is because the interfaces exported by operating 
systems, and the debugging and monitoring facilities available in modern processors, can be 
exploited for this purpose. For example, Linux exports a large number of kernel functions and 
variables in a symbol list stored in the file /boot/System.map (23,306 symbols in Linux 2.6.11.3). 
                                                        
5 In Linux, schedule() is always invoked when a context switch occurs. 
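As an illustration of how a symbol list can be used to locate a kernel routine such as schedule(), the following sketch parses text in the System.map layout (one "address type name" triple per line). It is a minimal user-space example, not the RMK code; the function name parse_system_map is hypothetical and the addresses in SAMPLE are made up.

```python
def parse_system_map(text):
    """Return a dict mapping symbol name -> address for System.map-style text."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        address, _symbol_type, name = parts[0], parts[1], parts[2]
        symbols[name] = int(address, 16)
    return symbols


# Made-up sample entries; a real file would be read from /boot/System.map.
SAMPLE = """\
c0101000 T schedule
c0102000 D sys_call_table
"""

symbols = parse_system_map(SAMPLE)
print(hex(symbols["schedule"]))
```

A pin that needs the scheduler entry point could, under these assumptions, look up the address once at load time and write it into the breakpoint register.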
Table 2 lists the RMK pins currently implemented in the RMK prototype. 
Table 2: RMK Pins in the RMK Prototype 
RMK Pin | Hardware/OS Features | Pin Functionalities
--- | --- | ---
P_PMC | Hardware counters available in modern processors; permit monitoring and measuring processor performance parameters, including instruction count, TLB access, and cache references. | Configures the counters to measure specific parameters; starts/stops counting; reads/writes counters; generates a PMI to the APIC when a counter reaches zero.
P_APIC | Advanced Programmable Interrupt Controller (APIC) provided in modern processors. The APIC receives interrupts from processor pins or an external interrupt controller and delivers interrupts to the processor core. | Configures the APIC for custom interrupt generation, e.g., generating an NMI when a performance monitor counter raises a PMI to the APIC.
P_DBR | Debug facilities available in hardware, including the debug exception and debug registers. | Sets up a debug exception at a particular location; installs a debug exception handler; generates an event upon a debug exception.6
P_INTR | OS-level interrupt handling. | Installs hooks for specific interrupts; generates an event upon an interrupt.
P_SIG | Signals delivered by the operating system to processes. | Intercepts particular signals to processes; generates an event upon a signal delivery.
P_MEM | Memory management provided in the OS. | Copies memory pages; sets page properties.
P_SCHL | Process context switch supported by the OS scheduler. | Intercepts context switches; generates an event upon a context switch.
P_PERI | System-level periodic jobs performed by the work queues of the OS. | Sets up a system-level periodic job at a specified interval; generates an event upon interval expiration.
P_NET | Network communications performed by the system. | Intercepts network activities, including message sending and arrival; generates events upon these activities.
P_SYSC | A variety of system calls executed by the system on behalf of applications. | Intercepts specific system calls; generates an event upon a system call.
P_FILE | Functions to manage the table of open files and read/write operations. | Wraps the manipulation of the file table; generates an event when a monitored file read/write completes.
P_IPC | IPC mechanisms, such as pipes, message queues, and semaphores. | Wraps the manipulation of IPCs; generates an event when a monitored IPC occurs.
6 The P_DBR pin handles use of the debug registers for debugging. When the debugged application is switched off the 
CPU, the pin saves the current values of the registers and sets new values. When the debugged application is scheduled 
onto the processor, the pin restores the saved register values. Where the debug registers would otherwise be used to 
intercept process context switches (as in the system hang detection module on Linux), a binary rewriting technique can 
be applied instead to implement context switch interception, so that the debug registers are not used (as in the system 
hang detection module on Windows). 
2.2.2. Application-Level RMK Interface 
Applications are monitored by RMK modules, which implement application-specific detection 
and recovery techniques. While most modules are application-transparent, for some RMK 
modules the application may need to be instrumented, for example by using an enhanced compiler 
to insert a system call in the application code that invokes a given RMK module. 
An RMK module can protect a set of applications on the system. The RMK allows users to 
associate a set of applications with RMK module(s). This information is kept in the dispatcher of 
the RMK core. Moreover, RMK modules can be installed and removed on demand by users. A 
specially designed RMK module, Configuration Manager, enables RMK reconfiguration. 
2.2.3. RMK Core 
The RMK core is responsible for: (i) maintaining mapping between the events generated by pins 
and the modules subscribed to these events; (ii) delivering the events to the modules; (iii) 
dispatching operation requests to the pins; and (iv) managing and configuring the pins and 
modules (e.g., installation/removal). The RMK core consists of four components, illustrated in 
Figure 3 and listed here.  
• Pin manager is responsible for registration and loading/unloading of RMK pins. 
• Module manager handles installation and removal of RMK modules.  
• Dispatcher maintains subscription information for events generated by pins (i.e., the mapping 
between events and RMK modules) and delivers the events to modules according to the 
subscription list. 
• RMK communication channel facilitates remote invocation of pins in a networked 
environment, i.e., enables requesting a pin operation, or subscribing to an event generated by a 
pin, on a remote node. This remote pin invocation facilitates the design of distributed reliability 
mechanisms in the RMK.  
[Diagram: the RMK core (pin manager, dispatcher, module manager, and the RMK communication channel to the network) exchanges events and operation requests between the RMK pins (P_APIC, P_PMC, P_INTR, P_SIG, P_SCHL, ...), which interface with the OS and hardware, and the RMK modules (system hang detection, application hang detection, application crash detection, transparent application checkpoint, ...); the pin and module managers carry the management and configuration paths.]
Figure 3: RMK Architecture 
On-demand configuration of the RMK. RMK modules and pins can be deployed and removed 
on demand. When a module is installed, it acquires services from a set of pins. If the required pins 
are not available in the RMK (e.g., specific pins for a new module), they are automatically loaded 
into memory from permanent storage, e.g., a disk. The on-demand RMK configuration is initiated 
by the dispatcher and carried out by the pin/module managers (see Figure 3).  
We now illustrate how the RMK is configured to meet the needs of specific 
error-detection mechanisms. Assume that the RMK has installed only one module, the system 
hang detection module, which loads five pins (P_SCHL, P_PMC, P_APIC, P_DBR, P_INTR) 
into memory. A new module, the application hang detection module (AHD), is to be deployed to 
protect an application. The AHD module uses six pins: P_SCHL, P_PMC, P_DBR, P_INTR, 
P_PERI, and P_SYSC. During initialization of the AHD module, services from these six pins are 
requested via the RMK API. On receiving the service requests, the dispatcher forwards them to 
the pin manager, which finds that P_PERI and P_SYSC are not loaded into memory. Recall that 
the pin manager maintains the state of all the pins in the local host, indexed by pin ID. The pin 
manager then loads the two additional pins from the disk and sends a success response to the 
module via the dispatcher. If the required pins are not found, an error response is sent. After 
successful module initialization, the dispatcher configures the event-subscription table to 
establish the correct mapping between the newly deployed AHD module and the events generated 
by the pins.  
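The pin-loading scenario above can be sketched as follows. This is a simplified user-space model, assuming a hypothetical PinManager class that tracks pin state as Python sets; it is not the RMK implementation, and the all-or-nothing behavior on a missing pin is an assumption of the sketch.

```python
class PinManager:
    """Toy model of on-demand pin loading (pin state kept as sets of pin IDs)."""

    def __init__(self, loaded, on_disk):
        self.loaded = set(loaded)     # pins already in memory
        self.on_disk = set(on_disk)   # pins available in permanent storage

    def request_pins(self, required):
        """Return (ok, pins): the pins newly loaded on success, the pins not found on error."""
        required = list(required)
        missing = [p for p in required if p not in self.loaded]
        not_found = [p for p in missing if p not in self.on_disk]
        if not_found:
            return False, not_found   # error response; nothing is loaded
        self.loaded.update(missing)   # "load" the missing pins from disk
        return True, missing


SHD_PINS = ["P_SCHL", "P_PMC", "P_APIC", "P_DBR", "P_INTR"]
AHD_PINS = ["P_SCHL", "P_PMC", "P_DBR", "P_INTR", "P_PERI", "P_SYSC"]

manager = PinManager(loaded=SHD_PINS, on_disk=["P_PERI", "P_SYSC", "P_NET"])
ok, newly_loaded = manager.request_pins(AHD_PINS)
print(ok, newly_loaded)
```

With the system hang detection pins already resident, only P_PERI and P_SYSC need to be brought in for the AHD module, matching the scenario in the text.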
RMK self-checking. Errors in the RMK may cause system failures. The dispatcher, pin/module 
managers, and communication channels are carefully implemented and thoroughly tested. Due to the 
simplicity of RMK pins, errors in pin implementations are considered negligible. Errors in 
RMK modules are confined using the following method. Whenever an operation of a module is 
invoked, the RMK records the module ID. The record is cleared after the operation is finished. As 
there is no recursive invocation of module operations, the recorded ID always indicates the 
currently executing module. When an error in a module triggers a kernel exception, the RMK 
intercepts the exception through the P_INTR pin, checks the recorded module ID, and unloads the 
module (the module can be reloaded immediately after the unloading if that is preferred).  
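The confinement method above can be sketched in user space, with a Python exception standing in for the kernel exception that the RMK intercepts through P_INTR. The class and method names (ModuleGuard, invoke, handle_exception) are hypothetical and only illustrate the record, clear, and unload steps.

```python
class ModuleGuard:
    """Toy model of RMK self-checking: remember which module is executing."""

    def __init__(self):
        self.modules = {}       # module ID -> operation (a callable)
        self.current_id = None  # ID of the module whose operation is running
        self.unloaded = []      # IDs of modules removed after a fault

    def install(self, module_id, operation):
        self.modules[module_id] = operation

    def invoke(self, module_id, *args):
        operation = self.modules[module_id]
        self.current_id = module_id          # record the module ID
        try:
            return operation(*args)
        except Exception:
            # Stand-in for a kernel exception intercepted via P_INTR.
            self.handle_exception()
            return None
        finally:
            self.current_id = None           # clear the record afterwards

    def handle_exception(self):
        faulty = self.current_id
        if faulty is not None:
            self.modules.pop(faulty, None)   # unload the faulty module
            self.unloaded.append(faulty)


guard = ModuleGuard()
guard.install("AHD", lambda: "ok")
guard.install("BUGGY", lambda: 1 // 0)   # raises ZeroDivisionError when invoked

print(guard.invoke("AHD"))
print(guard.invoke("BUGGY"))
print(guard.unloaded, sorted(guard.modules))
```

Because module operations are never invoked recursively, a single recorded ID is enough to identify the module to unload.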
2.3. RMK Implementation on Linux and Windows 
The decoupling of RMK modules (platform-independent) and RMK pins (platform-dependent) 
allows the RMK architecture and RMK modules to be implemented only once (except for 
RMK modules with platform-specific semantics), with only the RMK pins re-implemented 
on each platform. RMK pins are small in terms of code size (a few 
hundred lines for each pin). Figure 4 and Figure 5 depict the implementation and deployment of 
the RMK on Linux 2.6.11 and Windows XP Professional (SP2), respectively. In both figures, the 
RMK is deployed as a loadable kernel module (or a device driver). The construction of a set of sample 
RMK pins is also shown in the figures. 
The RMK on Linux and Windows (RMK.ko and RMK.sys, respectively) have the same 
architecture, and RMK modules have the same implementation without major code changes. 
However, the implementation of an RMK pin can be: (i) OS-dependent if the pin uses OS-specific 
services, or (ii) OS-independent if the pin interacts only with the hardware rather than with the OS. 
The rest of this section discusses examples of pins from the two categories.  
[Diagram: the RMK driver (RMK.ko) hosts the RMK core, sample pins (P_APIC, P_PMC, P_INTR, P_SCHL, P_SYSC, P_SIG, ...), and modules (system hang detection, application hang detection, transparent application checkpoint, application crash recovery) alongside the Linux kernel (vmlinuz: process, memory, signal, and I/O managers, file system and cache manager, IPC, system calls) and the device and file system drivers. Labeled access paths include assembly code, memory-mapped I/O, SIDT, a hook through a debug register and exported symbol for the scheduler schedule(), a hook through the exported symbol sys_call_table for the system call table, and an installed signal handler. Hardware shown: performance monitor counters, local APIC, interrupt table (IDT), debug register, DMA, MMU.]
Figure 4: The RMK Implementation and Deployment on Linux 2.6.11 
 
[Diagram: the RMK driver (RMK.sys) hosts the RMK core, sample pins (P_APIC, P_PMC, P_INTR, P_SCHL, P_SYSC, P_SIG, ...), and modules (system hang detection, application hang detection, transparent application checkpoint, application crash recovery) alongside the Windows kernel (ntoskrnl.exe: process, registry, memory, security, object, GUI, and I/O managers, file system and cache manager, IPC, system services), the device and file system drivers, and the Hardware Abstraction Layer (HAL). Labeled access paths include assembly code, memory-mapped I/O, SIDT, inline hooking of the dispatcher routine SwapContext(), a hook through the exported symbol KeServiceDescriptorTable for the kernel service descriptor table, and a hook through the WinAPI for structured exception handling and console control handling. Hardware shown: performance monitor counters, local APIC, interrupt table (IDT), debug register, DMA, MMU.]
Figure 5: The RMK Implementation and Deployment on Windows XP Professional 
For example, the implementation of the P_SCHL pin, which intercepts process context switches 
and issues the EVT_CONTSWITCH event, is OS-dependent. The context switch routine on 
Linux, schedule(), is exported by the Linux kernel and can be intercepted by setting a breakpoint 
at the entry of schedule(). Whenever schedule() is invoked, a breakpoint handler is invoked, 
which generates EVT_CONTSWITCH and delivers it to the RMK core. On Windows, the 
context switch routine, SwapContext(), is not exported; it can be intercepted by applying the 
kernel inline hooking technique [4], which involves searching for predefined instruction patterns in 
the kernel code space (ntoskrnl.exe). The technique is used to search for the sequence of instructions 
found at the beginning of SwapContext(). These instructions are the same for all Windows XP 
systems and can be obtained by analyzing the ntoskrnl.exe file offline. After SwapContext() is 
located, its first instruction is replaced with an unconditional jump to 
the hook function;7 at the end of the hook function, the replaced instruction is executed, and 
the control flow jumps back to the instruction in SwapContext() that follows the replaced 
instruction.  
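The pattern search and jump patching behind inline hooking can be illustrated on an in-memory byte buffer. This is a user-space sketch under stated assumptions, not kernel code: the function names, the addresses, and the byte pattern (a generic x86 prologue, not the actual beginning of SwapContext()) are hypothetical, and padding the patched region with NOP bytes is a choice made for this sketch. The jump encoding itself is the standard x86 near jump: opcode 0xE9 followed by a 32-bit displacement relative to the end of the 5-byte instruction.

```python
import struct

JMP_REL32_LEN = 5  # opcode 0xE9 plus a 32-bit little-endian displacement


def make_jump(from_addr, to_addr):
    """Encode an x86 near jump placed at from_addr that targets to_addr."""
    displacement = to_addr - (from_addr + JMP_REL32_LEN)
    return b"\xE9" + struct.pack("<i", displacement)


def install_hook(code, base_addr, pattern, hook_addr, patch_len):
    """Find pattern in code (a bytearray loaded at base_addr) and patch in a jump.

    patch_len must cover whole instructions and be at least 5 bytes, so that
    two or more instructions are replaced when the first one is too short.
    Returns (offset, saved_bytes); the hook would execute saved_bytes before
    jumping back to base_addr + offset + patch_len.
    """
    offset = code.find(pattern)
    if offset < 0:
        raise ValueError("pattern not found")
    if patch_len < JMP_REL32_LEN:
        raise ValueError("patched region is too short for a jump")
    saved = bytes(code[offset:offset + patch_len])
    jump = make_jump(base_addr + offset, hook_addr)
    code[offset:offset + patch_len] = jump + b"\x90" * (patch_len - JMP_REL32_LEN)
    return offset, saved


# Hypothetical prologue: push ebp; mov ebp, esp; sub esp, 0x10 (6 bytes).
PROLOGUE = bytes.fromhex("558bec83ec10")
image = bytearray(b"\xcc\xcc" + PROLOGUE + b"\xc3")

offset, saved = install_hook(image, 0x1000, PROLOGUE, 0x2000, len(PROLOGUE))
print(offset, saved.hex(), image.hex())
```

In this example the first instruction is a single byte, so all three prologue instructions are saved and overwritten, which corresponds to the multi-instruction case mentioned in the footnote.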
Another pin with different implementations on Linux and Windows is P_SYSC, which 
intercepts system calls. Although the system call tables on both platforms (sys_call_table in 
Linux and KeServiceDescriptorTable in Windows) are exported by the kernel, each has its own 
data structure, which requires different handling.  
The P_APIC and P_PMC pins can be reused without reimplementation because they interact 
with the hardware rather than with the operating system. Similarly, the P_SCHL pin for Windows can 
be reused in any Windows system on different hardware. Also note that RMK pins do not need to 
be reimplemented across variants of a platform: the implementation of the P_SCHL pin on 
Linux 2.6 works on Linux 2.4, and the implementation of the P_SCHL pin on Windows XP works 
on Windows Server 2003 as well.  
7 If the size of the first instruction is less than the size of the jump instruction, two or more instructions are replaced. 
RMK use scenario. RMK-supported services are portable across different platforms (hardware 
and operating systems). As a result, a broad range of applications can potentially benefit from 
the RMK. Consider a cluster of web servers on a number of heterogeneous nodes (Linux and 
Windows). These web servers may crash or hang due to many causes, including corrupted (or 
invalid) input or bugs in poorly written device drivers. With the RMK installed on every node, 
the server downtime and the system downtime can be reduced, and the availability of web 
services can be enhanced. Because RMK modules are platform-independent, the RMK 
can scale to embedded devices, e.g., smart phones, and enhance reliability without major changes 
to module design and implementation. Furthermore, RMK-based services can be extended to 
provide protection against malicious tampering with the system, e.g., by monitoring access to 
critical application data or system resources. 
CHAPTER 3 
RMK MODULES 
 
This chapter discusses research and implementation challenges faced in developing several RMK 
modules.  
3.1. System Hang Detection Module (SHD) 
The operating system is subject to hangs due to, e.g., poorly written drivers with blocking 
operations8 and unreleased locks. Chou et al. [5] claim that 34% of kernel bugs in Linux 2.4.1 
potentially lead to system hangs. During a system hang, the processor may either be executing 
non-HLT instructions or be halted after executing an HLT instruction. The System Hang Detection 
module (SHD) provides low-latency detection of, and recovery from, system hangs. 
An external hardware watchdog can be used to force a system reboot whenever the watchdog is 
not reset within a predefined timeout interval. In this case, however, one cannot determine 
whether the OS has crashed/hung, or the heartbeat process, designated to reset the watchdog 
periodically, has crashed/hung. In either situation, a system reboot is initiated, although in the 
latter case a system reboot might not be necessary. Also, the detection latency might be 
significant because the timeout interval for resetting the watchdog timer is usually a matter of 
seconds (to reduce false alarms and the time overhead of the heartbeat).  
OS kernel instrumentation and watchdog timer modifications are required to reduce latency in 
detecting system hangs and to avoid false positives (i.e., an unnecessary OS reboot in the case of a 
heartbeat process crash/hang). For example, one could instrument the OS scheduler to send a 
heartbeat message to the watchdog on every context switch. If no heartbeat arrives within a 
certain time interval (i.e., the operating system does not schedule processes), a system hang is 
declared.  
8 For example, a device driver, after interrupts are disabled, performs a blocking operation to acquire a lock that is held 
by another process. As the blocking mutex-acquisition operation is performed in the kernel, the system hangs. 
The approach in this thesis enables low-latency, transparent OS hang/crash detection. The 
detection mechanism is designed and implemented as a light-weight kernel driver; hence, it 
does not require instrumentation of the kernel code and is fully transparent to the system.  
The underlying fault model for a system hang is that an operating system in a hang state does not 
relinquish the processor and does not schedule any processes. Based on this fault model, system 
hang detection can be constructed as follows. 
1. Use OS-level counters to separately track instructions executed by the application processes 
and the operating system. 
2. Observe periodically (e.g., on each context switch) whether the instruction count in each 
counter changes between the consecutive readings. 
3. If the contents of the OS counter continue to change and the counters associated with 
application processes are frozen, this is a clear indication of a system hang.  
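The comparison in steps 2 and 3 can be written down as a small predicate. This is a sketch only: the snapshot layout (a dictionary with an "os" count and a per-process "apps" mapping) and the function name are assumptions made for illustration, not data structures from the RMK.

```python
def is_system_hang(previous, current):
    """Compare two consecutive counter snapshots.

    Each snapshot is a dict: {"os": <count>, "apps": {pid: <count>, ...}}.
    A hang is indicated when the OS counter keeps changing while every
    application counter is frozen.
    """
    os_advanced = current["os"] != previous["os"]
    apps_frozen = current["apps"] == previous["apps"]
    return os_advanced and apps_frozen


healthy_before = {"os": 1000, "apps": {1: 500, 2: 700}}
healthy_after = {"os": 1200, "apps": {1: 650, 2: 700}}
hung_after = {"os": 9000, "apps": {1: 500, 2: 700}}

print(is_system_hang(healthy_before, healthy_after))
print(is_system_hang(healthy_before, hung_after))
```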
Two important issues are: (i) to determine a time period during which to measure the dedicated 
OS- and application-level instruction counters; and (ii) to have a mechanism to continuously and 
accurately measure the number of instructions being executed in the selected time period.  
In a system executing a set of same-priority tasks, the count of instructions executed between two 
consecutive context switches (the continuous-execution-instruction-count) is a finite number. If the 
system hangs, it may not schedule processes; as a result, the instruction count grows beyond a 
limit. In principle, this limit is determined by (i) the time-slice allocated by the OS to a process 
between two consecutive context switches (100 ms for typical processes in Linux 2.6, and 
20/40/60 ms in Windows XP), and (ii) the fact that the count of instructions executed within the 
time-slice has a bound (calculated as the product of the CPU speed and the time-slice). However, in 
most cases this bound is far larger than the counts actually observed, because a process voluntarily 
yields the CPU when it is waiting on resources (e.g., asynchronous I/O input) or idling for a certain 
time period (e.g., sleep functions or blocking synchronous I/O operations). To provide low-latency 
system hang detection, a much lower bound, based on profiling of system execution, can be applied to 
the continuous-execution-instruction-count. A history of instruction counts collected over several 
time-slices is used to guide the estimation of this lower bound.  
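As a worked example of the product bound, the following sketch evaluates it for the time-slices quoted above. The 3 GHz clock rate and the figure of one instruction retired per cycle are assumptions chosen only to make the arithmetic concrete; neither value comes from the thesis.

```python
def time_slice_bound(cpu_hz, time_slice_ms, instr_per_cycle=1):
    """Upper bound on instructions executed in one time-slice (integer arithmetic)."""
    return cpu_hz * time_slice_ms * instr_per_cycle // 1000


CPU_HZ = 3_000_000_000  # hypothetical 3 GHz processor

print("Linux 2.6, 100 ms:", time_slice_bound(CPU_HZ, 100))
for ms in (20, 40, 60):
    print("Windows XP, %d ms:" % ms, time_slice_bound(CPU_HZ, ms))
```

Under these assumptions the bound is in the hundreds of millions of instructions, which illustrates why a profiled, much lower bound is needed for low-latency detection.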
The SHD module implements system hang detection using two counters:  
• Profiling counter, which keeps track of the number of instructions executed by a process in 
the current time-slice; and 
• Checking counter, which maintains a “check value”: the running maximum of the 
number of instructions executed in a time-slice, obtained over a sliding 
window of about 50 time-slices.   
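The check value can be sketched as a maximum over a bounded history of profiled counts. This is an illustrative user-space model: the class name CheckValueEstimator, the use of a deque, and the fallback to a caller-supplied default when no history exists are assumptions of the sketch, not details of the SHD implementation.

```python
from collections import deque


class CheckValueEstimator:
    """Running maximum of per-time-slice instruction counts over a sliding window."""

    def __init__(self, window=50, default=None):
        self.history = deque(maxlen=window)  # oldest counts fall out automatically
        self.default = default               # used until some history is recorded

    def record(self, instruction_count):
        """Store the profiling counter value read at a context switch."""
        self.history.append(instruction_count)

    def check_value(self):
        """Value to load into the checking counter for the next time-slice."""
        if not self.history:
            return self.default
        return max(self.history)


estimator = CheckValueEstimator(window=3, default=1_000_000)
for count in (10_000, 50_000, 20_000, 5_000, 7_000):
    estimator.record(count)
    print(count, "->", estimator.check_value())
```

With a window of three, the large count of 50,000 stops influencing the check value once three newer time-slices have been recorded, which is how the bound adapts to changes in system execution.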
Figure 6 illustrates the principle the SHD applies for system hang detection. The horizontal axis 
represents instruction-stream execution on the processor. The dashed and solid lines in the figure 
illustrate the evolution of the values in the two counters as a function of instruction execution. The 
profiling counter (the dashed line) increments as instructions execute for each process during a 
time-slice. When a context switch occurs, the counter value is recorded by the SHD, and the 
counter is reset to zero to prepare for counting the instructions for the next scheduled process.  
[Plot: counter value (vertical axis, starting at 0) versus instruction execution (horizontal axis), showing the profiling counter and the checking counter across alternating periods of application process execution (user and kernel mode) and scheduler execution, ending with an OS hang/panic.]
Figure 6: System Hang Detection Using Instruction Counting 
The checking counter (the solid line) detects system hangs. It is set to an initial value, the check 
value, after each context switch, and it decrements with instruction execution. The check value is 
estimated according to the profiled count history, as discussed above. One can see from Figure 6 
that, when a higher value is recorded by the profiling counter, the check value increases. The 
instruction counts recorded over a sliding window of 50 recent time-slices are used to compute the 
check value that is to be loaded into the checking counter.  
Recall that the fault model for system hang detection assumes that the OS continues to execute 
instructions without relinquishing control to application processes, i.e., the context switch is no 
longer invoked. Thus, the checking counter is not reset, and finally reaches zero (indicated as a 
bell alarm in Figure 6). At that time, the counter raises a performance counter overflow interrupt 
(also known as a performance monitor interrupt, or PMI) to the APIC, which then issues an NMI 
(non-maskable interrupt) to the processor. The NMI handler initiates a system reboot.9 The 
profiling-based checking makes the detection low-latency while adapting to changes in the system 
execution.  
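The countdown behavior of the checking counter can be simulated as follows. This is a user-space sketch: the class name, the callback used in place of the PMI/NMI path, and the batching of instruction counts are assumptions for illustration, whereas the real mechanism relies on a hardware performance counter.

```python
class CheckingCounter:
    """Simulated checking counter: reloaded on context switch, counts down."""

    def __init__(self, on_hang):
        self.remaining = 0
        self.armed = False
        self.on_hang = on_hang   # stands in for the PMI -> NMI -> reboot path

    def context_switch(self, check_value):
        # The counter is set to the check value after each context switch.
        self.remaining = check_value
        self.armed = True

    def execute(self, instructions):
        # Decrement with instruction execution; fire once when reaching zero.
        if not self.armed:
            return
        self.remaining -= instructions
        if self.remaining <= 0:
            self.armed = False
            self.on_hang()


alarms = []
counter = CheckingCounter(on_hang=lambda: alarms.append("system hang"))

counter.context_switch(1000)
counter.execute(400)          # normal time-slice, well under the check value
counter.context_switch(1000)  # next context switch reloads the counter
counter.execute(900)          # still no alarm
counter.execute(200)          # no context switch arrived: counter reaches zero
print(alarms)
```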
3.1.1. Issues 
Hangs due to a halted processor. Recall that, according to our earlier definition, the OS in a 
hang state indefinitely executes instructions without relinquishing the processor. In addition to 
this fault model, we consider another case of system hang, in which the OS halts the processor 
while waiting for a response from a blocking I/O operation [6] that fails for some reason, or the 
OS executes a halt instruction due, for example, to an error in the control flow of a program. 
When the processor is in the halt state, performance monitor counters do not work. However, a 
timestamp counter still counts clock cycles [6] and can be used to detect system hangs due to a 
halted processor. Upon a context switch, the timestamp counter is set to a fixed timeout value 
(rather than to an instruction count). The counter decrements at each clock cycle and flags a 
system hang when it reaches zero (forcing a system reboot). The fixed timeout value for hang 
detection should be set sufficiently large, e.g., 5 seconds, to avoid false alarms, as blocking I/O 
operations (even when not in error) may take a long time.10 
Check Value Determination. The check value is naturally bounded by the product of the 
processor speed and the time-slice. A tighter bound is obtained as the running 
maximum of the number of instructions executed in a time-slice within a sliding window of a predefined 
number of time-slices. However, if the system transitions from an idle state to a busy one, the 
determined check value may be too small, causing a false alarm. In this case, a default check value is 
applied (the product of the processor speed and half the time-slice) in the current prototype. Note 
that a larger default value reduces false alarms but increases hang-detection latency. 
9 More advanced and sophisticated system recovery can be designed. For example, the erroneous device driver can 
be localized, uninstalled, and reinstalled, and the involved application processes recovered accordingly. 
10 One might expect to exploit the timestamp counter to detect indefinite execution as well, but the large timeout 
value results in long-latency hang detection. 
Priority-based scheduling. We assume that in a balanced system all tasks have the same priority. 
Although Linux and Windows support priority-based scheduling, many practical systems 
deal with equal-priority tasks. In these systems, kernel operations (e.g., system calls and 
interrupt/exception handlers) are completed within a time far shorter than the process time-slice 
appropriately selected during the system design. Consequently, the SHD is able to detect system 
hangs successfully. For systems with prioritized tasks, the check values can be set by profiling 
the instruction counts executed in the time-slices of the high-priority tasks.  
3.1.2. SHD Implementation 
Three pins, P_SCHL (scheduler interceptor), P_APIC (advanced programmable interrupt 
controller), and P_PMC (performance monitor counter), support the SHD module. To illustrate 
how the SHD works, the behavior of the module in Linux is depicted in Figure 7 
(with the implementation of the pins shown). The solid and dotted lines illustrate the control flow of 
the processor execution. Circled numbers indicate the step order.  
An application process, app1, enters the kernel mode to execute a system call or to handle an 
interrupt (step 1 in Figure 7). After the kernel mode processing completes, the Linux system 
checks whether the time-slice assigned to app1 is used up. If the time-slice is used up, the system 
invokes the scheduler (step 2). In a typical case, when the RMK is not deployed, the scheduler 
completes the context switch and transfers control to the next application process, app2 (step 
9, the dotted line).  
[Diagram: control flow, steps 1 to 10, among app1 user and kernel code, the scheduler, the breakpoint and debug control registers, the interrupt vector table (entries including divide error, debug exception, NMI, INT 3, and overflow), my_DB_exp with os_hang_check and the original debug exception handler code (orig_DB_exp), and app2 kernel and user code. MSR_IQ_Counter0, MSR_IQ_Counter2, and the Time Stamp Counter feed performance monitor and time stamp interrupts to the NMI generator in the local APIC, which raises an NMI to the processor core, leading to os_hang_reboot (reboot).]
Figure 7: Implementation of System Hang Detection in Linux 2.6.11 on a Pentium 4 
When the RMK is deployed, the scheduler entry point (the address of the first instruction of the 
scheduler) is written into a breakpoint register, and the debug control register is configured 
so that whenever the instruction at the breakpoint is about to be executed, a debug exception 
fault is raised by the processor (steps 3, 4). The debug exception entry registered in the 
system’s interrupt vector table points to the custom exception handler, my_DB_exp (step 5). This 
exception handler performs three tasks: (i) it reads and sets hardware counters in the processor 
and executes the detection algorithm (i.e., os_hang_check()); (ii) it calls the original debug 
exception handler (step 6); and (iii) it writes the scheduler entry point into the breakpoint register, 
configures the debug control register to enable repeating the same processing on the next context 
switch, and then returns to the scheduler11 (step 8). The third task is required because the original 
debug exception handler by default disables the debug exception. After returning to the scheduler, 
the typical processing resumes (the next process is scheduled) (steps 9, 10). Steps 3-8 are 
provided in the P_SCHL pin with the support of P_DBR and P_INTR.  
11 A do-not-retrigger bit in a hardware register should be set here to successfully return to the scheduler. Otherwise, 
the debug exception fault is triggered again upon the return, resulting in an infinite sequence of recursive faults. 
Two performance counters, MSR_IQ_Counter0 and MSR_IQ_Counter2, are used for profiling 
and checking, respectively. The Time Stamp Counter is used to handle processor-halt cases. When 
a system hang occurs, MSR_IQ_Counter2 decrements to zero and sends a performance
monitor interrupt (PMI) to the NMI generator in the local APIC. The NMI generator raises an
NMI to the processor, where the interrupt is converted to the EVT_NMI event, and the registered
callback function, os_hang_reboot(), reboots the system. The corresponding control flow is
depicted by dashed lines in Figure 7.  
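The interplay between the breakpoint handler and the checking counter can be illustrated with a minimal sketch. It is a toy model rather than the RMK code: the class name HangWatchdog and its methods are hypothetical, and it assumes that the checking counter is re-armed with the check value at every context switch and counts down once per retired instruction.

```python
class HangWatchdog:
    """Toy model of the checking counter (MSR_IQ_Counter2 in the text)."""

    def __init__(self, check_value):
        self.check_value = check_value   # instructions allowed between context switches
        self.remaining = check_value
        self.rebooted = False

    def on_context_switch(self):
        # Models my_DB_exp running at the scheduler entry point:
        # the counter is re-armed (assumption: reloaded with the check value).
        self.remaining = self.check_value

    def retire(self, instructions):
        # Instructions retired since the last call; reaching zero models the
        # PMI -> NMI -> os_hang_reboot() path.
        self.remaining -= instructions
        if self.remaining <= 0:
            self.rebooted = True         # stands in for os_hang_reboot()
        return self.rebooted


wd = HangWatchdog(check_value=1000)
wd.retire(400)            # normal execution
wd.on_context_switch()    # scheduler entered, counter re-armed
wd.retire(900)            # still below the check value
print(wd.retire(200))     # no context switch for 1100 instructions: hang flagged
```

In this model a hang is flagged only when the check value is exhausted without an intervening context switch, which is the condition the SHD relies on.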
3.2. Application Hang Detection Module (AHD) 
Applications (or individual application threads) may hang due to incorrect function parameters, 
hardware faults, deadlock, or failed I/O. For example, it is reported in [7] that 6 out of 11986 
experiments with bad parameters lead to hangs of POSIX functions in Linux 2.0.18. The failed 
thread is either continuously scheduled on the processor (as in an infinite loop), or not scheduled 
at all (e.g., waiting for wakeup on completion of asynchronous I/O operations). The Application 
Hang Detection module (AHD) transparently detects application hangs. The AHD exploits the 
fact that the count of instructions executed in a well-defined code block or code section can be 
statistically bounded [8].¹² If the instruction count goes beyond the bound, an application hang is
flagged.
¹² Note that the problem of determining the worst-case execution time or instruction count of a program or code block
is, in general, undecidable, and is equivalent to the halting problem.
A code section is a block of code with a single entry. A code section is monitored as a unit by the
AHD. After a thread enters a code section, it must leave the section before entering the section
again. Examples of code sections include a loop, a loop body (i.e., a single iteration of a loop), a
function, and a mutex block (a code block between mutex acquisition and release). A recursively
called function is not monitored as a code section. A code section may include multiple nested
sections (e.g., in the form of nested loops), and code sections may overlap. Each code section has a
unique ID within a process. Multiple threads may execute the same code section simultaneously.
The AHD keeps a logical counter for every thread running in a monitored code section. 
Consequently, an array of counters monitors a multithreaded application, as shown in Figure 8.
When a thread enters a code section, the corresponding counter starts counting instructions from 
zero, and the counter is called active; when it leaves the section, the counter stops counting, 
becoming inactive. Because the thread may be executing an instruction within multiple code 
sections, multiple counters for the thread may be active at the same time. The granularity of code 
sections can be selected to limit the number of sections, and thus avoid large overhead. Typically, 
3-6 code sections are monitored for an application. 
Though the AHD uses an array of logical counters for application hang detection, only one 
hardware performance monitor counter is used for counting. By using the hardware counter and
runtime system monitoring, the AHD detects application hangs with low overhead and low
latency. 
[Figure 8 (diagram): a grid of counters with one row per code section (code section 1 ... code section n) and one column per thread (thread 1 ... thread m); labels: check value, a value, a counter, time stamp.]
Figure 8: Counter Array Monitoring a Multithreaded Application in the AHD 
3.2.1. Issues 
Detecting an Unscheduled Thread. A timestamp is introduced for each thread in order to record
the time at which the thread entered a given code section (active code section). The AHD
performs a periodic check of these timestamps at a fixed interval. If a timestamp expires, the
corresponding thread is flagged as hung. Similar to system hang detection, the fixed interval must
be sufficiently large to avoid false alarms, e.g., 5 seconds.¹³ Note that a correctly functioning thread may not be
scheduled while it executes a function with an indefinite waiting time, such as accept(), select(), or
recv() in TCP. To avoid false alarms, these functions should not be included in a monitored code
section.
¹³ The scheduling of a thread may be delayed due to a heavy system load.
Determining the check value. The check value for a code section is derived from the execution
history of the section, similarly to the check-value determination in SHD. However, a code
section may have multiple exits, and the instruction counts of executions leaving through different exits
may vary widely. An example is a switch-case block with a different computation task and a
different exit for each case. AHD maintains a history of instruction counts for executions
associated with each exit, and the check value for the code section is derived based on the history
with the largest instruction counts. A threshold based on heuristics obtained from application
profiling is applied to avoid selecting small check values derived from insignificant cases of code
section execution (e.g., a web server sends a response to a client without sending web pages,
because the client has an unexpired copy of the pages).
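The per-exit bookkeeping can be sketched as follows. This is only an illustration under stated assumptions: the class SectionHistory and the constant MARGIN are hypothetical, the check value is assumed to be a safety margin times the largest count in the selected history, and the threshold is assumed to act as a lower bound on the check value. The history length (50) and the threshold (1.5 million instructions) are the values used later in the evaluation.

```python
from collections import defaultdict, deque

HISTORY_LEN = 50         # recent executions kept per exit
THRESHOLD = 1_500_000    # instructions; assumed to act as a floor on the check value
MARGIN = 2.0             # hypothetical safety factor, not taken from the text


class SectionHistory:
    """Instruction-count histories of one code section, kept separately per exit."""

    def __init__(self):
        self.by_exit = defaultdict(lambda: deque(maxlen=HISTORY_LEN))

    def record(self, exit_id, instruction_count):
        self.by_exit[exit_id].append(instruction_count)

    def check_value(self):
        if not self.by_exit:
            return THRESHOLD
        # Use the exit whose history has the largest average instruction count.
        heaviest = max(self.by_exit.values(), key=lambda h: sum(h) / len(h))
        return max(THRESHOLD, int(MARGIN * max(heaviest)))


h = SectionHistory()
h.record(exit_id=0, instruction_count=1_000)       # trivial path
h.record(exit_id=1, instruction_count=2_000_000)   # heavy path
print(h.check_value())
```

Keeping one history per exit prevents a frequently taken cheap exit from pulling the check value below what the expensive path legitimately needs.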
3.2.2. AHD Implementation 
The AHD module employs services provided by four RMK pins: P_SCHL, P_SYSC (system call),
P_PERI (periodic task), and P_PMC, in a way similar to the SHD implementation. AHD
exploits compiler-provided information on code sections to detect application hangs. When 
generating binary code for a program, the compiler adds two function calls, section_enter(section 
id) and section_leave(section id, exit id), at the entries and exits of the monitored code sections, 
respectively. (The compiler may select only coarse-granularity code sections for monitoring, like 
the outer loops or functions.)  A new system call, sys_section_action(), is added to service these 
two functions.  
Only one hardware counter is used in implementing the counter array in Figure 8. For each 
monitored process, the AHD maintains three kinds of tables: a section table recording all the 
monitored code sections, a thread-execution table for each thread column in Figure 8, and an 
instruction-count history table associated with each exit of a code section. The AHD algorithm is 
outlined as follows:
• Upon a context switch:
  - Read the hardware counter for the number of instructions executed during the last
    time-slice, and reset the counter to zero.
  - Add the read number to the current count values of all the active sections in the thread. If
    the sum for a section is larger than the section’s check value, indicate a hang.
  - Set the timestamp for the thread to the current system time.
• Upon an instruction retirement:
  - The hardware counter increments by 1.
• On section_enter(section id) invocation:
  - Set the active bit for the section and create related table entries if they are not present.
  - Read the hardware counter’s value c, and set the current count value of the section within
    the thread to -c (so that the addition performed at the next context switch
    generates the correct value).
• On section_leave(section id, exit id) invocation:
  - Read the hardware counter’s value, and add it to the current count value of the section
    within the thread.
  - Append the sum to the count history in the path table for the section and the exit.
    Then compute the new check value¹⁴ according to the count history for the exit with the
    largest average instruction count.
  - Clear the active bit, and remove unused table entries.
• Periodically perform the following check at a fixed interval:
  - Check the timestamps of all the threads of the process that are associated with at least one active
    section. If a timestamp has expired, indicate a hang.
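The way a single hardware counter serves many logical counters can be sketched in a few lines. The class AHD below is a hypothetical user-space model, not the kernel module: the hardware counter is an integer, the clock is injectable so that timestamp expiry can be exercised, and the history update on section_leave() is omitted.

```python
import time


class AHD:
    """Toy model: one hardware instruction counter shared by many logical counters."""

    def __init__(self, check_values, timeout=5.0, clock=time.monotonic):
        self.check_values = check_values   # section id -> check value
        self.hw = 0                        # instructions since the last context switch
        self.counts = {}                   # (thread, section) -> count of an active section
        self.stamps = {}                   # thread -> time of its last context switch
        self.timeout = timeout
        self.clock = clock
        self.hangs = []

    def retire(self, n=1):
        self.hw += n

    def section_enter(self, thread, section):
        # Start at -c so that adding the hardware value later yields the
        # number of instructions executed since entry.
        self.counts[(thread, section)] = -self.hw

    def section_leave(self, thread, section):
        # Returns the instruction count of this execution; no check is made here.
        return self.counts.pop((thread, section)) + self.hw

    def context_switch(self, thread):
        # `thread` is the thread that ran during the time-slice that just ended.
        executed, self.hw = self.hw, 0
        for key in list(self.counts):
            if key[0] == thread:
                self.counts[key] += executed
                if self.counts[key] > self.check_values[key[1]]:
                    self.hangs.append(key)
        self.stamps[thread] = self.clock()

    def periodic_check(self):
        now = self.clock()
        active = sorted({t for (t, _) in self.counts})
        return [t for t in active if now - self.stamps.get(t, now) > self.timeout]


ahd = AHD({1: 100})
ahd.retire(30)                     # instructions before the section is entered
ahd.section_enter("t1", 1)         # logical counter starts at -30
ahd.retire(50)
ahd.context_switch("t1")           # logical counter becomes -30 + 80 = 50
ahd.retire(40)
print(ahd.section_leave("t1", 1))  # 50 + 40 = 90 instructions inside the section
```

The negative initial value is what lets the context-switch handler add the full time-slice count to every active section without knowing when each section was entered.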
 
3.3. Transparent Application Checkpoint Module (TAC) 
Existing checkpointing schemes for individual applications usually take a snapshot of the 
application process image. However, as an application may have issued I/O operations that are 
pending in the system, and not finished by the checkpointing time, the snapshot of the process 
image, if restored upon a failure, may be inconsistent with the I/O status in the system. File 
reading/writing and network communication are examples of such I/O operations. Previous
application checkpointing schemes, e.g., Plank’s libckpt [9], rely on user-level knowledge to
decide when to take a checkpoint so that it is consistent with the I/O state, OS-issued signals, and
message queues. This method requires the program developer to have enough system-level
expertise to handle the consistency issue.
¹⁴ As checking is not performed here (on section_leave()), the sum may be larger than the current check value without a hang notification
being raised.
The TAC applies an approach similar to libckpt to incrementally checkpoint the dirty memory
pages. However, in doing so, TAC: (i) exploits the OS-level knowledge (collected by RMK pins)
on the application execution; and (ii) decides (transparently to the application) when and how to
checkpoint so as to avoid any inconsistency between the application image and the system state. In
addition, synchronization between threads in a multithreaded process and consistency of process
checkpoints in a distributed application can also be addressed transparently with OS-level
knowledge (e.g., messages can be logged by invoking appropriate operations of the P_NET pin).
In addition to the generic approach to deciding when and how to checkpoint, checkpointing
mechanisms can be customized to achieve even higher performance if the application execution
characteristics are known. This section discusses how TAC addresses the consistency issue in a
generic checkpointing method, and how the same approach can be employed to provide custom
application checkpointing.
3.3.1. Generic Checkpointing 
Each application process has its kernel state for correct execution of the process; and a transparent 
application checkpoint must address the consistency between the recovered application process 
and the kernel state. For example, Figure 9 illustrates a breakdown of the address space of a 
process execution in Linux. The kernel state of the process includes the process info, kernel stack, 
memory tables, signal/IPC status, wait queues, and tables for opened files. The process info (in 
Figure 9) contains process attributes (process ID, schedule priority), process relations (parent, 
child), and user information (uid, gid). The wait queues deal with the pending asynchronous I/O 
operations, and process/thread synchronization. The user state of the process consists of memory 
pages constituting the segments of code, data, heap, stack, and dynamically loaded libraries.  
[Figure 9 (diagram): user memory (code, data, BSS, heap, library, stack, e.v. & arg) and kernel memory (kernel code & data, page table, kernel state of other processes, and the kernel state of a process: kernel stack, process info, mem tables, file tables, wait queues, signals, pipes).]
Figure 9: Address Space of a Process in Linux  
The checkpoint taken by the TAC module includes the modified user memory pages and the 
processor state. The TAC module eliminates the need to checkpoint all kernel state of the process
by taking a checkpoint only when the kernel state is at a safe point, i.e., when there is no pending I/O, signal, or
IPC message in the context of the application process. Only the open-file table (names and
current seek offsets of all open files) and the memory table in the kernel state of the process are
checkpointed. When the process is recovered upon a failure, the file table and memory table are
restored in the process kernel state with no pending I/O, signals, or IPC messages, and with an
empty kernel stack. Consequently, after restoring the checkpointed user memory pages, the
recovered process is in a consistent state. The steps below outline the TAC
algorithm for application checkpointing.
1. Periodically initiate a checkpoint by setting a chkpt_started flag to 1. 
2. Check whether there is a pending I/O operation, signal, or IPC message in the target process. 
If TRUE, go to sleep; otherwise, go to step 6. 
3. Upon completion of an I/O operation (or signal processing, or IPC message processing), 
check the chkpt_started flag (this is done using functionality provided by P_FILE/P_NET 
pins). If the flag is set, the pin raises an event EVT_IO_COMP/EVT_NET_COMP to notify the 
TAC module. 
4. Upon initiation of an I/O operation by the process (detected by the P_FILE/P_NET pins), check the
chkpt_started flag. If the flag is set, postpone the I/O operation. New IPC messages (checked by
the P_IPC pin) are handled in a similar fashion to I/O operations. Newly arriving signals (checked
by the P_SIG pin) are allowed to be processed immediately.
5. Upon notification of new event EVT_*_COMP, go to step 2. 
6. Checkpoint the application process (there are no pending I/O, signals, or IPCs): (i) set all the 
user memory pages of the process as read-only (using the P_MEM pin), (ii) save the names 
and current seek offsets of all open files, and  (iii) preserve memory table and CPU state.  
7. Clear the chkpt_started flag and wait for the duration of the checkpoint interval to initiate the 
next checkpoint. 
8. Upon a write to a read-only user page in the process, handle the segmentation fault signal
raised by the system and captured by the P_SIG pin, which raises an event EVT_SIG_SEGV
to notify the TAC; duplicate the page in backup memory, hosted by a user-level dummy
process, and enable writes to the page using the P_MEM pin.
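The eight steps can be condensed into a small user-space model. It is a sketch under several assumptions rather than the TAC implementation: the class TAC and its methods are hypothetical, pages are dictionary entries, signals, IPC messages, file tables, and CPU state are not modeled, postponed I/O is assumed to resume once the checkpoint is taken, and backup copies from the previous interval are assumed to be discarded at each new checkpoint.

```python
class TAC:
    """Toy model of safe-point checkpointing with copy-on-write page backup."""

    def __init__(self, pages):
        self.pages = dict(pages)   # page number -> contents (user memory)
        self.backup = {}           # checkpoint-time copies of pages written since the checkpoint
        self.read_only = set()     # pages still write-protected
        self.pending_io = 0        # I/O operations in flight
        self.postponed = []        # I/O held back while a checkpoint is waiting
        self.chkpt_started = False

    def initiate_checkpoint(self):             # steps 1 and 2
        self.chkpt_started = True
        self._checkpoint_if_safe()

    def start_io(self, op):                    # step 4
        if self.chkpt_started:
            self.postponed.append(op)
        else:
            self.pending_io += 1

    def complete_io(self):                     # steps 3 and 5
        self.pending_io -= 1
        if self.chkpt_started:
            self._checkpoint_if_safe()

    def _checkpoint_if_safe(self):             # steps 6 and 7
        if self.pending_io > 0:
            return False                       # not at a safe point yet
        self.backup.clear()                    # assumption: old copies are dropped
        self.read_only = set(self.pages)       # write-protect every user page
        self.chkpt_started = False
        self.pending_io += len(self.postponed) # assumption: postponed I/O resumes
        self.postponed.clear()
        return True

    def write(self, page, data):               # step 8
        if page in self.read_only:
            self.backup[page] = self.pages[page]   # save the checkpoint-time contents
            self.read_only.discard(page)
        self.pages[page] = data

    def restore(self):
        # Unwritten pages already hold their checkpoint-time contents.
        self.pages.update(self.backup)
        self.backup.clear()
        self.read_only = set(self.pages)


tac = TAC({0: "a", 1: "b"})
tac.initiate_checkpoint()   # no pending I/O: checkpoint is taken immediately
tac.write(0, "A")           # first write after the checkpoint copies the old page
tac.restore()
print(tac.pages)            # contents as of the checkpoint
```

Only pages written after a checkpoint are ever copied, which is why the cost of the scheme depends on the number of written pages rather than on the size of the process image.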
 
Comparison with shadow-process checkpointing. A shadow process [10] may be forked for 
checkpointing purposes. The shadow process is inactive and only stores the process image. 
Memory pages are shared between the two processes with the copy-on-write mechanism applied. 
When the original process fails, the shadow process is started. The TAC has two advantages over 
the shadow-process method. (i) TAC preserves process states such as process ID and process 
parenthood. Preserving process ID is crucial for PID-aware applications like the Apache web 
server. (ii) The shadow-process checkpoint kills the old shadow process and forks a new one
during each checkpoint. This incurs a fairly large overhead (several milliseconds), including allocation and
deallocation of process resources and memory.
3.3.2. Custom Checkpointing for the Apache Web Server 
When the application characteristics are taken into account, the generic checkpointing scheme can 
be customized to achieve better performance. We demonstrate this by deploying a custom 
checkpointing scheme for the Apache web server, a request-transaction-based application. Apache 
consists of a parent server process and multiple child server processes, as depicted in Figure 10(a) 
[11]. 
[Figure 10 (diagram): (a) a parent server managing multiple child servers; (b) child server states 0. init, 1. mutex held, 2. wait request, 3. req. received, 4. send reply, with transitions acquire mutex, accept connection, release mutex, receive request, process http request, finish reply, and connection expire; Checkpoint 0 and Checkpoint 1 are marked.]
Figure 10: (a) Process Architecture of Apache, (b) State Transition Diagram of the Child 
Server 
The parent server manages the pool of child servers through their process IDs, and the child servers
perform the web request processing. After a connection from a client is accepted, the child server
receives web requests from the client, processes the requests, and sends back replies. If a
connection expires because there is no activity during a pre-specified period, the server waits for the
next connection. A state transition diagram¹⁵ in Figure 10(b) depicts the execution flow of an
Apache child server process.
¹⁵ Here we only consider processing basic web requests (e.g., static webpage retrievals). More complicated web
applications, e.g., e-commerce, may invoke application servers and databases, which are beyond this dissertation’s
focus.
When a child server crashes, the connection and the ongoing data transfer are broken, and a web
page failure is reported to the user. If the user initiates the request again, data must be retransmitted,
which wastes network bandwidth.
The Apache child server state does not depend on the processing history. This feature leads to the
custom application checkpointing scheme, which takes only two checkpoints of Apache, each taken once
throughout the entire life of the application, and logs the web request and the
amount of transmitted data of the incomplete reply. During a failure recovery, one of the two
checkpoints is restored with the connection preserved (the checkpoint to restore depends on what
state the application is in when the failure occurs), and the logged web request is replayed.
The process ID is preserved during the recovery. Because each checkpoint is taken only once
throughout the application’s life, the runtime overhead of the scheme is negligible.
The details of the transparent custom checkpointing scheme are illustrated in Figure 10(b). 
Checkpoint 0 is taken the first time the mutex is acquired in state 0; checkpoint 1 is taken the first 
time a request is received in state 2. When the application fails in state 0 or state 1, checkpoint 0 is 
restored, and the mutex is released if it is still held. When the application fails in states 2, 3, or 4, 
checkpoint 1 is restored, and the logged request, if there is one, is replayed. During a reply, the 
part of the reply already sent to the client is not retransmitted.  
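The recovery rule can be written as a small decision function. The sketch is illustrative only: the names State and recovery_plan, and the textual action strings, are hypothetical, while the state numbering follows Figure 10(b).

```python
from enum import IntEnum


class State(IntEnum):
    INIT = 0
    MUTEX_HELD = 1
    WAIT_REQUEST = 2
    REQ_RECEIVED = 3
    SEND_REPLY = 4


def recovery_plan(state, mutex_held=False, logged_request=None, bytes_sent=0):
    """Return the recovery actions for a child server that failed in `state`."""
    if state in (State.INIT, State.MUTEX_HELD):
        plan = ["restore checkpoint 0"]
        if mutex_held:
            plan.append("release mutex")
        return plan
    plan = ["restore checkpoint 1"]
    if logged_request is not None:
        # The part of the reply already sent is not retransmitted.
        plan.append(f"replay {logged_request!r}, skipping the first {bytes_sent} bytes of the reply")
    return plan


print(recovery_plan(State.MUTEX_HELD, mutex_held=True))
print(recovery_plan(State.SEND_REPLY, logged_request="GET /index.html", bytes_sent=4096))
```

Because the choice depends only on the state at failure time, two checkpoints taken once each are enough for the whole lifetime of the child server.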
Implementation. Four pins are employed in the custom checkpointing mechanism for Apache: 
P_MEM (memory image dumping/restoration), P_FILE (recording the information of opened 
files), P_NET (intercepting network operations for taking checkpoint 1, logging request, 
replaying request, and preventing transmission of data already sent), and P_SYSC (system calls 
for mutex manipulation, including the hook for taking checkpoint 0). Checkpointing and
recovery are performed without any instrumentation of the application by employing the
operations/events provided by these pins.
3.4. Experimental Evaluation 
The RMK is implemented as a loadable kernel module in Linux 2.6.11, and as a device driver in
Windows XP Professional SP2, on a Pentium 4 processor (1.5 GHz). The RMK pins are
implemented as part of the RMK driver by taking advantage of the kernel functions and variables
exported from the operating systems, as well as the monitoring/debugging capability of
modern processors. Experiments are conducted on the two RMK implementations to evaluate their
effectiveness and performance overhead. NFTAPE [12], a software toolset, is used to launch fault
injections and assess the effectiveness of individual RMK modules in terms of their error
coverage. We mainly present the experiment results for the RMK on Linux here because the
availability of Linux source code allows us to instrument the Linux kernel for in-depth
observation of the system behavior during experiments.
3.4.1. Evaluation of System Hang Detection 
Experiment setup. To assess effectiveness of the system hang detection, we need to create 
operational conditions that are likely to lead to system hangs. Toward this goal, a victim device 
driver (CRC-driver), which calculates a cyclic redundancy code (CRC), is implemented and attached to
the operating system. At runtime, faults are injected into the CRC-driver to induce system hangs.
A synthetic application invokes the driver to compute the CRC of a selected data set and to generate
sufficient system load to activate the errors injected into the CRC-driver. The results from
the experiments are collected for offline analysis. 
Outcome categories. Outcomes from error injection experiments are classified according to the 
categories given in Table 3. 
Table 3: Outcome Categories
Outcome Category | Description
Correct Output | The application produces correct output. The error may not be activated, or activated but not manifested. The OS performs normally during the experiment period (15 s).
Fail Silence Violation | The application produces incorrect output. The OS performs normally during the experiment period (15 s).
Kernel Exception | The application terminates abnormally. The OS raises an exception: invalid kernel paging request, kernel NULL dereference, general protection, invalid operand, or others.
Silent Hang | The application does not complete. The system hangs without reporting any exception.
Results. Table 4 gives the distribution of fault injection outcomes. Strictly speaking, as errors 
propagate within the system, a single fault may trigger multiple exceptions; the first observed 
exception or outcome is reported in Table 4.  
Table 4 indicates that 9.0% of injected errors lead to system hangs, and that the SHD (System
Hang Detection) module detects all hangs observed in our experiments. Table 5 gives five
examples of error propagations leading to system hangs. We can see that a propagated error may
cause different kinds of kernel exceptions (Kernel NULL Dereference and Invalid Kernel Paging
Request are caused in example 3). Furthermore, critical data structures in the OS can get
corrupted due to the error propagation (the Global Descriptor Table (GDT) and Task State Segment
(TSS) in the system are corrupted in examples 4 and 5). Surprisingly, kernel exceptions can be
triggered repeatedly (e.g., spin_lock_unreleased is reported more than 30 times before the GDT is
corrupted in example 5).
Table 4: Error Injections in the Victim Driver (1346 Experiments) for Linux
First Observed Failure Outcome | Number of Manifestations | Number of Hangs | Number of Detected Hangs
Correct Output | 391 (29.0%) | 1* | 1
Fail Silence Violation | 380 (28.2%) | 0 | 0
Kernel Exception: Invalid Kernel Paging Request | 249 (18.5%) | 13 | 13
Kernel Exception: Kernel NULL Dereference | 93 (6.9%) | 6 | 6
Kernel Exception: General Protection | 66 (4.9%) | 0 | 0
Kernel Exception: Invalid Operand | 62 (4.6%) | 0 | 0
Kernel Exception: Other | 15 (1.1%) | 11 | 11
Silent Hang | 90 (6.7%) | 90 | 90
Total Failed | 956** (71.0%) | 121 (9.0%) | 121
*In this experiment, the application produces correct results; however, the error propagates and
causes an invalid kernel paging request, which further leads to the system hang.
**This number includes the case marked with *.
 
Table 5: Examples of Error Propagations Leading to System Hangs
Example | First Observed Failure Outcome | Failure Propagation Trace
1 | Invalid Kernel Paging Request | Invalid kernel paging request -> Invalid kernel paging request -> Kernel panic -> hang detected
2 | Invalid Kernel Paging Request | Invalid kernel paging request -> Kernel NULL dereference -> hang detected
3 | Kernel NULL Dereference | Kernel NULL dereference -> Invalid kernel paging request -> Invalid kernel paging request -> Kernel panic -> hang detected
4 | Other | gdt double fault -> tss double fault -> hang detected
5 | Other | spin_lock unreleased -> spin_lock unreleased -> … -> spin_lock_unreleased -> gdt double fault -> tss double fault -> hang detected
SHD detects a hang after a predetermined number of instructions have executed without a context
switch, and then reboots the system. Therefore, it is not known how long the system would otherwise
remain in a non-operational state after the hang detection. Six experiments randomly selected from the 90 silent
hangs are analyzed to determine what happens to the system after the hang detection. In two of
the six experiments, the system hangs for 2 seconds, and then reports an invalid kernel paging
request; in another two, the system hangs for 10+ seconds before another failure is reported; and
in the remaining two experiments, the system hangs for longer than the observation period
(4 minutes). The analysis results show that, without SHD, the system undergoes a long period of
invalid execution. It may eventually trigger a kernel exception, or continue hanging without triggering one.
This study also indicates that, although the OS provides checking for erroneous
instruction executions (in 455 of 1346, or 33.8% of the experiments, the injected errors are
detected by the system itself), the SHD of RMK provides a complementary solution for detecting
erroneous instruction executions with a bounded latency.
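The totals quoted above can be cross-checked against Table 4 with a few lines of arithmetic. The reading of the 455 experiments as kernel exceptions that do not end in a hang is our interpretation; it reproduces the quoted figure exactly, but the text does not state it.

```python
TOTAL = 1346

# (manifestations, hangs) for each kernel-exception row of Table 4
kernel_exceptions = {
    "Invalid Kernel Paging Request": (249, 13),
    "Kernel NULL Dereference": (93, 6),
    "General Protection": (66, 0),
    "Invalid Operand": (62, 0),
    "Other": (15, 11),
}

# Hangs: the Correct Output case marked with *, the kernel-exception hangs, and the silent hangs.
hangs = 1 + sum(h for _, h in kernel_exceptions.values()) + 90

# Assumed reading: errors "detected by the system itself" are kernel exceptions without a hang.
caught_by_os = sum(m - h for m, h in kernel_exceptions.values())

print(hangs, round(100 * hangs / TOTAL, 1))                 # 121 and 9.0
print(caught_by_os, round(100 * caught_by_os / TOTAL, 1))   # 455 and 33.8
```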
Experiment results on Windows. Experiments with the same setup described above are
conducted for the RMK implementation on Windows XP, and the results are reported in Table 6.
Note that the terminology used to define outcome categories differs between Table 3 and Table 6.
As we have access to the Linux source code, we instrument the Linux system to
study error propagation (see Table 5), and the terminology in Table 3 is based on inside
knowledge of kernel behavior (e.g., kernel exception). For the Windows system, analysis of error
propagation is not possible due to the lack of access to the Windows source code. Therefore,
failure outcome categories are defined slightly differently from those shown in Table 3 (there is no
kernel exception category, and system hang replaces silent hang).
Table 6: Error Injections in the Victim Driver (802 Experiments) for Windows XP
Outcome Category | Number
Correct Output | 218 (27.2%)
Fail Silence Violation | 94 (11.7%)
System Crash | 476 (59.4%)
System Hang | 14 (1.7%)
Most of the experiments (59.4%) result in a system crash (i.e., the Windows system reboots), 11.7%
of the experiments result in a fail silence violation, and only 1.7% of
the experiments result in a system hang. Compared with the experiment results on Linux (Table 4), there are many more
system crashes and fewer system hangs. This difference arises because the default behavior of kernel
exception handlers in Windows is to reboot the system, while the default behavior of kernel exception
handlers in Linux allows the system to continue, which can lead to a system hang.
3.4.2. Evaluation of Application Hang Detection 
Experiment setup. The web server Apache 2.0.55 on Linux is the target application for fault 
injection to evaluate the AHD (Application Hang Detection). A website consisting of hundreds of 
pages and files is set up on the target machine. A web client launches multiple requests from 
another machine. Because the application consists of tens of thousands of lines of code, fault 
injection is conducted into the main loop of the server (target code), which accepts web requests, 
and sends back replies (illustrated in Figure 10(b)). Three code sections are identified in the target 
code: initialization (state 0 and the transition to state 1 in Figure 10(b)), connection acceptance (state
1 and the transition to state 2), and request processing (states 2, 3, and 4). The instruction counts of the 50
most recent executions are used to determine the check value for each code section.
Results. Three fault injection campaigns, listed in Table 7, are launched to evaluate the AHD. In
each experiment of the three campaigns, the AHD is trained with 50 web requests to learn the
initial check value. The threshold for determining the check value in all three campaigns is
set to 1.5 million instructions (equivalent to about 1 ms of execution on the processor) to
avoid small check values derived from trivial cases of code-section execution.
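The correspondence between the instruction threshold and execution time is simple arithmetic. The sketch below assumes about one retired instruction per clock cycle on the 1.5 GHz Pentium 4 used in the experiments; the IPC value is our assumption and is not stated in the text.

```python
CLOCK_HZ = 1.5e9         # clock rate of the test machine
THRESHOLD = 1_500_000    # instructions
IPC = 1.0                # assumed instructions per cycle


def threshold_duration_ms(threshold=THRESHOLD, clock_hz=CLOCK_HZ, ipc=IPC):
    """Execution time, in milliseconds, needed to retire `threshold` instructions."""
    return threshold / (clock_hz * ipc) * 1e3


print(threshold_duration_ms())   # about 1 ms under the assumed IPC
```

A lower sustained IPC would stretch the same threshold over a proportionally longer time.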
Table 7: Experiment Campaigns for Application Hang Detection
Campaign | Injected Fault | Experiments | Crashes | Detected Hangs
1 | A bit flip in the target code | 1861 | 1854 (99.6%) | 7 (0.4%)
2 | A hang in the target code | 1206 | 0 | 1206 (100%)
3 | None | 2000 | 0 | 1
Bit flips are injected into the target code in Campaign 1. Table 7 shows that the application is very 
likely to crash (99.6%) in the presence of bit errors. Only seven of the injections lead to hangs. As 
seven hangs are not sufficient to evaluate the detection coverage of the AHD, hangs are explicitly 
injected into the target code in the second campaign. A hang is injected by activating an infinite 
loop embedded in the application. The AHD module detects all hangs injected in our experiments.  
Campaign 3 was launched to measure false positives, and hence no faults are injected in this
campaign. In the 2000 experiments, only one raises a false positive. The false positive was raised
because a very large file was requested following a number of requests for small pages/files. The
transmission of this large file requires many more instructions to be executed within a code section
(2726433 instructions are executed to process the request, while the largest instruction
count for processing the previous 50 requests is 449493). The false positive may be avoided
with a longer monitoring history (e.g., 100 requests) or a higher threshold.¹⁶
¹⁶ Prior knowledge of the size of the requested file/page can be exploited to reduce false positives.
3.4.3. Performance Overhead of RMK, SHD, and AHD
Experiment setup. Experiments are conducted to study the performance overhead of the RMK.
Two machines are used in the experiments: one as the web server, and the other as the web client.
Apache 2.0.55 runs on the server machine. The web clients on the client machine are
launched by WebStone [13], a benchmark program for web servers. WebStone creates a load on a
web server by simulating the activity of multiple web clients on one or more machines. In our
experiments, WebStone starts 20-100 simulated web clients on the client machine. Three sets of
experiments are carried out: baseline (RMK not deployed), RMK+SHD (RMK with the SHD
module deployed on the server machine), and RMK+AHD (RMK with the AHD module
deployed to detect hangs of the Apache server). No fault injection is performed in the
experiments.
Results. Figure 11 gives the server throughput (amount of data transmitted per second) and
average response time for a request as a function of the number of clients. One can see that the
throughput does not change with the number of clients, and the average response time increases
linearly with the number of clients. This is because the web clients launched by WebStone send
requests to the server continuously, and the server processes the requests at its full throughput,
which is a fixed value. Because the server’s processing capability is fixed, the request response
time gets larger when more clients send requests.
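The linear growth follows from Little's law for a closed system in which every client always has one request outstanding: with N clients and a fixed request rate X, the mean response time is R = N / X. The sketch below illustrates this; the function name and the mean reply size are hypothetical and not measurements from the experiments.

```python
def response_time_s(clients, throughput_mb_s, mean_reply_mb):
    """Mean response time by Little's law (R = N / X), with no client think time."""
    requests_per_s = throughput_mb_s / mean_reply_mb   # fixed server request rate X
    return clients / requests_per_s


# Hypothetical mean reply size of 0.02 MB; doubling the clients doubles the response time.
for n in (20, 40, 80):
    print(n, response_time_s(n, throughput_mb_s=6.5, mean_reply_mb=0.02))
```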
Figure 11 shows that the performance overheads of RMK, SHD, and AHD are very small. The
actual throughput (with the 95% confidence interval) measured for the three scenarios in Figure
11(a) is 6.51±0.04 MB/s for the baseline, 6.51±0.07 MB/s for RMK+SHD, and 6.47±0.02 MB/s for RMK+AHD.
[Figure 11 (plots; series in both panels: baseline, RMK+NLD, RMK+AHD)
(a) Throughputs of the web server: throughput (MB/s) versus number of clients (20-100)
(b) Average response time for a request: request response (ms) versus number of clients (20-100)]
Figure 11: Performance Overheads of RMK and SHD/AHD Modules
The overhead of SHD is negligible because instruction counting is performed in hardware, and the 
context switch frequency is not high. AHD incurs an overhead of about 0.6% of the baseline 
performance. The AHD overhead is due to (i) system call invocations at the code section 
entry/exit points, (ii) updates of count values at context switches, and (iii) periodic timestamp 
checks. Given the low frequencies of context switches and timestamp checks, (ii) and (iii)
incur negligible overheads. If code sections are entered and exited frequently, the overhead in (i)
is noticeable, but still small.
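As a quick check of the AHD figure quoted above, the relative overhead can be recomputed from the
mean throughputs reported for Figure 11(a). The sketch below uses only the mean values and ignores
the confidence intervals; the function name relative_overhead is illustrative and is not part of
the RMK.

```python
def relative_overhead(baseline: float, measured: float) -> float:
    """Return the fractional throughput loss relative to the baseline."""
    return (baseline - measured) / baseline


# Mean throughputs reported for the three scenarios in Figure 11(a).
BASELINE = 6.51
RMK_SHD = 6.51
RMK_AHD = 6.47

shd_overhead = relative_overhead(BASELINE, RMK_SHD)
ahd_overhead = relative_overhead(BASELINE, RMK_AHD)

print(f"SHD overhead: {shd_overhead:.1%}")
print(f"AHD overhead: {ahd_overhead:.1%}")
```

With these means, the SHD overhead is zero and the AHD overhead rounds to about 0.6%, which matches
the figure stated in the text.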
3.4.4. Performance Overhead of TAC 
Three applications are used to evaluate the performance overhead of the generic checkpointing in
the TAC (Transparent Application Checkpoint) module: (i) gzip, a SPEC2000 benchmark, (ii) ns2, a
popular network simulator, and (iii) Apache, a web server. Because the TAC module aims at
providing high-efficiency checkpointing for computation-intensive applications, gzip and ns2 are
selected as target applications in the experiments. Apache is included to contrast the generic
checkpointing scheme with a custom checkpointing approach. Results show that TAC incremental
checkpointing has a performance overhead below 0.1% and reduces the checkpoint overhead by 91%
relative to the memory-dump approach (dumping all process pages to backup memory) for a
large-memory application. These results are consistent with the measurements in libckpt [9] (85%
overhead reduction) and libFT [14] (0.1% performance overhead).
3.4.4.1. Results for Generic Checkpointing 
As an in-memory checkpointing scheme, TAC has a relatively small performance overhead, so
checkpoints can be taken more frequently to avoid a large loss of work due to application
failures (the checkpoint interval in the experiments is 2 seconds). Table 8 summarizes the TAC
performance overheads (with 95% confidence intervals) and compares them with the overhead of the
memory-dump approach.
Table 8: TAC Performance Overhead During a Checkpoint Interval (2 Seconds)
Application | Data Pages, n | Written Pages, p* | TAC: Setting pages as read-only, c (ms) | TAC: Handling a page fault, s+r (ms) | TAC: Overall, TAC_ovhd (ms) | Memory-Dump Overhead (ms)
gzip | 111 (data: 2, bss: 83, heap: 0, e.v.+arg+stack: 21, lib data: 5) | 29±6 (data: 1, bss: 27, heap: 0, stack: 1, lib data: 0) | 0.0083±0.0043 | 0.0344±0.0025 | 1.01 | 1.94±0.16
Apache | 321 (data: 7, bss: 3, heap: 203, e.v.+arg+stack: 21, lib data: 87) | 32±8 (data: 5, bss: 1, heap: 17, stack: 3, lib data: 6) | 0.0191±0.0036 | 0.0344±0.0025 | 1.12 | 3.10±0.15
ns2 | 1426 (data: 158, bss: 8, heap: 1213, e.v.+arg+stack: 21, lib data: 26) | 34±6 (data: 1, bss: 1, heap: 31, stack: 1, lib data: 0) | 0.1182±0.0034 | 0.0344±0.0025 | 1.29 | 15.08±0.10
*The breakdown of written pages listed in this column is collected from a sample page-access
trace in a checkpoint interval.
Table 8 lists the number of data pages and the number of pages written during a checkpoint
interval for the three applications. The breakdown information (in parentheses) of pages accessed
during a checkpoint interval provides insights into the executions of the applications. For
example, gzip accesses a total of 111 user-space data pages, including initialized static data
(data), uninitialized static data (bss), dynamic data (heap), stack, and data of shared libraries.
Apache uses 321 pages, and ns2 uses 1426 pages, the largest number among the three benchmarks.
gzip needs the fewest pages because it compresses data in a streaming mode (read a block,
compress, write the result, read the next block, and so on), for which a small amount of
fixed-size memory is sufficient (gzip has no heap). In the conducted experiments, ns2 simulates a
network transmission protocol in a complex network configuration and explores a large state space
in the simulation, so its memory requirement is high.
Due to spatial locality, the number of memory pages written by an application within a short
period is often far smaller than the total number of pages. In our experiments, around 30 pages
are written by each of the three applications during a checkpoint interval, though the total page
counts range from 111 to 1426.
TAC checkpointing involves two activities: (i) periodically setting the entire process memory to 
read-only (the associated overhead is denoted as c in Table 8); and (ii) upon a page fault, raising 
the page fault exception (the overhead denoted as s), granting write access to the targeted page, 
and replicating the page (denoted as r). The TAC performance overhead is computed by adding 
the overheads of the two activities. Let p denote the number of page faults in a checkpoint interval. 
Then the TAC checkpoint overhead during a checkpoint interval is 
TAC_ovhd = c + (s+r)*p . 
The measurements in the experiments show s = 24.55 ± 2.02 µs and r = 9.86 ± 0.46 µs (with a
95% confidence level). Because s measures triggering a page fault as well as delivering a signal,
and r measures copying one page and granting write permission for the page, s and r have the
same values for the three benchmark applications. The values of c are different for the three
applications because different numbers of memory pages (denoted as n) are set as read-only. The
TAC_ovhd values listed in Table 8 are computed using the corresponding average values of c, s, r,
and p.
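The computation of TAC_ovhd can be reproduced with a few lines of code. The sketch below plugs the
mean values of c, s, r, and p reported in the text and in Table 8 into the formula above; the
function and variable names are illustrative only, and the confidence intervals are ignored.

```python
# Measured per-page-fault costs in microseconds (mean values from the text).
S_US = 24.55  # triggering the page fault and delivering the signal
R_US = 9.86   # copying one page and granting write permission


def tac_overhead_ms(c_ms: float, p: float) -> float:
    """Evaluate TAC_ovhd = c + (s + r) * p, with the result in milliseconds."""
    per_fault_ms = (S_US + R_US) / 1000.0
    return c_ms + per_fault_ms * p


# (c in ms, mean number of written pages p) per application, from Table 8.
MEASUREMENTS = {
    "gzip": (0.0083, 29),
    "Apache": (0.0191, 32),
    "ns2": (0.1182, 34),
}

overheads = {app: tac_overhead_ms(c, p) for app, (c, p) in MEASUREMENTS.items()}
for app, ovhd in overheads.items():
    print(f"{app}: {ovhd:.2f} ms")
```

Rounded to two decimals, the results agree with the overall TAC overheads of 1.01 ms, 1.12 ms, and
1.29 ms reported for gzip, Apache, and ns2.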
Comparison with shadow-process checkpointing. Both TAC checkpointing and
shadow-process checkpointing have overheads computed by c+(s+r)*p, but c in
shadow-process checkpointing is much larger because it includes a process fork and a process
termination (for the Apache application, the overhead associated with process forking and
termination is 1.75 ± 0.05 ms, much larger than the 0.019 ms in TAC).
Comparison with Memory-Dump Checkpointing. The overall overheads for the three
benchmark applications, 1.01 ms, 1.12 ms, and 1.29 ms, are less than 0.1% of the checkpoint
interval (2 seconds), which justifies the selection of checkpoint intervals of several seconds.
Table 8 (the rightmost column) compares the TAC checkpoint overhead with the overhead
of the memory-dump method. One can see from the table that TAC checkpointing incurs a much lower
overhead than memory-dump checkpointing, especially for applications with
large memory (1.29 ms versus 15.08 ms, a 91% improvement for ns2). This is because TAC
effectively exploits the spatial locality of memory accesses performed by applications.
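The two claims in this comparison (overhead below 0.1% of the checkpoint interval, and a 91%
improvement for ns2) follow directly from the mean values in Table 8. A minimal sketch of the
arithmetic is given below; the helper names are illustrative and the confidence intervals are
ignored.

```python
CHECKPOINT_INTERVAL_MS = 2000.0  # 2-second checkpoint interval

# (TAC overhead, memory-dump overhead) in ms, mean values from Table 8.
OVERHEADS_MS = {
    "gzip": (1.01, 1.94),
    "Apache": (1.12, 3.10),
    "ns2": (1.29, 15.08),
}


def fraction_of_interval(overhead_ms: float) -> float:
    """Share of one checkpoint interval spent on checkpointing."""
    return overhead_ms / CHECKPOINT_INTERVAL_MS


def reduction(tac_ms: float, dump_ms: float) -> float:
    """Relative overhead reduction of TAC versus memory-dump checkpointing."""
    return (dump_ms - tac_ms) / dump_ms


fractions = {app: fraction_of_interval(tac) for app, (tac, _) in OVERHEADS_MS.items()}
reductions = {app: reduction(tac, dump) for app, (tac, dump) in OVERHEADS_MS.items()}

for app in OVERHEADS_MS:
    print(f"{app}: {fractions[app]:.4%} of interval, {reductions[app]:.0%} lower than memory-dump")
```

The reduction grows with the memory footprint of the application, which is consistent with the
locality argument: the memory-dump cost scales with the total number of pages, while the TAC cost
scales mainly with the number of pages written.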
3.4.4.2. Experiment Results for Custom Checkpointing for Apache 
Experiments are conducted to evaluate the custom checkpointing for Apache with the support of 
TAC. The performances of three checkpoint/recovery approaches are compared in the 
experiments: (i) a simple restart solution, (ii) memory-dump checkpointing, and (iii) custom 
checkpointing for Apache. The comparison in terms of the runtime checkpoint overhead and 
recovery time is summarized in Table 9 (with a 95% confidence level). 
Table 9: Comparison of Three Checkpoint/Recovery Approaches
Approach | Description | Checkpoint/Logging Overhead (ms) | Recovery Time (ms)
Simple Restart | The web server is restarted upon failure. No checkpointing. | N/A | 42.02 ± 0.13
Memory-Dump | The process image is dumped to backup memory. | 3.10 ± 0.15 | 4.05 ± 0.08
Custom Checkpointing for Apache in TAC | Two checkpoints are taken only once during execution. Information logging is performed at runtime. | Not measurable* | 4.07 ± 0.07
*The time-measuring tool provided by the system is not able to measure so small an overhead,
and it reports 0.
From Table 9, one can see that if there is no checkpointing provided for Apache, the recovery time 
is large (42.02 ms), which greatly degrades web service availability. The recovery times for 
memory-dump checkpointing and custom Apache checkpointing are similar (around 4 ms), as 
both recover the application from in-memory checkpoints. But the runtime overhead of the 
custom Apache checkpointing is very small (not measurable in the system). 
 
CHAPTER 4 
FORMALIZING SYSTEM BEHAVIOR FOR EVALUATING SHD  
 
4.1. Introduction 
Operating systems are subject to hangs due to, e.g., poorly written drivers with blocking 
operations, and unreleased locks. A number of techniques have been proposed for low-latency and 
low-cost hang detection, e.g., SHD in RMK [15], KHM [16], hardware watchdog [17], and Linux 
NMI watchdog [18]. These techniques are either not systematically evaluated ([16], [17], [18]) or 
are evaluated through injection of memory or processor errors ([15]) in the program execution. 
Random fault injection enables assessing coverage and latency figures for the built-in hang 
detection mechanisms. However, due to its statistical nature, even extensive error injection 
campaigns cannot guarantee uncovering all hang cases that escape detection.  
This chapter proposes a formal approach to verify the coverage of a system hang detector, SHD,
in particular to expose the system hang scenarios that escape detection. The Linux system with the
integrated SHD is abstracted using a formal model developed for the purpose of evaluating the
SHD. The state of the model is described in terms of the computation performed by the system
(i.e., execution of relevant hardware/software components such as the hardware timer, user
threads, system calls, and interrupts), rather than in terms of its structural components (e.g.,
the cache or the register values). As a result, the system behavior in the model is represented as
an execution flow in which the executions of the software components interact and interleave. The
execution flow is modeled at the granularity of a basic computation unit (BCU), i.e.,
uninterrupted execution of code in a specific computation context. In this sense, the user-mode
execution of the program between any two consecutive interruptions (either due to a system call
or an interrupt) is regarded as a BCU. Note that a BCU is not determined by the source code, but
rather by the dynamic execution. For example, the user code execution between two consecutive
system calls is regarded as one BCU if the execution is not interrupted by an interrupt; the same
user code execution, if interrupted by an interrupt in the middle, is regarded as including two
BCUs. The unique part of this work is that, at this level of abstraction, we are able to
enumerate all interleaved executions corresponding to actual system behavior insofar as system
hangs are concerned.
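The dynamic nature of a BCU can be illustrated with a small sketch. It assumes a hypothetical
trace format in which each entry names the context of one execution step ("user", "syscall", or
"interrupt"); this format and the function names are illustrative and are not taken from the
model. As a simplification, consecutive steps in the same context are always merged, so two
back-to-back system calls would be counted as a single unit.

```python
from itertools import groupby


def split_into_bcus(trace):
    """Group a dynamic trace of execution contexts into BCUs.

    Each maximal run of consecutive steps in the same context forms one
    BCU, returned as a (context, number_of_steps) pair.
    """
    return [(context, len(list(steps))) for context, steps in groupby(trace)]


def count_user_bcus(trace):
    """Count the BCUs that consist of user-mode execution."""
    return sum(1 for context, _ in split_into_bcus(trace) if context == "user")


# User code between two system calls, not interrupted: one user BCU.
uninterrupted = ["syscall", "user", "user", "user", "user", "syscall"]

# The same user code with an interrupt in the middle: two user BCUs.
interrupted = ["syscall", "user", "user", "interrupt", "user", "user", "syscall"]

print(count_user_bcus(uninterrupted))
print(count_user_bcus(interrupted))
```

The same four steps of user code therefore yield one or two BCUs depending only on whether an
interrupt arrives during their execution, not on the source code.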
Specifically, our modeling approach consists of several steps: 
• Abstraction of the SHD and the hardware/software components in the target real-world 
system that directly interface with or are invoked by the SHD. A set of hierarchical state 
machines (derived based on the analysis of the source code semantics of the target system) is 
used to model the behavior of a software component at the granularity of a BCU. 
• Transformation of the model into a formal language representation to enable reasoning about 
the system behavior. 
• Verification of the system hang detector and identification of corner cases in system hangs 
that escape detection. The system hang is interpreted as violation of a liveness property of the 
model. Then explicit-state model checking exhaustively explores execution scenarios, 
including cases that lead to system hangs.  
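The idea behind the third step can be sketched on a toy example. The code below is not the Maude
model used in this work; it is a minimal explicit-state exploration over a hypothetical transition
relation (the state stuck_in_driver in particular is invented for illustration). It uses a
simplified reading of the liveness property: a reachable state is reported as a hang if user code
can never be reached from it again.

```python
from collections import deque


def reachable(transitions, start):
    """Return all states reachable from start (including start) by BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for successor in transitions.get(state, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def trapped_states(transitions, initial, progress_state):
    """Reachable states from which the progress state can never be reached."""
    return {
        state
        for state in reachable(transitions, initial)
        if progress_state not in reachable(transitions, state)
    }


# Hypothetical toy model: a system call may enter a driver that never returns.
TOY_MODEL = {
    "usercode": ["syscall"],
    "syscall": ["returntouser", "stuck_in_driver"],
    "returntouser": ["usercode", "scheduler"],
    "scheduler": ["usercode"],
    "stuck_in_driver": ["stuck_in_driver"],
}

hangs = trapped_states(TOY_MODEL, "usercode", "usercode")
print(hangs)
```

A full liveness check would also have to consider cycles that avoid user code while user code
remains reachable; the sketch only covers the simpler case in which progress becomes impossible.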
The contributions of this chapter are: (i) an abstraction and a formal model of a Linux system
with an integrated system hang detector, (ii) a proof that the abstraction neither misses system
hang scenarios in the real-world system, nor incurs false positives, (iii) an algorithm for
deriving all possible system execution flows by combining the behaviors of the system components
modeled in hierarchical state machines, (iv) an implementation of the model using the Maude formal
language and model checker [19], and (v) an assessment of the coverage of the target system hang
detector and identification (characterization) of execution scenarios that lead to hangs and
escape detection.
4.2. Related Work 
A number of models formulate system behavior as state machines in the literature and industry. 
An abstract state machine (ASM) is often applied for this purpose (see footnote 17). An ASM state is a step in
the execution of a program/system, rather than a point in the state space. Every algorithm (the 
verification target) is emulated step-for-step by an appropriate ASM. Because ASMs model 
algorithms at arbitrary levels of abstraction, this approach has been frequently applied to verify 
software and hardware systems [20], [21], [22]. Besides ASM, SPIN models processes in 
distributed systems as individual state machines to detect design errors in distributed protocols 
[23]. 
There are also approaches that verify system properties using functional languages. For example,
in [24], OS operations are represented in a functional language suited for manual proof of
security properties, and [25] represents an abstract OS using a functional language to study
properties of system partitioning (i.e., address space, time-sharing) for providing error
isolation.
In addition to modeling system behavior based on semantics, modeling and verification are also
conducted on the actual code of programs, libraries, and APIs. Symbolic execution falls into this
category. SLAM [26] verifies the source code of device drivers for Windows by first transforming
driver code into Boolean programs (programs manipulating only Boolean variables) and then
applying model checking for the Boolean programs. In [27], [28], the L4 microkernel API is
verified to ensure consistency among the API functions and the correctness of the API
implementation. MOP [29] instruments programs with formal verification code for runtime
program verification. A formal model is constructed for the target program, state transition in the
formal model is embedded in the program source code, and properties of the formal model are
checked upon state transition. JavaFAN [30] symbolically executes Java programs while
verifying behavior properties. Pointer taintedness analysis [31] finds security vulnerabilities in
library source code by propagating an attribute of pointer taintedness along data flows of the
library functions.
Footnote 17: More information about abstract state machines can be found at http://www.eecs.umich.edu/gasm/
4.3. System Hang Detector 
The target technique, System Hang Detector (SHD), which is evaluated through formal methods 
in this chapter, enables low-latency, transparent OS-hang detection; it has been introduced in [15]. 
SHD is designed and implemented as a light-weight kernel driver. Compared with other system 
hang detection techniques, SHD requires neither extra hardware devices, nor instrumentation of 
the kernel source code.  
The underlying fault model for a system hang assumed in SHD is that an operating system in a 
hang state does not relinquish the processor and does not schedule any processes. The detection 
principle underlying the SHD is based on a generic observation that in a system executing a set of 
equal-priority tasks, the number of instructions executed between two consecutive context 
switches is finite (or bounded). A system hang causes the instruction count to grow beyond the 
predetermined (or measured) bound.  
The SHD uses a hardware performance counter (a register available in most current generation 
processors) to count the number of instructions executed by the processor. The SHD 
implementation intercepts the scheduler by replacing the first instruction of the scheduler with a 
jump instruction to redirect the execution to the SHD code. Upon a context switch, the SHD code
resets the hardware performance counter to a preset value (in-depth discussion on determining the
value can be found in [15]). The counter is decremented each time an instruction executes.
During normal behavior of the system, the counter is regularly reset. Upon a system hang, the OS 
continues to execute instructions without relinquishing control to application processes, i.e., the 
context switch is no longer invoked. Thus, the counter is not reset, and it finally reaches zero. At 
that moment, the counter raises a Non-Maskable Interrupt (NMI) to the processor, which indicates 
a system hang.  
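The detection principle can be summarized in a short simulation. The sketch below is a software
illustration only, not the SHD driver: it assumes a hypothetical event stream in which "instr"
stands for one executed instruction and "switch" for a context switch, and it uses an
unrealistically small preset value so that the behavior is easy to follow.

```python
def run_shd(events, preset):
    """Simulate the SHD counter over a stream of events.

    "instr" decrements the counter, "switch" resets it to the preset
    value. Returns the index of the event at which the counter reaches
    zero (where the NMI would be raised), or None if it never does.
    """
    counter = preset
    for index, event in enumerate(events):
        if event == "switch":
            counter = preset
        elif event == "instr":
            counter -= 1
            if counter == 0:
                return index
    return None


PRESET = 5  # illustrative value; real presets are far larger

# Normal behavior: a context switch always arrives before the counter expires.
normal = (["instr"] * 3 + ["switch"]) * 4

# Hang: after the last context switch, instructions execute without a switch.
hang = ["instr"] * 3 + ["switch"] + ["instr"] * 10

print(run_shd(normal, PRESET))
print(run_shd(hang, PRESET))
```

In the normal trace the counter is reset before it expires, so no hang is reported; in the hang
trace the counter reaches zero five instructions after the last context switch.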
4.4. System Abstraction and Modeling 
To evaluate the SHD, an abstract model of a typical Linux system is created to thoroughly 
exercise all execution scenarios that may lead to hangs. The goal is to expose cases (i.e., hang 
scenarios) that escape detection by the detector. Our system model abstracts the basic hardware 
(e.g., timer, hardware counter) and software (e.g., processes/threads) components of a typical 
uniprocessor system (with the SHD integrated) and enables: (i) capturing behavior of these 
components to model possible execution scenarios that may lead to system hangs, and (ii) 
evaluating the hang detection coverage of SHD.  
Modeled components. A typical system may concurrently run hundreds of processes/threads that 
perform a variety of operations (e.g., system calls, kernel-level tasklets, and invocation of device 
drivers) while potentially accessing multiple hardware devices (e.g., hard drives, or network 
adapters).  
To reduce the complexity of our model while still achieving the objective of formally evaluating 
the SHD technique, we model only the behavior of the components that directly interact with or 
are invoked by the SHD. Here “directly interact with/invoke” means that in an execution flow of 
the system behavior, the BCUs of these components execute either immediately before (i.e., the 
binary code immediately preceding the SHD code) or immediately after (i.e., the binary code 
immediately following the SHD code) the invocation of the SHD. These components constitute 
the system interface to the SHD.  
By modeling only the behavior of the components interfacing with the SHD, we essentially 
assume that any system hang manifests at these interfaces and leads to detectable (or observable) 
changes in the behavior of the modeled component(s). Consequently, observing and analyzing the 
behavior of these components is sufficient to expose hang cases that escape detection. The 
assumption is proved in Section 4.4.2. 
4.4.1. Components in the Modeled System  
Source code is needed for obtaining an accurate model of the system behavior semantics. Analysis 
of the Linux source code and the code for the SHD implementation identifies the software and 
hardware components interfacing with SHD (note that the kernel source code is not needed to 
implement and deploy the SHD in the real system).  
The scheduler and the hardware counter. The SHD implementation intercepts the scheduler. 
Upon a context switch, the SHD resets a hardware performance counter to a preset value. The 
hardware counter decrements as instructions execute and reaches zero after a system hang occurs.  
Return from kernel to user. To model the behavior of the scheduler and the hardware counter, we
need to find the components that interface with the scheduler. The scheduler (in Linux 2.6) is
invoked only when the system execution flow leaves kernel mode for user mode. In the Linux
implementation, a snippet of kernel code performs the return to user mode. This code snippet
processes pending signals, IPCs, and I/O responses from device drivers for the current user thread
and checks the time slice of the current user thread to determine whether a context switch should
be scheduled. We consider only the time slice check in the code snippet and ignore the other
operations, as they are not directly involved.
Interrupt handling and system calls. As we model the execution flow which transitions from 
kernel to user mode, we also need to model the execution flow which transitions from user to 
kernel mode. As a result, interrupt handling, system calls, and exception handling should be 
modeled (in our model, exception handling is consolidated with interrupt handling because an 
exception can be regarded as a special type of interrupt).  
Timer. We need to model instruction counting by the hardware performance counter. As hundreds 
of millions of instructions are executed during a time slice in modern processors, it is not feasible 
to model counting of every single instruction. Instead, we model the time corresponding to 
instructions’ execution at the granularity of the timer interrupt interval. Therefore, the timer is 
also included in our model.  
User threads. During system hang scenarios none of the user threads makes further progress. We 
model execution of user threads for observing such progress to determine whether the system is 
hung or not.  
In summary, our abstract system model includes the following components: scheduler, hardware 
counter, return from kernel to user mode, system calls, interrupt handlers, timer, and user threads. 
Among these components, the hardware counter and timer are hardware components, and the 
others are software components executed on the processor.  
4.4.2. Assumption Proof 
A system hang manifests as lack of progress in executing the user code. In our model, this 
corresponds to computation being trapped in a BCU or a set of BCUs not part of the user code, 
e.g., a system call or an interrupt. A livelock is an example of computation being trapped in a set 
of BCUs. The two lemmas below reflect the assumptions made in Section 4.4. Proof sketches are 
given here. 
Lemma I. Any system hang in the target real-world system manifests as a hang in the modeled 
system.  
Proof Sketch. Three generic types of system hang scenarios are possible in the target real-world 
system. 
Type I. The execution flow is trapped in a BCU (or BCUs) of a component (or components) which 
are also in the modeled system. In this case, the system hang manifests in the modeled system.  
Type II. The execution flow is trapped in a BCU (or BCUs) of a component (or components) 
which are not in the modeled system. Let c be the component in the modeled system whose BCU 
is the last one executed before the system execution flow gets trapped in a component (or 
components) not in the modeled system. In this scenario, the hang is observable as computation 
being trapped in component c and hence manifests as a hang in the modeled system.  
Type III. The execution flow is trapped in executing BCUs of components, some of which are in
the modeled system while others are not. Excluding the BCUs of the unmodeled components from the
execution flow yields an execution flow of Type I, and hence the system hang manifests in the
modeled system.
Lemma II. Any system hang in the modeled system manifests as a system hang in the target 
real-world system.  
Proof Sketch. Two generic types of system hang scenarios are possible in the modeled system. 
Type I. The execution flow is trapped in a BCU of a component in the modeled system. As the 
BCU of the component is also in the target real-world system, the system hang is observable in the 
real system. 
Type II. The execution flow is trapped in multiple BCUs of a component (or components) in the 
modeled system. Then the trapped part of the execution flow corresponds to a number of 
execution flows in the real system, including BCUs of components not modeled. For each of these 
execution flows, user code is never executed (otherwise the BCUs as part of user code are also 
executed in the modeled system, and this is not a system hang). So the system hang is observable 
in the real system.   
4.5. Modeling System Behavior 
We employ finite state machines (FSMs) to model the system behavior and enable studying the
SHD’s capability of detecting system hangs. Before we dive into the details of the model, the key
features of our model are summarized here: (i) the hardware components (the hardware counter,
the timer, and the processor) are modeled in three separate FSMs to accommodate the inherent
parallelism among hardware components. Modeling the parallelism among physical processing
units makes our approach applicable to multi-core, multiprocessor, and virtual-machine-based
computing systems; (ii) the behavior of software components on the processor is modeled as a set
of hierarchical FSMs based on analysis of the source code semantics; and (iii) individual
components of the same type are not differentiated from the perspective of studying system hang
behavior (e.g., thread represents all threads in the system). Note that differentiation of
individual components may be needed to study behavior other than system hangs.
4.5.1. Modeling the Hardware Components  
This subsection describes the model for the timer and the hardware counter, as illustrated in 
Figure 12 (dashed lines indicate events across different FSMs). The timer has only one (dummy) 
state, sleep. A transition from sleep to sleep corresponds to a subsequent timer interrupt.  
The hardware counter is modeled to get decremented as instructions execute. Recall that we 
model the time corresponding to instructions’ execution at the granularity of the timer interrupt 
interval; i.e., the counter is decremented on an arrival of the timer interrupt. In Figure 12 (a), this 
is illustrated by transitioning from state i to state i+1. Without the SHD integrated into the system, 
the hardware counter is modeled as a state machine with a large (but finite) number of states 
because: (i) the counting continues for as long as instructions execute, and (ii) on an underflow
(32-bit register), the counter wraps around to state 0.
With the SHD integrated into the system (and our model), the hardware counter is reset by SHD 
upon a context switch (Figure 12 (b)). This behavior is represented in the model as a transition 
from a given counter state, state i, to the initial state, state 0. Since the context switch may occur 
whenever a user thread yields the CPU voluntarily, any state i may transition to state 0. In the 
current version of Linux (kernel 2.6), the default time slice is 100 ms, and a timer interval is 18.7 
ms. So in the normal behavior of the system, the hardware counter state is reset to state 0 after at 
most 6 transitions (ceiling(100/18.7)).  
[Figure 12 diagrams omitted. (a) Timer (state sleep) and HW counter (State 0, State 1, ...,
State i, ..., State m-1, State m), driven by timer interrupts. (b) With SHD in the system: timer
(state sleep) and HW counter (State 0, State 1, ..., State i, ..., State n, hang-detected), with
timer interrupt and counter reset events. (c) Refined model with SHD in the system: timer (state
sleep) and HW counter (states reset, timer-interval, timeslice, hang-detected), with timer
interrupt and counter reset events.]
Figure 12: Model of the Timer and Counter 
After a preset threshold number of instructions is counted, i.e., after the arrival of a preset
number of timer interrupts, the hardware counter underflows (reaches zero) and declares a system
hang. This behavior is captured as the transition from a state (state n in Figure 12 (b)) to the
next state, hangdetected, which is a final state. The preset period of time for hang detection is
a parameter of the SHD, and it is usually set larger than the time slice (e.g., 1.5 times the time
slice) to guarantee that a context switch occurs at least once before the preset hang detection
period elapses.
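The counter model of Figure 12 (b) can be written down as a small transition function. The sketch
below is an illustration under stated assumptions: the event names are invented, and the value of
n is derived here from the 1.5-times-the-time-slice example, which is one possible choice rather
than the value used in the experiments.

```python
import math

HANG_DETECTED = "hang-detected"


def next_state(state, event, n):
    """Transition function of the hardware counter model with SHD."""
    if state == HANG_DETECTED:
        return state  # final state
    if event == "counter_reset":
        return 0  # SHD resets the counter upon a context switch
    if event == "timer_interrupt":
        return state + 1 if state < n else HANG_DETECTED
    raise ValueError(f"unknown event: {event}")


def run(events, n):
    """Apply a sequence of events, starting from state 0."""
    state = 0
    for event in events:
        state = next_state(state, event, n)
    return state


TIME_SLICE_MS = 100.0
TIMER_INTERVAL_MS = 18.7

# Timer interrupts per time slice, as computed in the text: ceiling(100/18.7).
ticks_per_slice = math.ceil(TIME_SLICE_MS / TIMER_INTERVAL_MS)

# Assumption: detection period of 1.5 time slices, expressed in timer intervals.
n = math.ceil(1.5 * TIME_SLICE_MS / TIMER_INTERVAL_MS)

normal = (["timer_interrupt"] * ticks_per_slice + ["counter_reset"]) * 3
hung = ["timer_interrupt"] * (n + 1)

print(ticks_per_slice, n)
print(run(normal, n), run(hung, n))
```

Because the counter is reset after at most ticks_per_slice transitions in normal operation and
ticks_per_slice does not exceed n, the normal trace never reaches the hang-detected state, whereas
an uninterrupted sequence of n + 1 timer interrupts does.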
Further model refinement. The n+1 states (see Figure 12 (b)) in the hardware counter model
can be collapsed into three states: (i) reset, which corresponds to the initial state 0, (ii)
timerinterval, which depicts the state after the arrival of the first timer interrupt since the
reset, and (iii) timeslice, which corresponds to the end of a time slice. With the support of
formal model checking tools, it is relatively straightforward to show formally that the behavior
modeled using the complete state machine (see Figure 12 (b)) and the behavior depicted by the
refined model (see Figure 12 (c)) are equivalent in terms of their ability to expose execution
scenarios that lead to system hangs. We use the refined model in Figure 12 (c) for evaluating the
SHD.
4.5.2. Modeling the Software Components 
A set of hierarchical finite state machines (see Figure 13; dashed lines indicate transitions to 
higher-level states; dash-dot lines indicate events) is used to model the software components. 
These FSMs are derived by analyzing the source code semantics of the software components in 
the modeled system (as described in Section 4.4.1). The user thread and the scheduler are at the 
top of the FSM hierarchy (Figure 13 (a)). The behavior of threads is further decomposed using 
low-level state machines. This decomposition process continues until the granularity of a BCU is 
reached. For example, a user thread, thread, is modeled as interleaving execution of user code, 
usercode, and system call, syscall (Figure 13 (b)). The system call is further decomposed into 
execution of kernel code (system call service), kernelcode, and the post-service code for return 
from the kernel to the user mode, returntouser (Figure 13 (c)). This decomposition process
continues until a BCU (uninterrupted execution of a block of instructions), instrexec, is reached.
[Figure 13 diagrams omitted. Panels (a) through (j) show the hierarchical FSMs over the states
system, thread, scheduler, usercode, syscall, kernelcode, returntouser, interrupthdl, SHD,
instrexec, and epsilon, with the decision nodes "Time slice finished?" and "Intr from kernel?"
(branches y/n) and the event "Counter reset".]
Figure 13: Hierarchical Modeling for the System Behavior 
A dummy state, epsilon, is added to the FSMs to indicate the termination of the decomposition
process of a computation scenario. For example, the epsilon state after returntouser in Figure
13 (c) means that, after the execution flow returns to the user mode, the system call invocation
terminates.
All of the FSMs for the software components in our model are illustrated in Figure 13. The 
topmost FSM is depicted in Figure 13 (a). At any time, the system is in a thread execution context 
or the scheduler is performing a context switch. Each thread executes for a period of a time slice 
unless the thread volunteers to yield the processor.  
Due to space limitations, we do not go over all of the FSMs in Figure 13 here, but only briefly
describe the FSMs in Figure 13 (d) and Figure 13 (i). The returntouser code executes in kernel
mode; it checks whether a thread volunteers to yield the CPU or whether a time slice is used up,
and invokes the scheduler accordingly. The dashed lines in Figure 13 (d) (transitions to the
scheduler state) indicate that the execution flow transitions to a state in a higher-level FSM
(the scheduler state is the state in Figure 13 (a)). Interrupt behavior is modeled in the FSM in
Figure 13 (i). In real-world systems, an interrupt may arrive after any instruction. We apply the
coarser granularity of the BCU in modeling interrupts because it is sufficient for exploring the
state space of BCU interleavings. Note that non-deterministic behavior is modeled in both
Figure 13 (d) and Figure 13 (i).
Processor attributes. Using the interrupt FSM (Figure 13 (i)), we model execution semantics
where an interrupt can preempt system calls, the scheduler, or interrupt handlers. However,
when the system disables an interrupt (or interrupts), some (or all) incoming interrupts are not
serviced. A processor attribute, interrupt_flag, is defined in our model to capture this
behavior. Usually, when an interrupt is being serviced, the interrupt routine has the option to
enable or disable interrupts. The checking and setting of the attribute are not shown in Figure
13, to avoid cluttering the graphical representation of the FSM models.
4.5.3. Execution Flow Representation 
The execution flow of the system is depicted as a traversal of the defined (identifiable in
the model) system states. A system state is defined in the context of the BCU currently running on
the processor (processor attribute values are also included in a system state, but we ignore them
in the following discussion). The context associated with the BCU consists of the states of the
computational components from the topmost level of the FSM hierarchy down to the BCU. The
computational components on the path from the top FSM to the BCU form the
logical-expansion-to-BCU (see Figure 14), which identifies the BCU. For example, the BCU
thread.usercode.instrexec indicates that the processor is executing the user-level code of a
thread that is not interrupted by an interrupt or system call.
We should clarify two notions of state: (i) the system state, which is defined by the state of 
computation components in the logical-expansion-to-BCU, and (ii) the FSM state, which 
indicates a state in an FSM (such as the syscall in Figure 13 (b)). Figure 14 gives an example of 
computation execution flow in the case of hang-free execution scenario. 
In Figure 14, each line corresponds to a system state, and the sequence of transitions from state 1 
to state 15 represents an actual system execution flow, which consists of a sequence of executing 
BCUs. The state transitions are explained on the left of the execution flow. Both the logic 
expansion of software components into BCUs and the system state transitions are performed 
according to the FSMs in Figure 13. 
      Transition explanation           System state (logic expansion to BCU)
 1    run the user code in thread1     thread.usercode.instrexec
 2    invoke a syscall                 thread.syscall.kernelcode.instrexec
 3    to return to user mode           thread.syscall.returntouser.kernelcode.instrexec
 4    run the user code in thread1     thread.usercode.instrexec
 5    invoke a syscall                 thread.syscall.kernelcode.instrexec
 6    to return to user mode           thread.syscall.returntouser.kernelcode.instrexec
 7    intercept scheduler by SHD       scheduler.SHD.kernelcode.instrexec
 8    context switch                   scheduler.kernelcode.instrexec
 9    run the user code in thread2     thread.usercode.instrexec
10    handle an incoming interrupt     thread.usercode.instrexec.interrupthdl.kernelcode.instrexec
11    to return to user mode           thread.usercode.instrexec.returntouser.kernelcode.instrexec
12    intercept scheduler by SHD       scheduler.SHD.kernelcode.instrexec
13    context switch                   scheduler.kernelcode.instrexec
14    handle an incoming interrupt     scheduler.kernelcode.instrexec.interrupthdl.kernelcode.instrexec
15    finish the interrupt handling    scheduler.kernelcode.instrexec
16    run the user code in thread3     thread.usercode.instrexec
      …
(The execution flow runs from top to bottom; the logic expansion to BCU runs from left to right.)
 
Figure 14: An Example Execution Flow of System Behavior 
In the following, the example in Figure 14 is briefly explained. In the initial state, the system is 
executing user-level code of a thread, say, thread1. Then thread1 invokes a system call and 
resumes execution of user-level code (line 4). Further, thread1 invokes another system call, and 
when the system returns to the user mode (after servicing the system call), the scheduler is 
invoked to perform the context switch. The SHD intercepts the scheduler (to reset the hardware 
counter) and then returns to the scheduler to complete the context switch (lines 7 and 8). Then 
another thread (say, thread2) starts to execute on the processor. During thread2’s execution, an 
interrupt arrives (line 10), which is handled by the system. Soon after the return from the interrupt, 
the time slice for thread2 is used up, and a context switch occurs. When the scheduler is executing 
on the processor, an interrupt arrives (line 14) and preempts the scheduler. After the interrupt is 
handled and the scheduler completes the context switch, the next thread (say, thread3) begins 
execution (line 16).  
4.5.4. System Hang Representation 
As a system hang is time-related (the system does not schedule processes for an infinitely long 
time), it cannot be represented by the system state transitions alone. The time information in the 
state machine for the hardware counter is used to represent system hangs. Figure 15 depicts an 
example execution flow for a system hang scenario. Both the system state and the hardware 
counter state are combined in the execution flow. After a thread invokes a system call, an error in 
the system call routine causes a system hang (line 2). When a timer interrupt arrives, it is handled 
by the processor (line 3). Since we use the refined model for the hardware counter (see Figure 12 
(c)), the second timer interrupt transitions the hardware counter to the state timeslice. After 
handling the second timer interrupt (line 5), the system returns to the system call and hangs there 
(line 6). The next timer interrupt transitions the hardware counter to the hangdetected state, which 
means that the hardware counter reaches zero, and the SHD detects the system hang. Before we 
proceed with a more generic description of the derivation of the execution flow from FSMs, we 
discuss a few additional considerations in modeling the system. 
    System state                                                             HW counter
1   thread.usercode.instrexec                                                reset
2   thread.syscall.kernelcode.instrexec                                      reset
3   thread.syscall.kernelcode.instrexec.interrupthdl.kernelcode.instrexec    timerinterval
4   thread.syscall.kernelcode.instrexec                                      timerinterval
5   thread.syscall.kernelcode.instrexec.interrupthdl.kernelcode.instrexec    timeslice
6   thread.syscall.kernelcode.instrexec                                      timeslice
7   thread.syscall.kernelcode.instrexec.interrupthdl.kernelcode.instrexec    hangdetected
 
Figure 15: A System Hang Detected by the SHD 
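The hardware counter column of Figure 15 can be replayed with a small sketch. The Python below is illustrative only and is not part of the SHD or of the formal model: the class HardwareCounter and its method names are hypothetical, and the sketch assumes, as Figure 15 suggests, that each timer interrupt advances the counter by one state while an SHD interception of the scheduler returns it to reset.

```python
# States of the hardware counter, in the order they appear in Figure 15.
COUNTER_STATES = ["reset", "timerinterval", "timeslice", "hangdetected"]


class HardwareCounter:
    """Hypothetical stand-in for the hardware counter FSM used by the SHD."""

    def __init__(self):
        self.state = "reset"

    def on_timer_interrupt(self):
        # Assumption: every timer interrupt advances the counter by one state
        # until hangdetected is reached.
        index = COUNTER_STATES.index(self.state)
        self.state = COUNTER_STATES[min(index + 1, len(COUNTER_STATES) - 1)]

    def on_scheduler_intercepted(self):
        # The SHD resets the counter whenever it intercepts the scheduler.
        self.state = "reset"

    @property
    def hang_detected(self):
        return self.state == "hangdetected"


def replay(events):
    """Apply 'timer' and 'schedule' events and return the counter states seen."""
    counter = HardwareCounter()
    trace = [counter.state]
    for event in events:
        if event == "timer":
            counter.on_timer_interrupt()
        elif event == "schedule":
            counter.on_scheduler_intercepted()
        else:
            raise ValueError(f"unknown event: {event}")
        trace.append(counter.state)
    return trace


# Figure 15: three timer interrupts arrive and the scheduler never runs.
hang_trace = replay(["timer", "timer", "timer"])
# A hang-free run: the scheduler is intercepted before the counter expires.
healthy_trace = replay(["timer", "schedule", "timer", "timer"])
```

In the first replay the counter ends in hangdetected, as in line 7 of Figure 15; in the second the interception of the scheduler keeps it from getting there.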
Representing a system hang in a formal model. A system hang in an execution flow corresponds 
to a lack of progress in executing the user threads and can be interpreted as a violation of a liveness 
property of the system.  
Hangs due to design error. Some of the system hang scenarios can be due to a design error rather 
than a transient error. While our approach can still identify these scenarios, the model checker 
cannot distinguish them from system hangs due to transient errors; manual inspection of system 
hangs is required to make this distinction. 
Application hangs. The model-checking procedure explores all possible system execution 
scenarios (at the level of our modeling abstraction), including scenarios corresponding to 
application hangs. Our current approach cannot distinguish application hangs from normal 
application execution because, in both situations, user threads are continuously scheduled onto the 
processor and executed.  
4.5.5. Deriving System Execution Flows from FSMs 
System behavior, or a system execution flow, is represented as transitions between system states. 
The example given in Figure 14 also illustrates how such an execution flow is derived. Two types 
of operations are performed to derive the execution flow: (i) logical expansion to a BCU to 
represent the system state (horizontal flow in Figure 14) and (ii) system state transition along the 
execution flow (vertical flow). Any given state is first expanded to the granularity of a BCU, and 
then the system state transitions to the next system state (which needs to be expanded to the BCU 
granularity again). The expansion and the transition are performed according to the FSMs in 
Figure 13. The algorithm below specifies this derivation procedure: 
derive_control_flow(s, fsms, L) 
    s: the current system state 
    fsms: the set of FSMs 
    L: the current sequence of transition targets 
begin 
    while (true) do 
        expand(s, fsms, L);          // expand s to the granularity of a BCU, maintaining L accordingly 
        transition(s, fsms, L, s’);  // transition the system state s to another system state s’, using L 
        s = s’; 
    end while 
end 
A loop iteration in the algorithm transitions a system state to a next one, and the entire algorithm 
derives an execution flow of system behavior in terms of transitions of system states according to 
the input FSMs. We implement the algorithm in a model checking tool, which then exhaustively 
derives all the possible execution flows for the system behavior.  
Here we use the transition from line 1 to line 2 in Figure 14 as an example to demonstrate exactly 
how the expansion and transition operations work in one loop iteration of derive_control_flow(). 
Figure 16 illustrates the procedure for the operations.  
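[Figure 16 diagram. FSMs applied during the expansion: (a) system, with states thread and scheduler; (b) thread, with states usercode and syscall; (g) usercode, with states instrexec and epsilon; (i) instrexec, with states interrupthdl, returntouser, and epsilon, and the test "Intr from kernel?" (y/n). Expansion steps 1 to 4 fill in the System state column (thread, usercode, instrexec, epsilon) and the Transition target column (scheduler, syscall, epsilon, epsilon), giving thread.usercode.instrexec with transition targets (epsilon epsilon syscall scheduler). Transition steps a to c then yield thread.syscall with transition targets (scheduler).]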
 
Figure 16: Deriving the Flow of System State Transition from FSMs 
Expansion to BCU granularity. Four FSMs are applied in expanding the starting state system 
into the system state at the granularity of a BCU. The System state column records the 
computation component in different FSM levels when identifying the BCU, and the Transition 
target column records the target FSM state for the next transition in different levels. The dash-dot 
lines in Figure 16 depict the expansion procedure, and the double-dashed lines illustrate the 
system state transition.  
The expansion of system starts from the topmost FSM level. During step 1 in Figure 16, the thread 
is the current computation component in the system. After the thread finishes, the scheduler is the 
next component to execute in the system according to the system FSM in Figure 16 (a) (the 
scheduler is marked with a dot in the figure to highlight the next selected component).  To reflect 
this, write the thread in the System state column, and write the scheduler in the Transition target 
column.  
During step 2, the thread FSM is used to expand thread. Following a procedure similar to that in 
step 1, the usercode is the current computation sub-component, and the syscall is the target state 
for the next transition in this level. Write usercode and syscall in the two columns on the right. 
The expansion procedure continues until the current computation component is epsilon during 
step 4. In this case, write epsilon in the Transition target column. The expansion procedure 
terminates when BCU granularity is reached. There are two options when selecting the current 
computation components during step 4: interrupthdl or epsilon. The two options indicate whether 
or not an interrupt arrived during execution of the BCU. The epsilon is selected in Figure 16. 
Exhaustive enumeration of all of the options is performed automatically by a model checker tool. 
After the four steps, the expansion result is obtained by collecting the information in the two 
columns and represents a system state thread.usercode.instrexec and a sequence of transition 
targets (epsilon epsilon syscall scheduler).  
System state transition. The system state transition is depicted by the double-dashed lines in 
Figure 16. We start with the transition target in the lowest level. It is epsilon, which means there is 
no transition target in this level, and the instrexec component completes (step (a) in Figure 16). 
Then we proceed to the next upper level by removing the epsilon from the transition target and 
instrexec from the system state (i.e., the two symbols linked by the arrow a in Figure 16).  
Now the transition target is epsilon, and the corresponding computation component is usercode 
(the two symbols linked by the arrow b). Again, the epsilon means there is no transition target in 
this level, and we proceed to the next higher level. Then the system state is thread, and the 
transition target sequence is (syscall scheduler), step (c). As the current transition target syscall is 
not epsilon, the system state transitions to syscall by moving syscall from the transition target 
sequence to the system state. The new system state is thread.syscall, and the new transition target 
sequence is (scheduler). Following the same expansion procedure explained above, thread.syscall 
expands to thread.syscall.kernelcode.instrexec, finishing the transition from line 1 to line 2 in 
Figure 14.  
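The expansion and transition operations walked through above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Maude implementation: the FSMS table is a hypothetical, deterministic fragment of the hierarchy in Figure 13 (interrupts and the scheduler's internals are left out), and the function names are chosen here for illustration. Following the transition rule for step (c) given in Section 4.6, the sketch also records the successor of syscall (usercode) as the new thread-level target, a detail the walkthrough abbreviates.

```python
EPS = "epsilon"

# Hypothetical, deterministic fragment of the FSM hierarchy. For each
# component: the sub-component it starts in, and the successor of each
# sub-component within that FSM.
FSMS = {
    "system": {"initial": "thread", "next": {"thread": "scheduler"}},
    "thread": {"initial": "usercode",
               "next": {"usercode": "syscall", "syscall": "usercode"}},
    "usercode": {"initial": "instrexec", "next": {"instrexec": EPS}},
    "syscall": {"initial": "kernelcode",
                "next": {"kernelcode": "returntouser", "returntouser": EPS}},
    "returntouser": {"initial": "kernelcode", "next": {"kernelcode": EPS}},
    "kernelcode": {"initial": "instrexec", "next": {"instrexec": EPS}},
    "instrexec": {"initial": EPS, "next": {EPS: EPS}},
}


def initial_state():
    """Step 1: the system FSM selects its first component and its target."""
    first = FSMS["system"]["initial"]
    return [first], [FSMS["system"]["next"][first]]


def expand(state, targets):
    """Expand the innermost component until BCU granularity (epsilon)."""
    while state[-1] != EPS:
        fsm = FSMS[state[-1]]
        child = fsm["initial"]
        state.append(child)
        targets.insert(0, fsm["next"][child])  # lowest level comes first


def transition(state, targets):
    """Climb past epsilon targets, then move to the first real target."""
    while targets[0] == EPS:
        targets.pop(0)
        state.pop()
    state.pop()  # the component that just completed at this level
    target = targets.pop(0)
    parent = state[-1] if state else "system"
    state.append(target)
    targets.insert(0, FSMS[parent]["next"][target])


def show(state):
    """Render a system state in the dotted notation of Figure 14."""
    return ".".join(c for c in state if c != EPS)


state, targets = initial_state()
expand(state, targets)
flow = [show(state)]
first_targets = list(targets)
for _ in range(3):
    transition(state, targets)
    expand(state, targets)
    flow.append(show(state))
```

With this fragment, first_targets holds the sequence (epsilon epsilon syscall scheduler) from Figure 16, and flow reproduces lines 1 to 4 of Figure 14. The nondeterministic choice between interrupthdl and epsilon in step 4 is deliberately fixed to epsilon here.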
Transition target in a higher level. In the example above, the transition target is a computation 
component in the same level. However, in Figure 13(d) the transition target for the kernelcode is 
scheduler, an FSM state in a higher FSM level (in the system FSM in Figure 13 (a)). We deal with 
the problem using the following approach: during the expansion, a special FSM state, 
higherlevel_scheduler, is written in the transition target column in Figure 16 rather than the 
scheduler. Then during state transition, we directly proceed to the level of the system FSM where 
scheduler is found in the transition target sequence to match the higherlevel_scheduler.  
4.6. Implementation 
A model checker is used to reason about the system behavior and explore all possible execution 
flows, including those that lead to system hangs. Section 4.5.5 introduced two types of rules to 
govern the system state transitions: (i) expansion rules and (ii) transition rules. Since processing 
the rules involves frequent string operations, the Maude [19] model checker is selected to 
implement our formal model. Other model checking tools, such as NuSMV [32] and PVS [33], excel 
at describing Boolean variables and operations rather than string manipulation.  
Maude is a high-performance reflective language and system that supports both equation and 
rewriting-rule specification and programming for a wide range of applications. Equations are 
deterministic and cannot accommodate ambiguity or non-deterministic behavior, while rewriting 
rules can describe non-deterministic behavior. 
Maude performs rewriting with equations much faster than with rewriting rules. 
Therefore, our formal model is implemented using equations to achieve fast model checking, 
and only expansions of nondeterministic behavior (corresponding to Figure 13 (d) and (i)) are 
implemented as rewriting rules.  
Model state. The core of implementing our formal model is the proper encoding of the modeled 
system state, which consists of the following entries: 
<SysState; TransTargets; CounterState; AttrValues; SavedAttrs> 
SysState is a list of strings representing the current system state, and TransTargets is a list of 
strings indicating the sequence of the transition targets. During the expansion illustrated in Figure 
16, thread, thread usercode, and thread usercode instrexec are instances of SysState, and scheduler, 
syscall scheduler, epsilon syscall scheduler, and epsilon epsilon syscall scheduler are instances of 
TransTargets. SysState and TransTargets are manipulated according to the specifications of 
expansion rules and transition rules.  
CounterState is the current state of the hardware counter (reset, timerinterval, timeslice, or 
hangdetected). AttrValues is the set of values of processor attributes, e.g., cpumode=Kernel 
means the system is currently in Kernel mode. SavedAttrs saves the old values of processor 
attributes. Upon an interrupt, the cpumode attribute is set to Kernel, and the original value of 
cpumode is saved so that it can be restored after the interrupt is serviced.  
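To make the five-entry model state concrete, the following Python sketch mirrors it as a data class and shows the save and restore of cpumode around an interrupt. It is an illustration only: the class ModelState, the helper names, the attribute value "User", and the use of a stack for SavedAttrs are assumptions made here and are not taken from the Maude specification.

```python
from dataclasses import dataclass, field


@dataclass
class ModelState:
    """Hypothetical Python mirror of the five-entry model state."""
    sys_state: list = field(default_factory=list)      # SysState
    trans_targets: list = field(default_factory=list)  # TransTargets
    counter_state: str = "reset"                       # CounterState
    attr_values: dict = field(default_factory=dict)    # AttrValues
    saved_attrs: list = field(default_factory=list)    # SavedAttrs


def enter_interrupt(state):
    """Save the current cpumode, then switch to Kernel mode."""
    state.saved_attrs.append(("cpumode", state.attr_values.get("cpumode")))
    state.attr_values["cpumode"] = "Kernel"


def leave_interrupt(state):
    """Restore the cpumode that was saved when the interrupt arrived."""
    name, old_value = state.saved_attrs.pop()
    state.attr_values[name] = old_value


state = ModelState(
    sys_state=["thread", "usercode", "instrexec"],
    trans_targets=["epsilon", "epsilon", "syscall", "scheduler"],
    attr_values={"cpumode": "User"},  # "User" is an assumed attribute value
)
enter_interrupt(state)
mode_during_interrupt = state.attr_values["cpumode"]
leave_interrupt(state)
mode_after_interrupt = state.attr_values["cpumode"]
```

Keeping the saved values on a stack is one way to let nested interrupts restore the right mode; the source does not say how SavedAttrs handles nesting.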
Rule specifications. Maude rules are specified to reflect the change of the model state during 
system state expansions and transitions. The rules, except those for nondeterministic events, are 
implemented in an equation function deriveflow() which transforms an input model state into an 
output model state. Each step during system state expansions and transitions is specified as 
recursive invocation of deriveflow() with a specific input state for deterministic events, 
interleaved with execution of rewriting rules for non-deterministic events.  
For example, in Figure 16 the expansion of system to the BCU granularity consists of four steps. 
Steps 1, 2, and 3 are implemented as invocations of the deriveflow() with corresponding input 
states. Step 4 is specified as a rewriting rule because the arrival of an interrupt is a 
non-deterministic event. The system state transitions, in the same example (steps (a), (b), (c) in 
Figure 16), are also specified as invocations of the deriveflow(). These rules are given below to 
illustrate specifications expressed in the Maude syntax (the change in the model state is 
highlighted): 
(1) Expansion rules: 
Step 1: 
  eq deriveflow((<system; ; CounterState; AttrValues; SavedAttrs>)) 
   = deriveflow((<thread; scheduler; CounterState; AttrValues; SavedAttrs>)) . 
Step 2: 
  eq deriveflow((<SysState thread; TransTargets; CounterState; AttrValues; SavedAttrs>)) 
   = deriveflow((<SysState thread usercode; syscall TransTargets; CounterState; AttrValues; SavedAttrs>)) . 
Step 3: 
  eq deriveflow((<SysState usercode; TransTargets; CounterState; AttrValues; SavedAttrs>)) 
   = (<SysState usercode instrexec; epsilon TransTargets; CounterState; AttrValues; SavedAttrs>) . 
Step 4: 
  rl [instrexecepsilon]: 
     <SysState instrexec; TransTargets; CounterState; AttrValues; SavedAttrs> 
  => deriveflow((<SysState instrexec epsilon; epsilon TransTargets; CounterState; AttrValues; SavedAttrs>)) . 
(2) Transition rules: 
Steps a & b: 
  eq deriveflow((<SysState acomponent epsilon; epsilon TransTargets; CounterState; AttrValues; SavedAttrs>)) 
   = deriveflow((<SysState epsilon; TransTargets; CounterState; AttrValues; SavedAttrs>)) . 
Step c: 
  eq deriveflow((<SysState epsilon; syscall TransTargets; CounterState; AttrValues; SavedAttrs>)) 
   = deriveflow((<SysState syscall; usercode TransTargets; CounterState; setattr(cpumode, Kernel, AttrValues); SavedAttrs>)) . 
Each rule transforms the input model state (the left-hand side of the rule specification) into the 
output model state (the right-hand side of the rule specification) by matching the input model state 
and performing appropriate replacement. For example, the SysState thread in the rule for the step 
2 (see the specifications above) matches any system state representation ending with the thread 
component (the SysState in the input model state is a variable representing any list of component 
names). Consequently in our example, usercode is appended to these system states, and syscall is 
added to TransTargets. 
Any system state representation ending with epsilon means the BCU granularity is reached during 
expansion and system state transition should occur. The two transition rules in our example 
demonstrate this (i.e., epsilon at the end of SysState). In step (c), the system enters the kernel 
mode by invoking a system call. This is represented by using setattr(), which sets the processor 
attribute cpumode to Kernel. After step (c) the expansion resumes from the syscall component. 
In addition to the rules for expansion and transition, a number of rules (not discussed 
here due to space limitations) are specified for triggering timer interrupts, transitioning CounterState, 
and manipulating processor attributes (e.g., setattr()). SysState and CounterState may change 
simultaneously in some of these rules. 
Model Checking. A system hang in our model can be represented as a violation of a system liveness 
property. Specifically, the property that a system executes without a hang is defined as “user code 
always and eventually executes.” Typically, a system liveness property is best modeled using 
Linear Temporal Logic (LTL) [34]. However, the LTL module supported by the Maude model 
checker identifies only a single counterexample (i.e., a single case of a system hang that escapes 
detection).  
In order to uncover all system hang scenarios that escape detection (and to overcome this 
drawback of the tool), we exploit the explicit-state model checking supported by the search command 
in the Maude model checker. The search command explores the state space of the model in 
breadth-first order and checks whether an invariant predicate is violated. Two Boolean variables, 
hangisdetected and systemishung, are defined, and a predicate, [(not systemishung) or 
hangisdetected], is composed such that violation of the predicate identifies a system hang which 
escaped detection. In our model hangisdetected is set to true if the hardware counter reaches the 
hangdetected state (see Figure 12(c)). systemishung becomes true if the user code has not 
executed for a sufficiently long time – approximated by the bound period discussed below.  
Bounded model checking is used to limit the search state space. In our approach we bound the 
time interval (the bound period) for the model to terminate rather than the depth of the breadth-first 
search. Since it takes more than one time slice for the SHD to detect a system hang, we set the 
bound period to a multiple of the time slice (e.g., 2 time slices). The bound period begins when the 
hardware counter is reset, which corresponds to a transition to the reset state in the hardware 
counter FSM (see Figure 12(c)).  
Another optimization made to minimize the state space explosion puts a limit on the number of 
interrupts during a bound period as system behavior repeats itself when processing a large number 
of interrupts. Moreover, when enumerating all possible execution flows using the explicit-state 
model checking, we consider only acyclic flows since they correspond to unique behavior.  
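The invariant search described above can be illustrated on a toy state space. The sketch below is written in Python rather than Maude and does not reproduce the thesis model: the two-field state (counter index, timer ticks without user code), the three successor events, and the bound BOUND_TICKS are all assumptions chosen to keep the example small. It shows only the technique, a bounded breadth-first search that collects every reachable state violating [(not systemishung) or hangisdetected] instead of stopping at the first counterexample.

```python
from collections import deque

COUNTER_STATES = ["reset", "timerinterval", "timeslice", "hangdetected"]
HANG_DETECTED = len(COUNTER_STATES) - 1
BOUND_TICKS = 4  # hypothetical bound period, counted in timer interrupts


def hangisdetected(state):
    counter, _ = state
    return counter == HANG_DETECTED


def systemishung(state):
    _, ticks_without_user_code = state
    return ticks_without_user_code >= BOUND_TICKS


def invariant(state):
    return (not systemishung(state)) or hangisdetected(state)


def successors(state):
    counter, idle = state
    if hangisdetected(state) or systemishung(state):
        return []  # the flow ends: hang detected or bound period exhausted
    return [
        (min(counter + 1, HANG_DETECTED), idle + 1),  # timer interrupt
        (0, idle),     # SHD intercepts the scheduler and resets the counter
        (counter, 0),  # user code runs
    ]


def search_violations(initial):
    """Breadth-first search returning every reachable invariant violation."""
    seen = {initial}
    queue = deque([initial])
    violations = []
    while queue:
        state = queue.popleft()
        if not invariant(state):
            violations.append(state)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return violations


undetected = sorted(search_violations((0, 0)))
```

In this toy model the violating states are those in which user code has not run for the whole bound while repeated resets kept the counter below hangdetected, which is the same pattern as the escaped hangs discussed in Section 4.7.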
4.7. Experimental Results 
The model checking is conducted on a machine with a dual-core CPU (2.2 GHz each) and 2 GB of 
memory. The experimental results are reported in Table 10 as measurements in two categories: 
detection capability and performance of the model checking. Note that only acyclic flows within the 
bound period are counted in the analysis. The key observations from the experiments are 
summarized below: 
Table 10: Experiment Results 

Detection capability 
Bound period     # flows      # system hangs    # detected         # not detected 
2 time slices    1138893      18632             15804 (84.8%)      2828 (15.2%) 
3 time slices    103031856    1363435           1356963 (99.5%)    6472 (0.47%) 

Performance of the model checking 
Bound period     # states    # rewrites    Run time (ms)    # steps in longest flow 
2 time slices    1495        69147         122              55 
3 time slices    3686        171125        345              72 
• The percentage of detected hangs is 84.8% (15804/18632) for the 2-time-slice bound period 
and 99.5% (1356963/1363435) for the 3-time-slice bound period, as reported in Table 10. This 
percentage is not detection coverage in the usual sense because it does not reflect the 
frequency with which different hang scenarios occur at runtime. Instead, it gives the percentage of 
all unique hang scenarios that are detected by the hang detector. This is an important result, which cannot 
be obtained by employing the traditional random fault injection approach. According to Table 10, 
the number of execution flows (interleaved executions) grows exponentially with the bound 
period. The number of flows that lead to system hangs also grows exponentially. However, the number 
of flows leading to system hangs not detected by the SHD does not grow exponentially, 
because these hang scenarios form a small set with a specific nature 
(discussed below in Section 4.7.1). 
• Most execution scenarios exhibit normal behavior, and only a small percentage of the execution 
scenarios lead to system hangs. With the 2-time-slice bound period, around 1.64% (18,632 out of 
1,138,893) of execution scenarios lead to system hangs, while with the 3-time-slice bound period, 
around 1.32% (1,363,435 out of 103,031,856) of execution scenarios lead to system hangs, 
slightly less than for the 2-time-slice bound period. 
• The model checking demonstrates very good performance: it takes 69,147 rewrites (171,125 
rewrites), including equations and rewriting rules, and 122 ms (345 ms) to model check our 
system model when the bound period is 2 time slices (3 time slices). Note that during the model 
checking the entire state space of our system model (within the bound period) is traversed and the 
longest flow within the 3-time-slice bound period has only 72 steps (a step is execution of a 
rewriting rule). The small number of the steps in the longest flow, as well as the short amount of 
time for model checking, demonstrates the effectiveness of applying formal model checking to 
study behavior of complicated real-world systems regarding specific system properties.  
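The percentages quoted in the observations above follow directly from the counts in Table 10. The short Python check below recomputes them; the dictionary layout and the names TABLE_10, percent, and summary are introduced here for illustration, while the counts themselves are copied from the table.

```python
# Counts copied from Table 10, keyed by bound period (in time slices).
TABLE_10 = {
    2: {"flows": 1_138_893, "hangs": 18_632, "detected": 15_804},
    3: {"flows": 103_031_856, "hangs": 1_363_435, "detected": 1_356_963},
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


summary = {}
for slices, row in TABLE_10.items():
    undetected = row["hangs"] - row["detected"]
    summary[slices] = {
        "undetected": undetected,
        "detected_pct": round(percent(row["detected"], row["hangs"]), 1),
        "undetected_pct": round(percent(undetected, row["hangs"]), 2),
        "hang_flow_pct": round(percent(row["hangs"], row["flows"]), 2),
    }

for slices, values in summary.items():
    print(f"{slices} time slices: {values}")
```

The recomputed values agree with the table: 2828 and 6472 undetected hang flows, detection percentages of 84.8% and 99.5%, and hang-flow shares of 1.64% and 1.32%.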
4.7.1. Example Hang Scenarios That Escape Detection 
In this section, we discuss execution scenarios that lead to system hangs and escape detection.  
Scenario 1. In this scenario, a user thread, executing on the processor, uses up the allocated time 
slice when an interrupt arrives. The handler invoked to serve the interrupt first disables any 
incoming interrupt and services the current one. After the interrupt service routine terminates, it 
enables interrupts and tries to return the control to the user mode. On the return from an interrupt, 
Linux checks whether a context switch should be scheduled, and in our example, the context 
switch is needed (the user thread used up the time slice). The SHD module intercepts the context 
switch and resets the hardware counter. However, before the context switch completes, another 
interrupt arrives and the same scenario repeats. Since the SHD is always able to reset the hardware 
counter, it never detects the hang.  
Scenario 2. In this scenario, the user thread invokes a system call, and the OS enters the kernel 
mode to service the user request. On return from the system call service, the OS finds that the user 
thread must wait for a certain condition to be met before it can continue the execution. Although 
the user thread does not use up the time slice, it voluntarily yields the processor, and hence, the 
OS schedules a context switch. The system hang detection intercepts the context switch, resets the 
hardware counter, and then is trapped for some unknown reason. The next time the timer interrupt 
arrives, the system handles the interrupt and tries to schedule the context switch (as the condition 
for context switch still holds). This invokes the system hang detection, which resets the hardware 
counter and then hangs again.  
Most of the hangs that escape detection are of this nature. Our formal model enables uncovering 
these not-quite-intuitive scenarios and provides feedback to the system/application developers 
on how to enhance detection capabilities. Note that such scenarios are rather unlikely in real 
systems; e.g., an interrupt coinciding with a context switch is a low-probability event.  
4.8. Conclusions 
This chapter proposes an approach to formally verify the detection capability of a system hang 
detector. An abstract formal model of a typical Linux system is created to thoroughly exercise all 
execution scenarios that may lead to hangs. Model checking is applied to reason about the system 
behavior and to uncover the hang scenarios that escape detection. The results indicate that the 
proposed framework allows identification of corner cases of hang scenarios that escape detection 
and provides valuable insight to the developers for enhancing the detection mechanisms. 
Although a single-core standalone system is studied in this chapter, the proposed approach is 
applicable to study detection capabilities of system hang detectors in more complicated systems, 
e.g. multi-core, multi-processor, or virtualized systems. (The number of states may still be within 
capability of the state-of-the-art model checking, as indicated by the very good performance of the 
model checking reported in Table 10). In the future, we will investigate the enhancement of the 
BCU-based approach to model and verify other reliability techniques, such as 
application-transparent checkpointing [15] and middleware-level error detectors and mitigation 
techniques (e.g., [35]).  
 
CHAPTER 5 
CHECKPOINTING VMS AGAINST TRANSIENT ERRORS  
 
5.1. Introduction 
Virtual machines (VMs), also called guest systems, are frequently deployed to host a variety of IT 
services, such as web services, virtual desktops, and databases. To ensure continuous service 
availability, these systems must be capable of tolerating runtime errors. Checkpoint and rollback 
techniques can be applied to enhance VM availability. 
Virtual machine monitors (VMMs), such as VMware and Xen, provide mechanisms (a) to save a 
VM state (by stopping the VM and dumping the execution state into persistent storage) and (b) to 
migrate the VM to a remote node (e.g., [3]). Most existing VM checkpoint techniques [36], [37], 
[38] exploit these two mechanisms. For example, CEVM [36] and VNsnap [37] first use live 
migration to create a replica of the protected VM in memory and then dump the replica to disk 
offline. 
This chapter presents the description, implementation, and experimental assessment of 
VM-µCheckpoint, a VM checkpointing framework to protect both the guest OS and applications 
against runtime errors. Advantages of using VM-µCheckpoint include: 
(i) Small overhead compared with the VM replica-based failover approach.  This is achieved by 
using in-memory checkpointing and in-place recovery of VMs, i.e., recovery of a failed VM in its 
current context. No such checkpoint work has been done in the context of virtual environments.   
(ii) Alleviation of checkpoint corruption due to error-detection latency by taking advantage of 
knowledge of that latency. Using knowledge of fault/error latency for explicitly 
handling checkpoint corruption is new, and with it we can finally address this important problem.  
(iii) High checkpointing frequency—tens of checkpoints per second—which reduces the size of 
each increment when taking a checkpoint.  
(iv) Rapid recovery—within one second—compared to the stop-and-dump approach provided by 
VMMs. 
As a result, checkpointing during the normal system operation and recovery in response to a guest 
VM or application failure are completely transparent to the client (i.e., the client does not see a 
service interrupt). 
Traditional checkpointing techniques save checkpoints on disk to tolerate permanent failures. 
Several VM checkpointing techniques, including Remus [38], save checkpoints in the memory of 
another node. Here, we propose saving checkpoints in the memory of the same node.  
VM-µCheckpoint is designed as a complementary approach to disk-based VM checkpointing 
rather than its replacement. By providing a rapid recovery, VM-µCheckpoint significantly 
reduces the failure rate of VMs due to transient errors.18 We created analytical and probabilistic 
models (presented in Chapter 7) to assess the availability improvement when using 
VM-µCheckpoint.  
The major contributions of this work are:    
                                                        
18 At the same time, the checkpoint kept by VM-µCheckpoint can be dumped to disk at a sufficiently infrequent rate 
to minimize overhead. This means that in the event that a node fails, the VM and the jobs on the node can be restarted 
from the last valid disk checkpoint. 
• Design, implementation, and integration of VM-µCheckpoint in Xen VMM. The 
VM-µCheckpoint implementation does not introduce any changes to the guest VMs or 
applications. Copy-on-Write (CoW), dirty-page prediction, and pre-saving algorithms are 
designed and implemented to achieve high performance. The key innovations in the proposed 
algorithms are (i) the use of dirty-page prediction and pre-saving, which are not supported by 
Xen’s default CoW mechanism, and (ii) a mechanism to overcome the inefficiency of Xen’s CoW 
in supporting high-frequency periodic checkpointing. 
• Use of knowledge of the error detection latency to derive checkpoint intervals that minimize 
the possibility of checkpoint corruption. Our model-based analysis (in Chapter 7) shows that the 
availability of guest VMs and applications is improved from 99% to 99.98%, assuming a highly 
reliable hypervisor (MTTF of 625 days in our study). 
• An experimental assessment of VM-µCheckpoint using:  
(i) SPEC benchmark programs. The evaluation shows that VM-µCheckpoint incurs an average of 
6.3% overhead for SPEC benchmark programs with 50 ms checkpoint intervals.19 This choice 
represents a design tradeoff between keeping checkpoint size small and minimizing chances of 
checkpoint corruption due to latent errors.  
(ii) Apache server, an example network application. The results show a 17.5% throughput reduction
when taking a checkpoint every 50 ms. This overhead is significantly lower than that of existing VM
checkpointing techniques, e.g., Remus [38].
                                                        
19 Recent work on fault injection into the Linux kernel [39] shows that about 95% of crashes occur within 100
million CPU cycles (or within 50 ms on a 2 GHz processor) after an error occurrence. We select a checkpoint interval
of 50 ms in experiments to cover the latency for 95% of errors.
5.2. VM-µCheckpoint Design 
Figure 17 illustrates how VM-µCheckpoint is deployed to protect virtual machines (also called 
guest systems) running on top of a hypervisor. The protected VM in the figure is the guest system 
to be checkpointed. Another guest system on the same physical machine is selected as the 
checkpointing VM where VM-µCheckpoint is installed. (The checkpointing VM can be a guest 
system dedicated to the checkpointing service, not necessarily a privileged guest system, e.g., 
Dom0 in Xen.) The hypervisor and the kernel of the checkpointing VM are instrumented to 
support checkpointing and recovery. 
[Figure 17 diagram; labels recovered from the original: Hardware, Hypervisor, checkpointing VM (kernel, apps, checkpoint agent), protected VM (kernel, apps), and the checkpoint and restore paths.]
Figure 17: Deployment of VM-µCheckpoint 
The starting point of the proposed approach is the observation that short-latency errors are 
dominant. This is demonstrated by several previous fault injection experiments, including recent 
work on error injection into the Linux kernel [39] that shows about 95% of crashes occur within 
100 million CPU cycles (or within 50 ms on a 2 GHz processor) after an error occurrence.  Fault 
injections into processor micro-architecture [40] also show small error latencies. State-of-the-art 
error detection techniques (e.g., [41], [42], [43]) also help to limit the error latency to low values.  
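The cycle-to-time conversion quoted above follows directly from the clock rate. As a worked check, assuming a constant 2 GHz clock:

```latex
\[
T = \frac{10^{8}\ \mathrm{cycles}}{2 \times 10^{9}\ \mathrm{cycles/s}} = 0.05\ \mathrm{s} = 50\ \mathrm{ms}
\]
```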
At the same time, it has been shown in many studies that a vast majority of failures are transient 
(up to 95%). Our experiments on IBM Power series systems also demonstrate that, in VM 
environments, errors impacting the hypervisor rarely affect more than a single guest VM.20 
Latency-driven selection of the checkpoint interval. We define a parameter TB to be a
user-specified bound on error latency. By setting the checkpoint interval greater than an
acceptable latency bound (e.g., the 95th percentile), we effectively bound the probability of a
latent/undetected error affecting the checkpoint to be small (in the best case, <5%). In addition,
by always holding two checkpoints in sequence and, on detecting an error, reverting to the earlier
checkpoint (in time), we further reduce the probability of checkpoint corruption (shown in Section
7.1 to be less than 3%). This is primarily due to two factors: (i) the earlier checkpoint is taken at a
time at least TB in the past, and (ii) since by choice (i.e., per the latency distribution) the chance of
an error manifesting within an interval TB of its occurrence is greater than 95%, we have high
confidence that the checkpoint taken is error free. Hence, VM-µCheckpoint assumes the earlier
checkpoint is correct, or committed. When an error is detected or causes a failure, the system is
rolled back to a committed checkpoint.
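One way to turn a measured latency distribution into a bound TB is a nearest-rank percentile over the injected-error latencies. The following is a minimal sketch under that assumption; the function names latency_bound_ms and interval_is_valid are hypothetical and do not come from the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for latency samples stored as doubles (milliseconds). */
static int cmp_latency(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: nearest-rank percentile of n measured error
 * latencies. The array is sorted in place. pct is a whole-number
 * percentile in 1..100 (e.g., 95 for the 95th percentile).
 */
double latency_bound_ms(double *latency_ms, size_t n, unsigned pct)
{
    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(latency_ms, n, sizeof latency_ms[0], cmp_latency);
    size_t rank = ((size_t)pct * n + 99) / 100;   /* ceil(pct * n / 100) */
    return latency_ms[rank - 1];
}

/* The text requires Tck > TB; returns 1 if a candidate interval qualifies. */
int interval_is_valid(double tck_ms, double tb_ms)
{
    return tck_ms > tb_ms;
}
```

A caller would pass the latencies collected from an error injection campaign, take the returned value as TB, and then reject any candidate Tck for which interval_is_valid returns 0.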
A user-level process in the checkpointing VM, referred to as the checkpoint agent in Figure 17,
takes a checkpoint of the protected VM periodically, at intervals of Tck (where Tck > TB), and
stores the checkpoint in the checkpointing VM. Since at each checkpoint the Copy-on-Write
(CoW) mechanism is invoked to identify and store the needed state information, the checkpoint
agent stores only a small fraction of the protected VM state rather than the entire system image.
This approach allows the checkpoint agent to store checkpoints of multiple guest systems on the
same physical machine and in a small amount of memory.
20 My colleague, Weining Gu, conducted this error injection campaign. The results of the campaign have not been
published yet.
5.2.1.   Checkpointing Algorithms 
At the beginning of a checkpoint interval, all memory pages in the protected VM are set as 
read-only. From that point on, any write to a read-only page triggers a page fault, the original data 
of the page is copied into the checkpoint kept in the checkpoint agent memory, and the stored 
memory page is set as writable. The checkpoint therefore consists of original data of only those 
pages updated within the interval. As mentioned above, the two most recent checkpoints are kept 
all the time. When an error in the protected VM is detected or causes a failure, (i) the last 
checkpoint is copied back into the current state of the protected VM, and then (ii) the earlier of the 
two kept checkpoints is copied into the system. This method restores the system to the state in 
which the earlier checkpoint (a committed checkpoint, which is unlikely to have been corrupted) 
was taken.  A detailed explanation of this method follows. 
Figure 18 illustrates the timelines of this checkpointing/recovery scheme. Two complete
checkpoint intervals, [t0, t1) and [t1, t2), are shown in Figure 18. The horizontal axis at the top of
the figure represents error-free execution of the protected VM, while the horizontal axis at the
bottom represents execution when an error occurs at tf_s. The error is detected (or the
application/system fails) at tf_d (the error latency $T_f = t_{f\_d} - t_{f\_s} \le T_B$). At tf_d the two most recent
checkpoints are those taken at t0 and t1. We first restore the data preserved during the time interval
[t1, tf_d) into the protected VM, then we restore the data preserved during [t0, t1), to roll back the
system to the state at time t0.
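[Figure 18 timelines; labels recovered from the original: (a) checkpoint times t0, t1, t2 spaced by Tck, states S0, S1, S2, and preserved data DP0, DP1, DP2; (b) predicted sets H0, H1, H2 and preserved data DP0', DP1', DP2'; error occurrence tf_s, detection tf_d, and latency Tf on the execution time axis.]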
Figure 18: Timelines for Two Checkpoint Strategies: (a) CoW-B and (b) CoW-P 
In the algorithm described above, called Copy-on-Write Basic (CoW-B), setting all memory pages
as read-only at the beginning of a checkpoint interval potentially results in a large number of page
faults and a significant performance overhead. An optimized version of the basic algorithm, called
Copy-on-Write Pre-saving (CoW-P), is designed to reduce the resulting page faults
(checkpoint-caused page faults are reduced by 75% when the checkpoint interval is 50 ms in our
experiments; see Section 5.4.3).
The CoW-B algorithm. This algorithm is depicted as the timeline (a) in Figure 18. Here are the 
notations used in our discussion: 
ti: Beginning time of the ith checkpoint interval.
Si: State of the protected VM at time ti.
DPi (dirty pages): Data of the memory pages preserved by VM-µCheckpoint’s mechanism during [ti, ti+1].
St: State of the protected VM at any time t (t∈[ti, ti+1]).
DPi(t): Data of the memory pages preserved by VM-µCheckpoint’s mechanism during [ti, t] for any time t (t∈[ti, ti+1]).
The following operation reflects inherent relationship between Si, St, and DPi(t):
$$S_i = \mathrm{restore}(S_t, DP_i(t)), \qquad (1)$$
where $\mathrm{restore}(S_t, DP_i(t))$ denotes an operation of copying the data preserved in DPi(t) into their
corresponding memory pages in St to restore the system to state Si.
In the example error scenario shown in Figure 18, $t_1 \le t_{f\_d} \le t_2$, $T_f \le T_B \le T_{ck}$, and S0 is the last
committed checkpoint. Applying the operation (1) twice, we can derive the expression that depicts
restoration of S0:
$$S_0 = \mathrm{restore}(S_1, DP_0(t_1)) = \mathrm{restore}(\mathrm{restore}(S_f, DP_1(t_{f\_d})), DP_0), \qquad (2)$$
where $S_{t_1} = S_1$, $DP_0(t_1) = DP_0$, and Sf denotes the system state at tf_d. At the restoration time tf_d, Sf,
DP1(tf_d), and DP0 are all available, and we can restore the memory state of the protected VM into
S0. After restoration, neither DP1(tf_d) nor DP0 is valid any more, as the system is now in state S0,
so they are discarded after the restoration.
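The CoW-B bookkeeping and the two-step restoration in (2) can be illustrated with a toy model. This is a minimal sketch, assuming a tiny array stands in for the protected VM's memory and a boolean flag stands in for the write-permission bit; toy_vm, dirty_set, and the function names are hypothetical and are not part of the Xen-based prototype.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES    8
#define PAGE_SIZE 16   /* toy page size, for illustration only */

/* DP_i: original contents of the pages first written during interval i. */
typedef struct {
    bool saved[NPAGES];
    unsigned char data[NPAGES][PAGE_SIZE];
} dirty_set;

typedef struct {
    unsigned char mem[NPAGES][PAGE_SIZE];  /* current VM memory state */
    bool writable[NPAGES];                 /* emulated write permission */
    dirty_set dp[2];                       /* dp[0]: older interval, dp[1]: current */
} toy_vm;

/* Interval boundary: drop the oldest DP, age the current one,
 * and write-protect every page again. */
void begin_interval(toy_vm *vm)
{
    vm->dp[0] = vm->dp[1];
    memset(&vm->dp[1], 0, sizeof vm->dp[1]);
    memset(vm->writable, 0, sizeof vm->writable);
}

/* A guest write: the first write to a protected page "faults",
 * and the pre-write contents are preserved before the write proceeds. */
void vm_write(toy_vm *vm, int page, int off, unsigned char value)
{
    if (!vm->writable[page]) {
        memcpy(vm->dp[1].data[page], vm->mem[page], PAGE_SIZE);
        vm->dp[1].saved[page] = true;
        vm->writable[page] = true;
    }
    vm->mem[page][off] = value;
}

/* restore(S, DP): copy every preserved page back into memory. */
static void restore(toy_vm *vm, const dirty_set *dp)
{
    for (int p = 0; p < NPAGES; p++)
        if (dp->saved[p])
            memcpy(vm->mem[p], dp->data[p], PAGE_SIZE);
}

/* Equation (2): undo the current interval first, then the older one,
 * then discard both DPs and start again from the committed state. */
void rollback(toy_vm *vm)
{
    restore(vm, &vm->dp[1]);
    restore(vm, &vm->dp[0]);
    memset(vm->dp, 0, sizeof vm->dp);
    memset(vm->writable, 0, sizeof vm->writable);
}
```

Calling begin_interval at t0 and t1, issuing writes in both intervals, and then calling rollback returns the toy memory to its contents at t0, which mirrors the order of the two restore() applications in (2).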
The CoW-P algorithm.  This algorithm reduces the number of page faults by predicting the 
pages to be updated in the upcoming checkpoint interval and pre-saving the predicted pages in the 
checkpoint when this interval begins. These pre-saved pages are marked as writable and do not 
raise page faults. 
The typical checkpoint intervals selected in our approach range from tens of milliseconds to
several seconds. Due to the spatial and temporal locality of memory accesses, pages that were
updated recently tend to be updated again in the near future. Therefore, the pages dirtied in the
previous checkpoint interval are used to predict the pages to be updated in the upcoming interval.
Specifically, a page table supported by current-generation processors maintains an entry for each
memory page. The page entry has two control bits, the write permission bit and the dirty
bit, which are leveraged for our prediction. (We manipulate the shadow copy of this page table
maintained by the VMM, rather than the page table in the guest operating system. In this way, the
guest system’s use of its page table is not interfered with.) The write permission bit controls
whether the page is writable, and the dirty bit shows whether the page has been updated since the
dirty bit was last cleared. At the beginning of a checkpoint interval, both of the bits for non-dirty
pages are cleared (i.e., set as read-only and not dirty). The pages dirtied in the previous
checkpoint interval are pre-saved in the checkpoint; their write permission bits are set to allow
writes to them, and their dirty bits are cleared to enable tracking of whether they will be updated
during the upcoming interval. If a page dirtied in the previous interval is not updated during the
upcoming interval, then next time (i.e., after this upcoming interval) this page is not pre-saved and
is set as read-only. Figure 18 (b) shows the timeline of CoW-P.
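The bit manipulation described above can be sketched as follows. This is a toy model, assuming two booleans per page stand in for the write permission and dirty bits of a shadow page table entry; toy_pte, cowp_state, and the function names are hypothetical, and page contents are omitted so that only the prediction and fault counting are shown.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Stand-in for the two control bits of a shadow page table entry. */
typedef struct {
    bool writable;  /* write permission bit */
    bool dirty;     /* dirty bit, set when the page is written */
} toy_pte;

typedef struct {
    toy_pte pte[NPAGES];
    bool presaved[NPAGES];   /* pages copied up front from the prediction (H_i) */
    bool cow_saved[NPAGES];  /* pages saved on a write fault (DP_i') */
    int faults;              /* checkpoint-caused page faults so far */
} cowp_state;

/* Interval boundary: pages dirtied in the previous interval are pre-saved
 * and left writable; all other pages become read-only. Every dirty bit is
 * cleared so that writes in the new interval can be tracked. */
void cowp_begin_interval(cowp_state *s)
{
    for (int p = 0; p < NPAGES; p++) {
        bool predicted = s->pte[p].dirty;
        s->presaved[p] = predicted;
        s->cow_saved[p] = false;
        s->pte[p].writable = predicted;
        s->pte[p].dirty = false;
    }
}

/* A guest write: only pages that were not pre-saved take a fault. */
void cowp_write(cowp_state *s, int p)
{
    if (!s->pte[p].writable) {
        s->faults++;
        s->cow_saved[p] = true;
        s->pte[p].writable = true;
    }
    s->pte[p].dirty = true;
}
```

In this model a page written in two consecutive intervals faults only in the first one, while a pre-saved page that is not written again drops out of the prediction at the next boundary and becomes read-only.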
More formally, let Hi denote data of the memory pages updated in the checkpoint interval [ti-1, ti) 
(H0 is obtained by profiling system execution before t0), DPi' denote data of the pages preserved 
by CoW-P during [ti, ti+1), and DPi'(t) be data of the pages preserved by CoW-P during [ti, t) for 
any t∈[ti, ti+1]. Then we use Hi as prediction of DPi. Using the restore() operation defined in (1), 
we have: 
$$S_i = \mathrm{restore}(S_t, DP_i'(t) \cup H_i). \qquad (3)$$
Due to the inaccuracy of dirty page prediction, Hi includes data of pages that are not updated in [ti,
t]. Similar to the discussion on CoW-B, the expression that represents restoration of S0 is derived
as:
$$S_0 = \mathrm{restore}(S_1, DP_0'(t_1) \cup H_0) = \mathrm{restore}(\mathrm{restore}(S_f, DP_1'(t_{f\_d}) \cup H_1), DP_0' \cup H_0), \qquad (4)$$
where $S_{t_1} = S_1$ and $DP_0'(t_1) = DP_0'$.
5.2.2.   Disk-Based Checkpointing 
In order to recover a VM from a permanent failure or a hypervisor failure, VM checkpoints can
be saved to disk. Figure 19 illustrates the extension of VM-µCheckpoint to support disk-based
VM checkpointing.
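[Figure 19 timeline; labels recovered from the original: checkpoint times t0, t1, …, tk, tk+1 spaced by Tck, states S0, S1, …, Sk, preserved data DP0, DP1, …, DPk, and SCANDATA along the execution time axis.]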
Figure 19: Disk-Based Checkpointing 
Specifically, in addition to the in-memory checkpointing described in Section 5.2.1, the
checkpoint agent scans the protected VM and saves every memory page (i.e., the SCANDATA in
Figure 19). Suppose the scan starts in [t0, t1) and finishes in [tk, tk+1). We define an operation
collect such that $\mathrm{collect}(DP_0, DP_1)$ merges the data in DP0 and DP1: if a memory page is saved in both
DP0 and DP1, only the data of the page in DP0 is preserved after the merge. Then,
$$DPS = \mathrm{collect}(\ldots \mathrm{collect}(\mathrm{collect}(DP_0, DP_1), DP_2) \ldots, DP_k)$$
$$S_0 = \mathrm{collect}(DPS, SCANDATA),$$
where DPS is the t0-state of the memory pages which are modified (updated/created/deleted)
during [t0, tk+1). So the checkpoint agent keeps DP0, DP1, …, DPk as well as SCANDATA in order
to support disk-based checkpointing. After tk+1 the checkpoint agent writes collect(DPS, SCANDATA),
which is the VM checkpoint at t0, to disk.
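The collect operation is a left-biased merge, and the t0 checkpoint is a fold of that merge over DP0, …, DPk followed by one merge with SCANDATA. A minimal sketch, assuming one int stands in for the contents of a page; page_set and checkpoint_at_t0 are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* A set of saved pages; one int stands in for a page's contents. */
typedef struct {
    bool has[NPAGES];
    int data[NPAGES];
} page_set;

/* collect(a, b): union of a and b; when a page is in both,
 * the copy from a (the earlier one) is kept. */
page_set collect(const page_set *a, const page_set *b)
{
    page_set out = *a;
    for (int p = 0; p < NPAGES; p++) {
        if (!out.has[p] && b->has[p]) {
            out.has[p] = true;
            out.data[p] = b->data[p];
        }
    }
    return out;
}

/* Fold collect over dp[0..count-1] to obtain DPS, then merge DPS with
 * the scanned pages to obtain the state at t0. */
page_set checkpoint_at_t0(const page_set *dp, int count, const page_set *scan)
{
    assert(count > 0);
    page_set dps = dp[0];
    for (int i = 1; i < count; i++)
        dps = collect(&dps, &dp[i]);
    return collect(&dps, scan);
}
```

Because DPS is passed as the first argument of the final merge, any page that was modified while the scan was running takes its t0 contents from DPS rather than from the possibly newer copy in SCANDATA.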
5.2.3.   Discussion 
Checkpoint data. The state of a VM includes the virtual processor state, the entire kernel address 
space, and the address spaces of all the processes in the VM, including all memory pages 
belonging to these address spaces. From the hypervisor point of view, all these memory pages are 
managed as pseudo-physical pages. The hypervisor does not need to differentiate whether a page 
belongs to a user process or to the kernel.  
The checkpoint data in VM-µCheckpoint consists of the original state of virtual CPUs and the 
original state of dirty pages (pseudo-physical pages). The hypervisor provides functionality to 
capture the state of virtual CPUs. For example, in the Xen VMM, when the hypervisor suspends a 
VM, the state of the virtual CPUs is frozen and can be saved in a hypervisor data structure called 
vcpu_guest_context. This data structure is listed below (for Xen 3.3.1 on top of 32-bit Intel x86
processors). From the data structure, we can see that all the user registers, control registers, debug 
registers, FPU registers, etc., are saved in the checkpoint.  
struct vcpu_guest_context { 
    /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */ 
    struct { char x[512]; } fpu_ctxt;       /* User-level FPU registers     */ 
    unsigned long flags;                    /* VGCF_* flags                 */ 
    struct cpu_user_regs user_regs;         /* User-level CPU registers     */ 
    struct trap_info trap_ctxt[256];        /* Virtual IDT                  */ 
    unsigned long ldt_base, ldt_ents;       /* LDT (linear address, # ents) */ 
    unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */ 
    unsigned long kernel_ss, kernel_sp;     /* Virtual TSS (only SS1/SP1)   */ 
    /* NB. User pagetable on x86/64 is placed in ctrlreg[1]. */ 
    unsigned long ctrlreg[8];               /* CR0-CR7 (control registers)  */ 
    unsigned long debugreg[8];              /* DB0-DB7 (debug registers)    */ 
    unsigned long event_callback_cs;        /* CS:EIP of event callback     */ 
    unsigned long event_callback_eip; 
    unsigned long failsafe_callback_cs;     /* CS:EIP of failsafe callback  */ 
    unsigned long failsafe_callback_eip; 
    unsigned long vm_assist;                /* VMASST_TYPE_* bitmap */ 
}; 
 
Because we leverage the shadow-paging feature of Xen (all of the guest system’s
pseudo-physical pages are managed by the shadow page table; more details are given in Section
5.3.1), we are able to control whether a page is read-only and to trace whether a page is dirty.
When there is a write to a read-only page, a page fault is triggered and reported to the hypervisor, 
and we can save the pre-write state of the page in the checkpoint.  
There are exceptional cases when a memory page may be updated with the shadow paging 
bypassed. For example, if a privileged instruction in the guest system tries to write to a page, the 
instruction is trapped into the hypervisor due to the privilege violation. Then the hypervisor 
emulates the instruction by mapping and writing to the target page. To deal with this issue, we 
may modify the emulate_privileged_op() function in Xen such that the pre-update state of the 
modified pages is correctly checkpointed.  
Another example is the memory shared between different VMs on top of the same physical 
machine. For example, Xen provides the grant table mechanism, which allows memory pages of a 
VM to be shared with other VMs. This can be used for transferring I/O data between domain 0 
and another VM. In this case, if the memory page is provided by domain 0, then the page is not 
managed by the shadow page table in the protected VM.  
We found such a scenario. The network driver of a VM consists of two parts, the front end in the
VM and the back end in domain 0. The two parts share a data buffer. When traffic data is
received from outside, the back end fills the buffer with data (and marks the corresponding buffer
slot as occupied) while the front end reads data from it (and marks the buffer slot as empty).
Suppose the front end is about to read slot 8 of the buffer when a failure occurs, and the VM
recovers to a checkpoint taken when the front end was about to read an earlier buffer slot, say
slot 3. As the buffer is owned by domain 0, the buffer is not checkpointed or rolled back upon the
failure. So the front end, after recovery, finds the slot is empty and reports a kernel exception. To
handle this problem, we should save the shared pages and the relevant data structure state in the
checkpoint. When the VM is recovered upon a failure, the checkpointed shared pages are restored
in domain 0. The back end of the network driver in domain 0 should be modified appropriately to
know about this recovery and be able to work correctly with the recovered state.
Selecting TB and Tck. It should be clear from the above discussion that both error latency and
checkpoint overhead are considered when selecting a checkpoint interval Tck. In our scheme,
checkpointing with a larger interval incurs smaller overhead but causes a longer output delay
(when output commit is employed) and a larger checkpoint size. Hence, there is a trade-off in
Tck selection. For example, if a small output delay is desired, a small Tck is preferred as long as Tck
is no smaller than the selected TB and checkpoint overhead is acceptable.
Error detection latency depends on error detection techniques (e.g., [41], [42], [43]). As error 
detection is not in the scope of this work, we consider system failure as a kind of error detection. 
To obtain the distribution of error detection latency, we inject errors into a target system and 
measure the latency from error activation to failure occurrence. We developed an analytical model
(Chapter 7) to study the impacts of Tck on checkpoint corruption and system availability. Based
on the analytical model and the obtained error latency distribution, we can select the proper Tck.
Error model. VM-µCheckpoint recovers a guest system and applications in the system from any 
transient hardware error or transient software error (including both application and system errors). 
Transient hardware errors include those occurring in the processor (functional units, registers, 
caches, buses, and control logic) and memory due to events such as radiation or current
disturbance. Transient software errors, or Heisenbugs [44], include exceptional conditions (e.g., 
counter overflow and interrupt arrival with bad timing), occasional device driver faults, race 
conditions, and corrupted parameters or data due to bad transmission. Note that transient failures 
of the checkpointing VM are handled by an immediate restart of the failed checkpointing VM. 
VM-µCheckpoint cannot guarantee recovery if either of the following holds: (i) Checkpoint 
corruption. There is a small but finite probability of checkpoint corruption.  In this case, 
VM-µCheckpoint aborts recovery and restarts the VM and the interrupted jobs. (ii) Failure of the 
hypervisor due to a transient fault. In this case, we first restart the hypervisor and restart all jobs 
executing prior to the failure. If this is unsuccessful, the system rolls over to an adjacent physical 
node and restarts.  
I/O handling. This work focuses on the analysis, design, and implementation of memory-state 
checkpointing in VM-µCheckpoint. For I/O handling, we adapted the output-commit mechanism 
applied in [38], [45] to fit into VM-µCheckpoint. In this mechanism, output of a system is held 
(i.e., not delivered to hardware devices) until a checkpoint is taken. This mechanism masks 
recovered errors of the system, i.e., these errors are not viewed by other components (disks, 
network cards, nodes, etc.).  
Basically, the checkpoint agent in VM-µCheckpoint is designed to hold and release output of the 
protected guest system; if preferred, a copy of input to the protected VM is saved in the 
checkpoint agent for replay. The hypervisor and the checkpointing VMs are instrumented to 
provide support to the checkpoint agent for this purpose. Note that the checkpoint agent maintains 
two pools of held outputs and saved inputs corresponding to the two checkpoints.  
5.3. Implementation 
Our fully working prototype is implemented in Xen. The source code of the Xen hypervisor and
the checkpointing VM is instrumented, while there is no change to the protected VM.21
Xen supports multiple types of virtualized guest systems, including ParaVirtualized systems (PV) 
and Hardware-aided Fully-Virtualized systems (HVM). We select a Xen PV guest system as the 
protected system in our implementation because we have more experience with the Xen PV 
system. The proposed VM-µCheckpoint mechanism can also be implemented to support Xen 
HVM systems. 
5.3.1. Overview of Shadow Paging 
The VM-µCheckpoint implementation leverages a feature of virtualization technology, shadow 
paging, for identifying all memory pages belonging to a specific virtual machine, setting page 
access privileges, and intercepting page faults for preserving memory pages.  
Two types of addresses are recognized in an operating system: linear addresses and physical
addresses. A linear address is the reference of a memory object in a process address space; a
physical address provides the information of where the memory object is really located in
physical memory. Note that a processor is always within a process context, regardless of whether
the processor is executing a user-level instruction, servicing an interrupt/system call, or doing any
other task. As a result, a linear address is always used to access a memory object.
21 The I/O recovery mechanism is not implemented yet in the current prototype.
In a system without virtualization, the processor translates a linear address into the corresponding 
physical address and accesses the correct object in physical memory via a memory management 
unit (MMU) in the processor. The MMU consults a page table, maintained by the operating 
system, for address translation.  
The shadow-paging feature of Xen allows a separate page table (shadow page table, or SPT) to
be created for a virtual machine, which also maintains its own page table (guest page table, or
GPT). The SPT is different from the GPT in that the SPT maps a linear address to the real memory 
location in the physical machine (called the machine address), while the GPT maps the linear 
address to the virtual physical address in the guest operating system’s view (called the guest 
physical address).  
Two points should be noted here: (i) Xen is able to access any memory location with a given 
machine address by setting up a mapping from a linear address within the Xen hypervisor address 
range (this address range is shared among all guest system address spaces) to the given machine 
address, and (ii) the MMU hardware looks up in the SPT instead of the GPT while translating 
linear addresses.  
The Xen hypervisor synchronizes the SPT with the GPT so that the two page tables are consistent 
at all times, i.e., any linear address of a process in the guest system has the correct virtual physical 
address in the GPT and the correct real machine address in the SPT with the same bookkeeping 
information in the two tables.  
The guest system is unaware of the SPT’s existence. Xen implements the SPT in a transparent
fashion for systems on processors lacking native support for shadow paging. Moreover,
latest-generation processors have built-in support (e.g., Intel Extended Page Table and AMD
Nested Page Table) for shadow paging, which hides the SPT from guest operating systems.
5.3.2. Data Channel 
The checkpoint agent allocates a number of memory pages for storing two checkpoints. Each of 
these memory pages has the following information kept in a page record:  (i) a pointer to the 
memory page in the checkpoint agent address space, (ii) the machine address of this memory page, 
and (iii) the guest physical address of the saved page in the protected system (i.e., the content of 
this page is stored in this memory page).  
The page records of these memory pages for checkpoint storage are organized in a circularly
linked list (shown in Figure 20). The list is linked with both pointers in the checkpoint agent
address space and machine address information.  Thus, both the checkpoint agent and the Xen 
hypervisor are able to traverse the page list and access these memory pages.  
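The page record and the doubly addressed ring can be sketched as a C structure. This is an illustrative layout only, assuming 64-bit integers for machine and guest physical addresses; the structure, its field names, and the helper functions are hypothetical and are not taken from the prototype's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One record per checkpoint storage page (hypothetical layout). */
typedef struct page_record {
    void *page_va;                /* (i) storage page in the agent address space */
    uint64_t page_ma;             /* (ii) machine address of the storage page */
    uint64_t saved_gpa;           /* (iii) guest physical address of the saved page */
    struct page_record *next_va;  /* next record, as seen by the checkpoint agent */
    uint64_t next_ma;             /* next record, as a machine address for the hypervisor */
} page_record;

/* Link n records into a ring. rec_ma[i] is the machine address of
 * record i, which the caller is assumed to have obtained already. */
void ring_init(page_record *recs, const uint64_t *rec_ma, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t j = (i + 1) % n;
        recs[i].next_va = &recs[j];
        recs[i].next_ma = rec_ma[j];
    }
}

/* Walk the ring from r using the agent-side links. */
page_record *ring_advance(page_record *r, size_t steps)
{
    while (steps-- > 0)
        r = r->next_va;
    return r;
}
```

Keeping both link forms in each record is what lets the agent follow ordinary pointers while the hypervisor follows machine addresses that it maps into its own address range.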
[Figure 20 diagram; labels recovered from the original: list of page records, memory pages for checkpoint storage, and the page ranges of Checkpoint 0, Checkpoint 1, and Checkpoint 2.]
Figure 20: Memory Pages for Checkpoint Storage 
This list of memory pages for checkpoint storage is the data channel (shown in Figure 17) set up
between the protected guest system and the checkpoint agent. The two checkpoints maintained by
VM-µCheckpoint at any time coexist in this list of pages (for example, checkpoints 0
and 1 are in the list of pages in Figure 20). Each checkpoint occupies a range of these pages.
When the latest checkpoint (checkpoint 1 in the figure) is committed, the older checkpoint
(checkpoint 0) is discarded by invalidating its range of pages, and the new checkpoint (checkpoint
2) is stored following the last occupied page.
Besides hosting the memory pages for checkpointing memory data in the protected system, the 
checkpoint agent provides space for storing the states of all virtual CPUs in the protected system.  
5.4. Experimental and Measurement Results 
The testbed consists of a physical machine with an AMD Athlon 2800 (1.8 GHz) processor and 
1.5 GB memory. There are only two guest systems (Linux 2.6.18) running on top of Xen 3.3.1 in 
the testbed. The Dom0 is selected as the checkpointing VM, and the other guest system (a DomU) 
is the protected VM. 512 MB and 1 GB memory are assigned to the Dom0 and the DomU, 
respectively. We use only two VMs in experiments in order to accurately measure performance 
overhead in a relatively simple deployment.  
To summarize major findings in our experiments: 
a) VM-µCheckpoint achieves much better performance than existing migration-based VM
checkpointing. For the SPEC CINT 2006 benchmark workload and a checkpoint frequency of 20
times per second (Tck=50 ms), an average of 6.3% overhead is incurred when CoW-P is deployed.
With the same checkpoint algorithm and checkpoint frequency, Apache server throughput is
reduced by 17.5%. In contrast, Remus [38], a migration-based VM replication/checkpoint
technique, reports approximately 50% overhead in their experiments for the same checkpoint
frequency. If we reduce the frequency to 5 times per second, the average overhead is only 3.8% for
the SPEC CINT 2006 workload and around 9% for the Apache web server (using CoW-P).
b) The speedup the CoW-P algorithm gains over CoW-B is significant when checkpoint
frequency is high. With CoW-P deployed with 50 ms checkpoint intervals, Apache throughput is
82.5% of the baseline performance, compared with 74.3% when CoW-B is deployed.
c) Checkpoint sizes are relatively small with short checkpoint intervals selected in our
experiments. The results show that, with 50 ms checkpoint intervals (using CoW-P), all
checkpoint sizes are less than 2000 memory pages (8 MB) with an average of 655 pages (2.6 MB)
for the SPEC CINT 2006 workload (the size of the entire system state is up to 51461 memory
pages, or 206 MB).
5.4.1. Program Execution Time 
A set of SPEC CINT 2006 benchmark programs are executed in the protected guest system with 
VM-µCheckpoint deployed. A suite of experiments are conducted involving each of these 
benchmark programs: (i) a baseline case (no checkpoint), (ii) CoW-B algorithm deployed with 4 
different checkpoint intervals (1000 ms, 600 ms, 200 ms, and 50 ms), and (iii) CoW-P algorithm 
deployed with the same 4 intervals. A given program executes with the same input across all 
experiments.  
Program execution times are measured, and normalized execution times are illustrated in Figure 
21 (95% confidence intervals of execution times are computed but are not presented in this figure). 
Normalized execution time is computed by dividing program execution time by the execution 
time in the corresponding baseline case. We observe from Figure 21 that: 
 
i) For all programs the impact of the checkpoint on the program execution time is no more
than 11% (the normalized execution times are no more than 1.11) and the average overhead is
6.3% (the average of the normalized execution times is 1.063) when the CoW-P algorithm is
deployed with 50 ms checkpoint intervals. Compared with around 50% overhead in Remus, this is
a great improvement. If we increase the checkpoint interval to 200 ms, the average overhead drops
to 3.8% (using CoW-P).
[Figure 21 plot: Normalized Execution Time (y-axis, 0.9 to 1.2) for perlbench, bzip2, gcc, milc, namd, dealII, povray, omnetpp, astar, sphinx3, xalancbmk, and specrand; series: baseline, CoW-B and CoW-P at 1000 ms, 600 ms, 200 ms, and 50 ms.]
Figure 21: Experiment Results in Terms of Execution Time of SPEC CINT 2006 
ii)  The performance overhead increases as checkpoint frequency grows. 
iii) Use of CoW-P gains a larger speedup over CoW-B for high checkpoint frequency. This is
because the pre-saving in CoW-P reduces the number of page faults when the checkpoint interval
is small. For low-frequency checkpointing, the pre-saving does not provide much improvement; in
certain cases it even degrades checkpoint performance. Such performance degradation can be
observed in experiments for perlbench and omnetpp in Figure 21. This result is due to the fact that
memory access locality plays a significant role when checkpoint intervals are short. With a
checkpoint interval as large as 1 s, there are (in general) many mispredictions, and the pre-saving
does a lot of wasteful work preserving pages that are not updated.
5.4.2.   Web Server Throughput 
We conducted experiments to study how VM-µCheckpoint affects Apache web server throughput 
when the web server runs on the protected guest system. Web clients reside on three physical 
machines with each machine hosting 50 clients. These clients request the same load of web pages, 
one request immediately after another, from the server simultaneously via a 100 Mbps LAN. The 
output-commit mechanism is disabled in these experiments (as the I/O handling is not the focus of 
this work), and consequently, we compare our performance with Remus results when the output 
commit is also disabled. 
[Figure 22 plot: number of web requests/sec (y-axis, 0 to 300) versus checkpoint interval Tck (baseline, 1000 ms, 600 ms, 200 ms, 50 ms). Throughput relative to the baseline (100%), in that interval order: CoW-B 91.1%, 90.7%, 85.8%, 74.3%; CoW-P 92.7%, 91.9%, 90.9%, 82.5%.]
Figure 22: Impacts of VM-µCheckpoint on Apache Web Server Throughput (percentage 
represents the ratio between the corresponding throughput and the baseline throughput, 
e.g., 82.5%=229.8/278.7) 
We measured the throughput of the web server in multiple experiments with VM-µCheckpoint
deployed at different checkpoint intervals. Because the same load of web requests is processed in
all of these experiments, the measured throughputs can be compared to evaluate the impact of our
checkpointing on throughput.
Figure 22 shows the measured server throughput as a function of checkpoint intervals. The 
percentages indicated along the data points on the graph represent the ratio between the 
throughput measured with the checkpoint deployed and the throughput in the baseline case (when 
checkpoint is not deployed). The three observations we made from measurements of program 
execution times in Section 5.4.1 are confirmed by the throughput results. 
    i) The throughput is reduced by 17.5% when a checkpoint is taken 20 times per second.
Remus reports approximately 50% overhead for the SPECweb benchmark at the same checkpoint
frequency (when output commit is enabled, its overhead is around 72%). If the checkpoint
interval is increased to 200 ms, the throughput is reduced by only 9%. Our overhead results are
conservative because we run a stressful load of web requests in experiments, and the typical 
server workload is not as intensive. 
    ii) Checkpoint overhead increases with higher checkpoint frequency.  
    iii) The CoW-P algorithm has performance improvements over CoW-B, especially in cases 
with small checkpoint intervals (the gaps between the two curves keep increasing and are much 
larger for intervals of 200 ms and 50 ms in Figure 22). 
5.4.3.   Overhead Measurement  
The number of checkpoint-caused page faults is a direct measurement of the time overhead of our 
checkpointing (other page faults are not counted as checkpoint overhead). Checkpoint size, i.e., 
the number of memory pages kept in a checkpoint, is the space overhead measurement.  
A number of experiments were conducted to study VM-µCheckpoint overhead with different
checkpoint algorithms (i.e., CoW-B or CoW-P) at different checkpoint intervals. In each of these
experiments, the 12 programs of the SPEC CINT 2006 benchmark are executed sequentially, and
the total duration is about half an hour. We measure the number of checkpoint-caused page faults
and the checkpoint size (in number of memory pages) in every checkpoint interval (e.g.,
50 ms) over the duration of the experiment. For example, Figure 23 shows the number of
checkpoint-caused page faults during an experiment with CoW-B deployed at a checkpoint
interval of 1000 ms (the x-axis represents the elapsed execution time of the experiment). The labels 1 to
12 indicate the periods corresponding to the executions of the 12 programs,
respectively.
[Figure 23 plot. Y-axis: number of checkpoint-caused page faults (0 to 7000). X-axis: elapsed time (s). Labels 1 to 12 mark the execution periods of the 12 programs.]
Figure 23: Checkpoint-Caused Page Faults When CoW-B Is Deployed at the Checkpoint 
Interval of 1000 ms  
Table 11: Average Checkpoint-Caused Page Faults in Experiments 
Algorithm Tck(ms) Overall perl-benc
h 
gcc milc dealII 
50 491.1 342.4 396.0 1112.7 235.8 CoW-B 
1000 1057.2 1076.6 1398.1 1298.0 996.0 
50 124.7 139.7 172.2 35.7 163.3 CoW-P 
1000 521.5 182.0 527.9 319.7 842.8 
Table 12: Average Checkpoint Sizes (in Number of Memory Pages) in Experiments 
Algorithm Tck(ms) Overall perl-benc
h 
gcc milc dealII 
50 491.1 342.4 396.0 1112.7 235.8 CoW-B 
1000 1057.2 1076.6 1398.1 1298.0 996.0 
50 654.5 565.7 626.3 1154.6 524.2 CoW-P 
1000 2162.4 1260.0 2144.6 1873.9 3151.6 
Results. The overall checkpointing overheads measured throughout the experiments, as well as 
the overheads for several individual programs, are presented in Table 11 (time overhead) and 
Table 12 (space overhead). The major findings from the experimental results are summarized 
below. 
• Average checkpoint sizes are very small, less than 2% of the size of the entire system state
when the checkpoint interval is 50 ms. Table 12 shows that with CoW-P deployed at a checkpoint
interval of 50 ms, the average checkpoint size is 654.5 memory pages or 2.6 MB, while the size of
the entire system state during the experiment is up to 51,461 memory pages (206 MB). The
maximum checkpoint size observed is less than 8 MB (2000 pages; due to space constraints the
figure showing this data is not given here), less than 4% of the entire system state size. When the
checkpoint interval is increased to 1000 ms, most checkpoints are smaller than 10,000 pages, and the
average size is 2162.4 pages (8.6 MB, or 4.2% of the entire state).
• Dirty page prediction and pre-saving reduce page faults by 75% when the
checkpoint interval is 50 ms (124.7 page faults with CoW-P versus 491.1 with CoW-B, as shown in
Table 11). When the checkpoint interval is 1000 ms, CoW-P still achieves a 51% reduction in page
faults (1057.2 reduced to 521.5). The reduction is smaller for larger checkpoint intervals, as there is
more memory access locality within shorter intervals.
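The percentages quoted in these findings follow directly from the "Overall" columns of Tables 11 and 12. The following is a minimal sketch of that arithmetic, assuming 4 KB memory pages and 1 MB = 1000 KB (both are assumptions inferred from the quoted sizes, not stated in the text); the function names are illustrative only.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KB pages, consistent with 654.5 pages being about 2.6 MB

# "Overall" averages copied from Table 11 (checkpoint-caused page faults per interval).
faults = {("CoW-B", 50): 491.1, ("CoW-B", 1000): 1057.2,
          ("CoW-P", 50): 124.7, ("CoW-P", 1000): 521.5}
# "Overall" averages copied from Table 12 (pages per checkpoint) for CoW-P.
sizes_cow_p = {50: 654.5, 1000: 2162.4}
SYSTEM_STATE_PAGES = 51_461  # size of the entire system state, from the text


def fault_reduction(tck_ms):
    """Fraction of checkpoint-caused page faults removed by CoW-P relative to CoW-B."""
    base = faults[("CoW-B", tck_ms)]
    return (base - faults[("CoW-P", tck_ms)]) / base


def pages_to_mb(pages):
    """Convert a page count to MB (assumes 1 MB = 1000 KB)."""
    return pages * PAGE_SIZE_KB / 1000


for tck in (50, 1000):
    share = sizes_cow_p[tck] / SYSTEM_STATE_PAGES
    print(f"Tck={tck} ms: fault reduction {fault_reduction(tck):.1%}, "
          f"CoW-P checkpoint {pages_to_mb(sizes_cow_p[tck]):.1f} MB "
          f"({share:.1%} of system state)")
```

Note that for CoW-B the two tables coincide, which is consistent with every checkpoint-caused page fault saving exactly one page.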
An interesting observation is that there are a small number of isolated peaks in Figure 23 (up
to 21,363; they are cropped to make the figure easier to read). These peaks are caused by a
program’s loading of a large amount of data (mostly for reading).
5.4.4.   Experiments on Virtual Machine Recovery 
Experiments were conducted to test the correctness of the proposed technique in recovering a 
virtual machine and to measure the recovery time for evaluation. We regard application failure as 
error detection in these experiments rather than installing an error detector to do this job. For this 
purpose, a small program is developed that causes a segmentation failure after executing for a 
while. The instrumented hypervisor-level exception handler then issues an “error detected” 
request via a divide-by-zero exception. 
The SPEC CINT 2006 benchmark programs run as the workload on the protected virtual machine 
in these experiments. The small program is launched to generate a failure while the workload is 
running. The protected virtual machine is then rolled back to the last committed checkpoint. The 
measured recovery time depends on the number of memory pages restored during recovery. As
most checkpoint sizes range from several hundred to several thousand memory pages (shown
in Table 12), the measured recovery time ranges from 144 ms to 1017 ms, with an average of
639.4 ms (the 95% confidence interval is 639.4 ms ± 193.1 ms).
  
CHAPTER 6 
INTEGRATING VM CHECKPOINTING INTO RMK 
 
In Chapters 2 and 3, we discussed the RMK architecture and a number of error detection 
techniques (e.g., system hang detection) on standalone systems. In Chapter 4, we discussed the 
technique of checkpointing VMs. In this chapter, we see how the RMK architecture is enhanced 
for a virtualized environment and how the enhanced RMK allows error detection and 
checkpoint/recovery to be integrated in virtualized environments. 
6.1. RMK Deployment in Virtualized Environment 
Figure 24 depicts the RMK deployment in the protected VM and the hypervisor. Similar to the 
RMK in a standalone system, the RMK is installed as a device driver in the VM. The kernel 
source code is not required, as the kernel of the VM does not need to be recompiled. Figure 24 
shows the SHD module is installed in this RMK together with two RMK pins: P_PMC and 
P_SCHL.  
The Xen hypervisor does not provide a standard mechanism to allow for dynamically loading 
code into the hypervisor space as the Linux kernel module does. So we instrumented the Xen 
hypervisor and encapsulated the hypervisor-level RMK (including the RMK core and RMK 
modules/pins) into a Xen hypercall. A hypercall is like a system call for a hypervisor. The 
instrumented Xen is recompiled to get the RMK installed in the hypervisor. 
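[Figure 24 diagram. Components: hardware; hypervisor containing an RMK core, the RMK modules COWB, COWP, and recovery, and the RMK pins P_VCPU, P_PTABLE, P_VMSIGNAL, P_VMSCHL, and P_PMC_HELPER; checkpointing VM with guest OS, apps, and the checkpoint agent; protected VM with guest OS, apps, an RMK core, the System Hang Detection module, and the RMK pins P_SCHL and P_PMC.]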
Figure 24: Integrated Error Detection and Checkpoint/Recovery under RMK in a 
Virtualized Environment 
The RMK modules COWB, COWP, and recovery implement the two checkpoint algorithms and 
the recovery algorithm in VM-µCheckpoint. The RMK pins P_VCPU and P_PTABLE wrap the 
hypervisor functionalities of manipulating virtual CPUs and the shadow page table, respectively. 
P_VMSIGNAL intercepts received signals, and P_VMSCHL intercepts scheduling of VMs by the 
hypervisor. 
6.2. Detecting and Recovering from VM/Application Crashes 
When a guest system or an application in this guest system crashes, e.g., due to a NULL pointer or 
segmentation fault, an exception is raised by the hardware and reported to the hypervisor. The 
P_VMSIGNAL pin intercepts this exception, determines whether the exception causes a crash 
(e.g., segmentation fault) or not (e.g., page fault), and produces an EVT_ERRORDETECTED 
event if the exception causes a crash. Note that here we capture crashes of both the protected VM 
and applications in the VM.  
The recovery module subscribes to this event. Upon receiving the event, it suspends the failed guest
system (or the guest system in which the failed application is located) and asks the checkpoint
agent to perform recovery. The checkpoint agent then restores the guest system state from its checkpoint
via the recovery module (details of the checkpointing and recovery procedures are given in
Chapter 5).
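The detection-to-recovery flow described in this section can be sketched as a small publish/subscribe model. This is an illustrative toy, not the RMK implementation: the class names, the method names, and the set of exceptions treated as crashes are all assumptions made for the example; only the event name EVT_ERRORDETECTED and the roles of the pin and the recovery module come from the text.

```python
from collections import defaultdict

EVT_ERRORDETECTED = "EVT_ERRORDETECTED"
# Assumption: which exceptions count as crashes; the pin's real policy is not listed here.
CRASH_EXCEPTIONS = {"segmentation_fault", "divide_by_zero"}


class RmkCore:
    """Toy event bus standing in for the RMK core (hypothetical API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event, handler):
        self._subscribers[event].append(handler)

    def publish(self, event, payload):
        for handler in self._subscribers[event]:
            handler(payload)


class VmSignalPin:
    """Classifies intercepted exceptions; only crash-causing ones raise an event."""

    def __init__(self, core):
        self.core = core

    def on_exception(self, vm_id, exception):
        if exception in CRASH_EXCEPTIONS:
            self.core.publish(EVT_ERRORDETECTED, vm_id)


class RecoveryModule:
    """Suspends the failed guest and records a recovery request for the checkpoint agent."""

    def __init__(self, core):
        self.actions = []
        core.subscribe(EVT_ERRORDETECTED, self.on_error)

    def on_error(self, vm_id):
        self.actions.append(("suspend", vm_id))
        self.actions.append(("request_recovery", vm_id))


core = RmkCore()
recovery = RecoveryModule(core)
pin = VmSignalPin(core)
pin.on_exception("vm1", "page_fault")          # benign exception: no event is produced
pin.on_exception("vm1", "segmentation_fault")  # crash: the recovery module is notified
print(recovery.actions)
```

Decoupling the pin from the recovery module through an event keeps the detector unaware of how recovery is performed, which matches the modular structure the section describes.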
6.3. Detecting and Recovering from VM Hangs 
Recall that the SHD module detects system hangs by counting executed instructions (Chapter 3).
In a system executing equal-priority tasks, the number of instructions executed between two
consecutive context switches in the protected VM is bounded. A VM system in a hang state
does not relinquish the virtual CPU and does not schedule any thread. When the number of
instructions executed between consecutive context switches in the VM exceeds a preset value, a VM hang is
flagged.
Specifically, the P_SCHL pin intercepts the guest system scheduler by means of binary rewriting 
(so there is no need to recompile the guest system kernel). When a context switch occurs in the 
VM, a hardware counter is reset to zero by the P_PMC pin. Because the hypervisor does not allow 
the protected VM to modify hardware counters (privileged instructions are involved), the 
P_PMC_HELPER pin in the hypervisor modifies the hypervisor source code to grant such 
accesses to the VM.  
The P_VMSCHL pin instruments the hypervisor scheduler to check whether the hardware counter 
value exceeds a preset threshold value. If the value exceeds the threshold, P_VMSCHL generates 
an EVT_ERRORDETECTED event, which will be processed by the recovery module and the 
checkpoint agent to recover the VM from its checkpoint. Unlike the system hang detection
on the standalone system (Chapter 3), we do not need a non-maskable interrupt (NMI) to flag a
system hang, because the hypervisor can check the hardware counter value
directly, even when the VM has failed.
Moreover, P_VMSCHL ensures that the hardware counter counts only instructions executed by 
the protected VM. When the VM is switched off the processor by the hypervisor, the hardware 
counter is suspended; when the VM is switched onto the processor again, the hardware counter is 
resumed. Here we assume only one VM is protected. When multiple VMs are to be protected, the
hypervisor can keep one variable per VM to record its count, and the
number of executed instructions counted by the hardware counter is added to the
variable of the corresponding VM.
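The counting scheme above, including the per-VM variables suggested for protecting multiple VMs, can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, the hardware counter is replaced by an explicit number of retired instructions passed in by the caller, and the threshold value is arbitrary.

```python
class VmHangDetector:
    """Per-VM instruction accounting checked by the hypervisor scheduler (toy model)."""

    def __init__(self, threshold):
        self.threshold = threshold   # max instructions allowed between guest context switches
        self.counts = {}             # one software counter per protected VM

    def on_guest_context_switch(self, vm_id):
        # Role of P_SCHL and P_PMC: the guest scheduler ran, so the count restarts at zero.
        self.counts[vm_id] = 0

    def on_vm_descheduled(self, vm_id, instructions_retired):
        # Role of P_VMSCHL: credit only the instructions executed while this VM held the CPU,
        # then compare against the threshold. Returns True when a hang is flagged.
        self.counts[vm_id] = self.counts.get(vm_id, 0) + instructions_retired
        return self.counts[vm_id] > self.threshold


detector = VmHangDetector(threshold=1_000_000)
detector.on_guest_context_switch("vm1")
print(detector.on_vm_descheduled("vm1", 400_000))  # False
detector.on_guest_context_switch("vm1")            # a healthy guest keeps scheduling
print(detector.on_vm_descheduled("vm1", 700_000))  # False
print(detector.on_vm_descheduled("vm1", 700_000))  # True: no context switch in between
```

Accumulating into a per-VM variable only when the VM is switched off the processor is what keeps instructions executed by other VMs out of the count.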
6.4. Error Injection Experiments   
We conducted error-injection experiments to show the validity of the automatic approach that 
RMK provides for error detection and VM checkpoint/recovery. Our experiments showed that, if 
the error injector is a script or program within the protected VM, the VM is recovered from the 
checkpoint successfully by RMK, but the same error is injected again immediately. This is 
because our error injection script/program is deterministic even when pseudo-random numbers 
are used for the error injection. 
Therefore, to avoid this repetition of error injection in an automatic experiment campaign, 
information from outside the protected VM is required to control the error injection behavior. For 
example, a flag variable in the protected VM can record whether an error is to be injected, and this 
flag variable is set by the hypervisor outside the protected VM. Because we only want to show the
validity of our approach rather than conduct a systematic error injection campaign, we did not
implement such an automated experiment, but instead conducted these experiments manually. As a
result, only tens of experiments were run. To simplify these manual experiments, a
simple program is used as the workload.
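The repetition problem can be reproduced in a few lines: when the injector's pseudo-random generator state is part of the checkpointed VM state, a rollback replays exactly the same choice. The sketch below is illustrative only; the function name, the seed, and the number of injection targets are assumptions, and the external flag stands in for the hypervisor-controlled variable mentioned above.

```python
import random


def run_guest_injector(rng, inject_enabled):
    """Guest-side injector: deterministic given the PRNG state and the external flag."""
    if not inject_enabled:
        return None
    return rng.randrange(32)   # index of the fault to inject (32 targets is an assumption)


rng = random.Random(2010)       # the injector's PRNG lives inside the protected VM
checkpoint = rng.getstate()     # so a VM checkpoint captures its state as well

first = run_guest_injector(rng, inject_enabled=True)

rng.setstate(checkpoint)        # rollback restores the PRNG together with the rest of the VM
replayed = run_guest_injector(rng, inject_enabled=True)

rng.setstate(checkpoint)        # the flag is held outside the VM, so rollback cannot reset it
guarded = run_guest_injector(rng, inject_enabled=False)

print(first == replayed, guarded)
```

The first two runs pick the same target, while the third injects nothing because the controlling information lies outside the checkpointed state.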
Table 13 lists the results of our manual experiments. Three kinds of errors were injected, as 
described in the following sections.  
Table 13: Results of Error Injection Experiments 
Experiments | Fault/Error Injected | Faults/Errors Injected | Activated and Detected | Recovered
Signal-triggered crashes | Sends SIGTERM to the application in the protected VM (fail-stop) | 35 | 35 | 35
Bit flips into kernel registers | Suspends the protected VM, flips a bit in a register value in the VM, and resumes the VM | 85 | 31 | 31
System hangs (threshold of 200 ms) | Loads a device driver which runs an infinite loop in the protected VM | 30 | 30 | 24*
*Due to inconsistent shared state between the protected VM and Dom0 for I/O operations
6.4.1. Signal-Triggered Crash 
We trigger a crash of the workload application by sending a SIGTERM signal to the application process. On
receiving this signal, the workload program executes a divide-by-zero instruction, which traps the
processor into the hypervisor. The P_VMSIGNAL pin then generates the
EVT_ERRORDETECTED event, and the recovery module restores the checkpoint to the
protected VM. We conducted 35 experiments, and in all of them the protected VM was successfully
recovered.
6.4.2. Bit Flips in VM Kernel Registers 
We also injected bit-flip faults into kernel registers of the protected VM. The error injector is 
placed outside the protected VM, i.e., in the dom0, to avoid repeating the same error injection 
after recovery from the checkpoint.  
The error injector in the dom0 first suspends the protected VM via the hypervisor. As a result, the 
state of the virtual CPUs in the VM is saved in a hypervisor data structure called 
vcpu_guest_context. Then we randomly select a bit in the register file (e.g., EAX, EBX,
ESI, EDI, ESP, EIP, CS, SS, ES) in the vcpu_guest_context and flip it. Finally, we resume the
protected VM, and the flipped value is written back to the corresponding register in the virtual
CPUs.
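The injection step itself amounts to an XOR with a one-bit mask on a saved register value. The following sketch shows that step on a plain dictionary standing in for the saved register context; it does not use the Xen interface, and the register subset, the 32-bit width, and the function name are assumptions made for illustration.

```python
import random

# Illustrative subset of the registers named in the text; all treated as 32-bit here.
REGISTERS = ("eax", "ebx", "esi", "edi", "esp", "eip")
WIDTH = 32


def flip_random_bit(context, rng):
    """Flip one randomly chosen bit in one randomly chosen register, in place."""
    reg = rng.choice(REGISTERS)
    bit = rng.randrange(WIDTH)
    context[reg] ^= 1 << bit    # XOR with a one-bit mask toggles exactly that bit
    return reg, bit


# Stand-in for the register state saved when the VM is suspended (values are made up).
saved_context = {reg: 0 for reg in REGISTERS}
saved_context["eip"] = 0xC0100000

before = dict(saved_context)
reg, bit = flip_random_bit(saved_context, random.Random(7))
print(reg, bit, hex(before[reg]), "->", hex(saved_context[reg]))
```

Because the flip is applied to the saved context rather than to live registers, the corrupted value takes effect only when the VM is resumed.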
We ran 85 experiments and found that the activation rate is fairly low (only 31 of the 85
injected faults were activated). All of the activated and manifested errors were detected by
the P_VMSIGNAL pin and successfully recovered. To better evaluate the coverage of the
checkpoint/recovery, we then injected faults into only the EIP register. In this case, all of the faults were
detected and recovered.
6.4.3. System Hangs  
We also injected system hangs in the protected VM. Specifically, we load in the VM a device 
driver that runs an infinite loop. In the 30 experiments conducted, there are 6 cases in which the 
recovery from the checkpoint fails. 
We looked into the details of the cases when recovery fails and discovered that the shared state 
between the protected VM and the dom0 for handling I/O operations is inconsistent with the state 
of the protected VM after recovery from the checkpoint. Figure 25 illustrates the details of the 
inter-domain shared memory for I/O operations in Xen.  
[Figure 25 diagram. Components: hardware; hypervisor with the grant table; Dom0 with blkback; protected VM with blkfront; shared memory in Dom0 and in the protected VM.]
Figure 25: Inter-Domain Shared Memory for I/O Operations in Xen 
The Xen hypervisor uses a split driver model for handling I/O operations (network, disk, etc.). 
The blkfront is the front end of the driver in the protected VM, and the blkback is the back end in 
Dom0 (as shown in Figure 25). Shared memory is used to facilitate I/O data transfer. The shared
state includes the request ring buffer, producer/consumer pointers (blkfront and blkback follow a
producer-consumer model), buffers, protocol status for the split driver, event channel state, etc.
When a request arrives at either the blkfront or the blkback, a shared buffer is created in the
protected VM or the dom0 to host I/O data, and this buffer is registered through the grant table
mechanism in the hypervisor. After the request is processed, the buffer is released through the grant
table.
If the protected VM is recovered from a checkpoint, the blkfront may expect a shared buffer to be
present and registered in the hypervisor’s grant table, but this may no longer be true. The recovery then
fails because a non-existent buffer is accessed. Our experiments show that this scenario happens
when the error detection latency is large. That is why we observed recovery failures only in
cases when system hangs are injected (200 ms is used as the threshold value for detecting VM
hangs).
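The failure mode can be illustrated with a toy model in which the guest's view of outstanding grant references is checkpointed but the hypervisor's grant table is not. This is not Xen's grant-table API; the class, the function names, and the use of integer references are assumptions made purely to show how the two views diverge after a rollback.

```python
class ToyGuest:
    """Front-end view: grant references the guest believes are still registered."""

    def __init__(self):
        self.pending = set()


def issue_request(guest, grant_table, ref):
    guest.pending.add(ref)
    grant_table.add(ref)          # shared buffer registered with the hypervisor


def complete_request(guest, grant_table, ref):
    guest.pending.discard(ref)
    grant_table.discard(ref)      # shared buffer released after processing


def dangling_refs(guest, grant_table):
    """References the guest would access that the hypervisor no longer knows about."""
    return guest.pending - grant_table


grant_table = set()               # hypervisor state, NOT part of the VM checkpoint
guest = ToyGuest()

issue_request(guest, grant_table, ref=1)
checkpoint = set(guest.pending)   # checkpoint taken while request 1 is in flight
complete_request(guest, grant_table, ref=1)

guest.pending = set(checkpoint)   # rollback restores only the guest side
print(dangling_refs(guest, grant_table))   # {1}
```

A longer detection latency leaves more time for in-flight requests to complete between the checkpoint and the rollback, which is consistent with failures appearing only in the hang experiments.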
To handle this problem, we should instrument the hypervisor and the blkback driver in the dom0 
to handle I/O correctly. Besides holding output data until checkpoint is committed (similar to the 
“output-commit” mechanism used in Remus and Revive I/O), we should also save the shared 
memory and the grant table in the checkpoint. 
This has not yet been done in the current prototype implementation because I/O handling is not the
research focus of this thesis, and analyzing the source code would require significant engineering
effort, since the split-driver model is not well documented.
 
CHAPTER 7 
MODEL-BASED ANALYSIS OF VM CHECKPOINTING  
 
7.1. Checkpoint Corruption Model  
Two factors are important in determining checkpoint corruption: error occurrence instant and 
error detection latency. Both are addressed in our latency-driven checkpointing provided by 
VM-µCheckpoint. In this section, we construct a model of checkpoint corruption scenarios to 
study how the latency-driven checkpointing alleviates checkpoint corruption.  The following 
three assumptions are made in our model to simplify the analysis while still providing valuable 
insight into checkpoint corruption behavior:  
    (i) Unmasked errors (see footnote 22) are eventually detected by either application-level (e.g., embedded
assertions) or system-level (e.g., application failure, exception handling, kernel panic) detection 
mechanisms, and only detected errors can trigger checkpoint-based recovery;  
    (ii) Error occurrence probability is uniformly distributed during any given period; and 
    (iii) Error latency is exponentially distributed.  
[Footnote 22: An unmasked error is a transient error that remains alive throughout the program life and is not overwritten by the program.]
The checkpoint corruption model is constructed for a given unmasked error occurrence. We first
identify the checkpoint interval in which the error occurrence falls. As an example, Figure 26 (a)
shows that an error occurs between chkpt0 and chkpt1. The time offset of the error occurrence
relative to the time of chkpt0 is denoted as a (0 ≤ a < Tck). The system continues execution after the
error occurrence, and the error is detected after a latency of l (also shown in Figure 26 (a)). a and l
are two random variables: a is uniformly distributed within [0, Tck), and l is exponentially
distributed with rate λ. Then, the pdf (probability density function) for a is given by:
$$\mathrm{pdf}(a):\; f(x) = \frac{1}{T_{ck}}, \quad x \in [0, T_{ck}),$$
and the pdf for l is given by: 
$$\mathrm{pdf}(l):\; g(y) = \lambda e^{-\lambda y}, \quad y \in [0, +\infty).$$
Because the error latency is independent of the error occurrence time, a and l are independent random
variables. Their joint pdf, from which the distribution of a+l is computed, is therefore:
$$\mathrm{pdf}(a, l):\; h(x, y) = f(x)\,g(y), \quad x \in [0, T_{ck}),\; y \in [0, +\infty).$$
[Figure 26 diagram. Two timelines, (a) and (b), each showing checkpoints chkpt0, chkpt1, and chkpt2 spaced Tck apart, an error occurrence at offset a after chkpt0, and error detection/failure after a further latency l.]
Figure 26: Timeline of Checkpoint Corruption Scenarios 
Note that any checkpoint taken after the occurrence of an unmasked error and before the detection
of the error (e.g., chkpt1 in Figure 26(a)) must be corrupted. Otherwise, if chkpt1 was in a correct
state, then the error detection/failure in Figure 26(a) could not have happened; there are no other
errors. In our model, this condition of checkpoint corruption is represented as $a < T_{ck} \;\&\; a + l > T_{ck}$.
The probability of the error corrupting this checkpoint is:
$$P\{a < T_{ck} \;\&\; a + l > T_{ck}\} = P\{a + l > T_{ck}\} = 1 - P\{a + l \le T_{ck}\} \quad \text{for } a \in [0, T_{ck}).$$
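$$\begin{aligned}
P\{a + l \le T_{ck}\} &= \iint_{x+y \le T_{ck}} h(x, y)\,dx\,dy = \iint_{x+y \le T_{ck}} f(x)\,g(y)\,dy\,dx \\
&= \int_0^{T_{ck}} \int_0^{T_{ck}-x} \frac{1}{T_{ck}}\,\lambda e^{-\lambda y}\,dy\,dx
 = \frac{1}{T_{ck}} \int_0^{T_{ck}} \left(1 - e^{-\lambda (T_{ck}-x)}\right) dx
 = 1 - \frac{1}{\lambda T_{ck}} \left(1 - e^{-\lambda T_{ck}}\right)
\end{aligned}$$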
 
Consequently,  
$$P\{a < T_{ck} \;\&\; a + l > T_{ck}\} = \frac{1}{\lambda T_{ck}} \left(1 - e^{-\lambda T_{ck}}\right) \qquad (5)$$
Selecting Tck to cover short-lived errors. To mitigate checkpoint corruption, we want to select 
the checkpoint interval to be larger than the latency of most errors. To do that, it is crucial to get 
realistic estimates of error latency. How does one obtain such estimates in practice? Two methods 
come to mind: (i) analyze detection characteristics of detectors deployed in the system/application 
and (ii) inject faults into the target application/system to measure error latency. For example, 
according to Gu et al. [39], 95% of Linux kernel crashes have error latency of less than 100M
CPU cycles (or 50 ms on 2 GHz processors). If we use this data in our model, then
$P\{l \le T_B\} = p = 0.95$ for $T_B$ = 50 ms. As $P\{l \le T_B\} = 1 - e^{-\lambda T_B}$, we get λ = 0.0599 (1/ms) for the
test system in [39].
If we select 50 ms as the checkpoint interval Tck, i.e., a value covering 95% of error latency, the
probability of checkpoint corruption is 31.7% (computed using formula (5)). For
$P\{l \le T_B\} = 1 - e^{-\lambda T_B} = 0.99$ and λ = 0.0599, $T_B$ = 77 ms. So when we increase the checkpoint
interval to 77 ms to cover 99% of error latency, the probability of checkpoint corruption is reduced
to 21.5%.
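The numbers in this paragraph can be reproduced with a short calculation: λ is obtained by inverting the exponential distribution function at the 95% point, and formula (5) is then evaluated at the chosen intervals. The function names below are illustrative, not part of the original work.

```python
import math


def rate_from_percentile(p, t_ms):
    """Rate of an exponential latency whose p-quantile is t_ms: solve 1 - exp(-rate*t) = p."""
    return -math.log(1.0 - p) / t_ms


def latency_percentile_point(p, rate):
    """Latency bound T_B that covers a fraction p of errors for the given rate."""
    return -math.log(1.0 - p) / rate


def single_checkpoint_corruption(rate, tck_ms):
    """Formula (5): probability that the most recent checkpoint is corrupted."""
    x = rate * tck_ms
    return (1.0 - math.exp(-x)) / x


lam = rate_from_percentile(0.95, 50.0)
print(f"lambda = {lam:.4f} per ms")
print(f"99% point = {latency_percentile_point(0.99, lam):.0f} ms")
for tck in (50.0, 77.0):
    print(f"Tck = {tck:.0f} ms: {single_checkpoint_corruption(lam, tck):.1%}")
```

Formula (5) decays only as 1/(λTck) for large intervals, which is why lengthening a single checkpoint interval from 50 ms to 77 ms lowers the corruption probability only modestly.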
Dual checkpoint. Selecting the checkpoint interval to be larger than the latency of most errors
reduces checkpoint corruption probability. However, it does not ensure that any error occurrence
with latency less than the checkpoint interval does not corrupt a checkpoint. A dual-checkpoint
scheme (as shown in Figure 26 (b)) is necessary to provide this assurance (see footnote 23). In this scheme, two
checkpoints are kept at any time, and the older of the two (chkpt1 in Figure 26 (b)) is rolled back
to during recovery. In this scenario, chkpt1 is corrupted only when $a < T_{ck} \;\&\; a + l > 2T_{ck}$. Then the
probability of checkpoint corruption is:
$$P\{a < T_{ck} \;\&\; a + l > 2T_{ck}\} = P\{a + l > 2T_{ck}\} = 1 - P\{a + l \le 2T_{ck}\} \quad \text{for } a \in [0, T_{ck}).$$
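$$\begin{aligned}
P\{a + l \le 2T_{ck}\} &= \iint_{x+y \le 2T_{ck}} h(x, y)\,dx\,dy = \iint_{x+y \le 2T_{ck}} f(x)\,g(y)\,dy\,dx \\
&= \int_0^{T_{ck}} \int_0^{2T_{ck}-x} \frac{1}{T_{ck}}\,\lambda e^{-\lambda y}\,dy\,dx
 = \frac{1}{T_{ck}} \int_0^{T_{ck}} \left(1 - e^{-\lambda (2T_{ck}-x)}\right) dx
 = 1 - \frac{e^{-\lambda T_{ck}}}{\lambda T_{ck}} \left(1 - e^{-\lambda T_{ck}}\right)
\end{aligned}$$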
 
Consequently,  
$$P\{a < T_{ck} \;\&\; a + l > 2T_{ck}\} = \frac{e^{-\lambda T_{ck}}}{\lambda T_{ck}} \left(1 - e^{-\lambda T_{ck}}\right) \qquad (6)$$
Table 14 lists the checkpoint corruption probabilities for different error latency percentiles in 
single-checkpoint and dual-checkpoint scenarios. When dual-checkpoint is used for Tck of 50 ms 
(covering latency of 95% of errors), the checkpoint corruption probability is largely reduced to 
1.59% (using formula (6)). 
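As a sanity check on formulas (5) and (6), the two probabilities can also be estimated by direct simulation of the model: sample a uniformly on [0, Tck) and l from an exponential distribution, then count how often a + l exceeds Tck or 2Tck. The sketch below is one way to do this; the sample count and the seed are arbitrary choices, and the function names are illustrative.

```python
import math
import random


def simulate_corruption(rate, tck, n_intervals, n=200_000, seed=1):
    """Monte Carlo estimate of P{a + l > n_intervals * Tck}, a ~ U[0, Tck), l ~ Exp(rate)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        a = rng.uniform(0.0, tck)    # error occurrence offset within the interval
        l = rng.expovariate(rate)    # error detection latency
        if a + l > n_intervals * tck:
            hits += 1
    return hits / n


def formula_5(rate, tck):
    """Single-checkpoint corruption probability."""
    x = rate * tck
    return (1.0 - math.exp(-x)) / x


def formula_6(rate, tck):
    """Dual-checkpoint corruption probability."""
    x = rate * tck
    return math.exp(-x) * (1.0 - math.exp(-x)) / x


lam, tck = 0.0599, 50.0
print(f"single: simulated {simulate_corruption(lam, tck, 1):.4f}, formula {formula_5(lam, tck):.4f}")
print(f"dual:   simulated {simulate_corruption(lam, tck, 2):.4f}, formula {formula_6(lam, tck):.4f}")
```

Formula (6) equals formula (5) multiplied by the factor e^(-λTck), which is the probability that the latency exceeds one additional full interval; this factor is what drives the sharp drop in the dual-checkpoint case.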
                                                        
[Footnote 23: Note that successfully taking the latter checkpoint implies that the system has successfully executed a period no less than TB since the former checkpoint, and the former checkpoint must not have been corrupted by an error with error latency less than TB.]
Without knowledge of error latency, an imprecise choice of Tck may result in a large checkpoint
corruption probability even when multi-checkpoint schemes are deployed. For example, if two
checkpoints are kept but Tck is selected as 20 ms (covering 70% of error latency in our example data), the
probability of checkpoint corruption is 17.6%, much larger than the 1.59% obtained when 95% of the error
latency distribution is covered.
Table 14: Checkpoint Corruption Probabilities in Different Scenarios 
Error latency percentile (p) | p-percentile point of error latency, used as Tck (ms) | Prob. of checkpoint corruption in single-checkpoint at Tck | Prob. of checkpoint corruption in dual-checkpoint at Tck
70%   | 20  | 58.3% | 17.6%
90%   | 38  | 39.1% | 4.05%
95%   | 50  | 31.7% | 1.59%
99%   | 77  | 21.5% | 0.21%
99.9% | 115 | 14.5% | 0.015%
7.2. Availability Model 
In this section we construct a Markov model to study the availability improvement provided by 
VM-µCheckpoint to protected VMs. The model captures failure and recovery behavior of all the 
involved components. Specifically, the following failures are modeled: 
a) Transient failure of a protected VM (or an application in the VM): The VM is successfully 
recovered by the checkpointing VM if there is no checkpoint corruption. When there is 
checkpoint corruption, the VM cannot be recovered by VM-µCheckpoint, and a new VM is 
started on the same physical host. The jobs being executed at the time of the failure are 
resubmitted from the beginning. The checkpoint corruption probability derived from the 
checkpoint corruption model above is multiplied by the failure rate of a protected VM to obtain 
the rate of the failures with checkpoint corruption. The difference between the failure rate of the 
protected VM and the failure rate with checkpoint corruption is the rate of failures successfully 
recovered by VM-µCheckpoint.   
b) Transient failure of the checkpointing VM: The checkpointing VM is restarted from failure 
on the same physical host and begins to receive checkpoints from protected VMs. During this 
procedure, the protected VM is still available for job execution. When a protected VM fails during 
the failure/restart of the checkpointing VM, our recovery protocol first restarts the checkpointing 
VM and then restarts the protected VM (and as before, interrupted jobs are restarted from 
beginning). 
c) Failure of the hypervisor and permanent failure of the protected VM or checkpointing VM: 
A hypervisor is started either on the same physical node or on another node, the checkpointing 
VM is started on the hypervisor, and then the protected VM is started (we assume that the disk 
images of VMs can be loaded from any physical host, which is true in most current virtualized 
environments).  
Exponential distribution is assumed for the time to failure and the recovery time for all the 
components. For ease of explanation and brevity, here we present the availability model for only 
one protected VM on top of the hypervisor. A generalized model for n protected VMs on top of 
the hypervisor is described in Section 7.2.3. The following notations are used in the model: 
λv  Rate of the hypervisor software failure and all permanent failures.  
λs  Failure rate of the checkpointing VM alone. 
λp  Failure rate of the protected VM alone.  
rv  Restart rate of the hypervisor software.  
rs  Rate of restarting the checkpointing VM, including saving the first checkpoint.  
r’p  Rate of successfully recovering a protected VM by VM-µCheckpoint.  
rp  Rate of recovering a protected VM when VM-µCheckpoint fails to do that (due to checkpoint 
corruption). Job recomputation is considered as recovery overhead. 
pc  Probability of checkpoint corruption given an error. 
The Markov model in Figure 27 depicts the failure/recovery behavior of the system with 
VM-µCheckpoint deployed. The state of the system is denoted as a vector (k, j), where k 
represents the state of the protected VM and j indicates the state of the checkpointing VM (see 
Figure 27 for more information on the state representation). The model consists of two submodels: 
(a) the submodel for transient failures of the protected VM and the checkpointing VM and (b) the 
submodel for permanent failures and hypervisor failures. 
[Figure 27 diagram. System state representation (k, j): k = 1, protected VM available; k = 0, protected VM failed; j = 1, checkpoint in checkpointing VM; j = 0, no checkpoint in checkpointing VM; j = F, checkpointing VM failed; FS, physical node failed. States shown: (1,1), (0,1), (0,0), (1,F), (0,F), and FS. Transition labels: λp, (1-pc)λp, pcλp, λS, rS, r’p, and rp in submodel (a); λV and rV in submodel (b).]
Figure 27: Availability Model for One Protected VM: (a) Submodel for Transient Failures 
of Guest VMs and (b) Submodel for Permanent Failures and Hypervisor Failures 
Figure 27 captures all the failure/recovery behavior described above. For example, when a 
permanent failure occurs during execution of the protected VM, the entire physical node fails and 
a hypervisor is restarted on a physical node (shown as a sequence of transitions, (1,1)->FS->(0,F), 
in submodel (b)). Then the checkpointing VM and the protected VM (including interrupted jobs) 
are recovered in turn (shown as (0,F)->(0,0)->(1,1) in submodel (a)). The failure/recovery path in 
the model is highlighted in Figure 27. 
The Markov model is solved by imposing the equilibrium condition, i.e., the “input flow” into
each state equals the “output flow” out of the state [46]. We use the mathematics tool package
CLAPACK [47] to solve the resulting linear equations and obtain the steady-state probability of the
system’s staying in each state.
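The solution procedure can be illustrated on a deliberately reduced chain. The sketch below is not the model of Figure 27: it keeps only three states (protected VM up, recovering from a checkpoint, restarting after checkpoint corruption), and the failure rate and restart rate used are arbitrary assumptions; only the structure of the balance equations and the use of pc, r’p, and rp follow the text. A small Gauss-Jordan solver is used in place of CLAPACK so the example is self-contained.

```python
def steady_state(rates, n):
    """Solve the balance equations of an n-state CTMC given as {(i, j): rate}."""
    # Row j encodes: (flow into j) - (flow out of j) = 0.
    a = [[0.0] * n for _ in range(n)]
    for (i, j), rate in rates.items():
        a[j][i] += rate          # flow into j from i
        a[i][i] -= rate          # flow out of i
    b = [0.0] * n
    a[n - 1] = [1.0] * n         # replace one redundant equation by sum(pi) = 1
    b[n - 1] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
                b[row] -= factor * b[col]
    return [b[i] / a[i][i] for i in range(n)]


UP, RECOVER, RESTART = 0, 1, 2
lam_p = 1.0 / (30 * 24 * 60)     # assumption: one failure per 30 days, in 1/min
p_c = 0.0159                     # dual-checkpoint corruption probability from formula (6)
r_ckpt = 100.0                   # r'p: recovery from checkpoint, in 1/min
r_restart = 1.0 / 30.0           # assumption: restart plus recomputation takes 30 min

rates = {(UP, RECOVER): (1 - p_c) * lam_p, (UP, RESTART): p_c * lam_p,
         (RECOVER, UP): r_ckpt, (RESTART, UP): r_restart}
pi = steady_state(rates, 3)
print(f"availability = {pi[UP]:.8f}")
```

For this reduced chain the availability also has the closed form 1 / (1 + (1 - pc)λp/r’p + pcλp/rp), which gives a quick check on the numerical solution.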
7.2.1. Model of Disk-Based Checkpointing 
We extended VM-µCheckpoint to support disk-based checkpointing in Section 5.2.2. Then we 
also extended the availability model (Figure 27) to capture the behavior of recovering VMs from 
checkpoints in disk. The following notations are used in the model: 
λv  Rate of the hypervisor software failure and all permanent failures.  
λs  Failure rate of the checkpointing VM alone. 
λp  Failure rate of the protected VM alone.  
rv  Restart rate of the hypervisor software.  
rs  Rate of restarting the checkpointing VM, including saving the first checkpoint.  
rpm  Rate of successfully recovering a protected VM from in-memory checkpoint.  
rpd Rate of recovering a protected VM from in-disk checkpoint when in-memory checkpoint is 
not available.  
rpr  Rate of restarting a protected VM when in-disk checkpoint is not available. Job recomputation 
from the beginning is considered as recovery overhead. 
pcm Probability of corrupting an in-memory checkpoint by an error. 
pcd  Probability of corrupting an in-disk checkpoint by an error. 
The extended model is given in Figure 28. Figure 28(a) illustrates the failure and recovery of VMs, 
while Figure 28(b) illustrates the failure and recovery of the hypervisor or the node.     
[Figure 28 diagram. System state representation (k, j, m): k = 1, protected VM available; k = 0, protected VM failed; j = 1, checkpoint in memory of checkpointing VM; j = 0, no checkpoint in memory of checkpointing VM; j = F, checkpointing VM failed; m = 1, checkpoint in disk; m = 0, no checkpoint in disk; m = X, don’t care; (Fs, m), node failed with (m=1) or without (m=0) checkpoint in disk. States shown: (1,1,X), (0,1,X), (1,F,X), (0,F,X), (0,0,0), (0,0,1), (0,F,0), (0,F,1), (Fs,0), and (Fs,1). Transition labels in (a): (1-pcm)λp, pcmλp(1-pcd), pcmλppcd, λp, λS, rS, rSpcd, rS(1-pcd), rpm, rpd, and rpr. Transition labels in (b): λV, (1-pcd)λV, pcdλV, and rV.]
Figure 28: Availability Model of VM-µCheckpoint with Extension to Support Disk-based 
Checkpointing, Illustrating Failure/Recovery of (a) VM and (b) Hypervisor or Node 
7.2.2. Availability Study 
After the model is solved, we obtain the availability of the protected VM by adding up the 
probabilities of the system staying in the states (1,1) and (1,F). To demonstrate the performance of 
our technique in availability enhancement, we also construct the Markov models for both the 
baseline case and an existing technique of VM checkpointing based on live migration (Remus 
[38]) for comparison. 
In the baseline case, a VM is on top of the hypervisor and there is no checkpoint of the VM. When 
a failure occurs to the VM, the VM is restarted with all interrupted jobs started from the beginning. 
Remus maintains a backup of a VM on a remote host by migrating the state from the primary to 
the backup periodically (e.g., every 50 ms). When the VM fails, the backup VM begins to execute 
from the last checkpoint. The detailed behaviors of the baseline and Remus as well as their models 
are presented in Sections 7.2.4 and 7.2.5.  
Model parameters. The parameters selected in our model are based on a previous empirical study 
of off-the-shelf servers. The availability study of Windows servers [48] reports that, although the 
average availability of the servers is around 99.9% (without considering job recomputation), some 
servers have an availability of around 99% or less. The authors also report an MTTR (mean 
time to recovery) of 0.25 hour (15 minutes) over all failures. This MTTR includes the 
response time of an administrator who discovers the failure and restarts the failed machine with 
appropriate recovery. 
So in the availability model for VM-µCheckpoint we set the following parameters: 
rv = rs = 1/15 min. 
rp = 1/(0.5*average job duration), because the VM is restarted immediately and the mean job 
recomputation during recovery is half of the average job duration.  
r’p = 1/600 ms = 100/min, as a recovery overhead of around 600 ms was measured in our experiments.  
λv = 1/15000 hours + 1/3 years = 0.000001492/min. 1/15000 hours = 1/1.712 years is the 
hypervisor failure rate; 1/3 years is the permanent failure rate (according to presentations 
made by several Intel engineers in DARPA and other forums). 
λs = 1/15000 min, for the checkpointing VM with 99.9% availability (rs = 1/15 min). 
λp = 1/1500 min (for a protected VM with 99% availability without considering job recomputation) 
or 1/15000 min (for a protected VM with 99.9% availability).  
pc = 1.59%; the value is derived in the checkpoint corruption model for Tck= 50 ms, which covers 
95% of error latency. 
In this parameter selection, λv is much smaller than λs or λp because hardware and hypervisor are 
usually assumed to be much more reliable than the server software and the operating system. This 
is a realistic assumption (also assumed in [49]) because the hypervisor kernel is small (e.g., 
434KB for Xen-3.3.1 vs. 1.5MB for Linux 2.6.18), and hence, verification and test of the 
hypervisor code is relatively easy.  
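The link between the chosen failure rates and the quoted availability levels can be checked with the standard two-state (up/down) availability formula. The sketch below is only an illustration: the function names are ours, and it ignores job recomputation, as the text does for these figures.

```python
def two_state_availability(failure_rate, repair_rate):
    """Steady-state availability of a component that alternates between up and down."""
    return repair_rate / (failure_rate + repair_rate)


def failure_rate_for(availability, repair_rate):
    """Failure rate that yields the given availability for a fixed repair rate."""
    return repair_rate * (1.0 - availability) / availability


if __name__ == "__main__":
    repair = 1.0 / 15.0  # rs = 1/15 min (MTTR of 15 minutes)
    # lambda_s = 1/15000 min and lambda_p = 1/1500 min, as selected in the text
    print(two_state_availability(1.0 / 15000.0, repair))  # about 0.999
    print(two_state_availability(1.0 / 1500.0, repair))   # about 0.990
    # Mean time to failure implied by an exact 99.9% target, in minutes
    print(1.0 / failure_rate_for(0.999, repair))
```

An exact 99.9% target with a 15-minute MTTR corresponds to a mean time to failure of 14985 minutes, so 1/15000 min is a rounded value.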
The parameter values above are also used in the availability models for the baseline and Remus 
(so the recovery rate of Remus is also 1/600 ms), except that a different pc value is selected in the 
model for Remus. For the example data used in Chapter 3, the checkpoint corruption probability 
is 31.7% in a single-checkpoint scheme at the checkpoint interval of 50 ms, if the exponential 
distribution of the error latency is assumed. According to an experimental study in [50], the 
probabilities of checkpoint corruption range from 27% to 41% for different application workloads. 
We select pc=15% in the Remus model for fair comparison (note that pc is probability of 
checkpoint corruption given an error).  
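The 31.7% and 1.59% figures can be reproduced under a simple set of assumptions, which we state here as a sketch rather than as the derivation used in the checkpoint corruption model: the error occurs at a time uniformly distributed within a checkpoint interval, the error latency is exponential, and the latency distribution is scaled so that 95% of errors are detected within one interval. A checkpoint is then corrupted if the error is still undetected when the checkpoint is taken; with two checkpoints, recovery fails only if the error also stays undetected for one further interval. The function names are illustrative.

```python
import math


def latency_rate(interval_ms, coverage):
    """Rate of an exponential latency with P(latency <= interval_ms) = coverage."""
    return -math.log(1.0 - coverage) / interval_ms


def corruption_probability(interval_ms, rate, checkpoints_kept=1):
    """Probability that every kept checkpoint is corrupted, given an error.

    Assumes the error time is uniform over the interval and the latency is
    exponential with the given rate (assumptions of this sketch).
    """
    x = rate * interval_ms
    # Average of exp(-rate * s) over s uniform in [0, interval]
    newest_corrupted = (1.0 - math.exp(-x)) / x
    # Each older checkpoint needs the error to survive one more full interval
    return math.exp(-x * (checkpoints_kept - 1)) * newest_corrupted


if __name__ == "__main__":
    rate = latency_rate(50.0, 0.95)
    print(corruption_probability(50.0, rate, checkpoints_kept=1))  # about 0.317
    print(corruption_probability(50.0, rate, checkpoints_kept=2))  # about 0.0159
```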
Results. The availability values computed from these models are compared in Table 15. Our 
results are better than Remus’s for all the experiment cases. For example, for average job duration 
of 8 hours (i.e., 1/rp = 240 min) on a 99%-available server (λp=1/1500 min), we achieve an 
availability of 99.7%, while Remus achieves 97.7%. There are two reasons for our better results:  
i) Transient failures are much more frequent than permanent failures. Remus's capability of 
tolerating permanent failures via a backup copy on a remote host therefore yields only slightly 
higher availability than our technique, especially for jobs lasting a couple of hours or less. 
Table 16 shows the availability results if there is no checkpoint corruption in either model. For an 
average job duration of 8 hours on a 99%-available server, the availability of the protected VM is 
increased from 99.94% to 99.996% by introducing the remote host for tolerating permanent 
failures. 
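Table 15: Availability Comparison with Checkpoint Corruption (note that 1/rp = 0.5*average job duration) 
| λp | Technique | 1/rp = 15 min | 1/rp = 60 min | 1/rp = 240 min | 1/rp = 1440 min |
|---|---|---|---|---|---|
| 1/1500 min | VM-µCheckpoint | 99.98% | 99.92% | 99.7% | 98.2% |
| 1/1500 min | Remus | 99.8% | 99.4% | 97.7% | 87.4% |
| 1/1500 min | Baseline | 99.0% | 96.1% | 86.2% | 51.0% |
| 1/15000 min | VM-µCheckpoint | 99.99% | 99.98% | 99.93% | 99.6% |
| 1/15000 min | Remus | 99.98% | 99.94% | 99.76% | 98.6% |
| 1/15000 min | Baseline | 99.90% | 99.6% | 98.4% | 91.1% |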
ii) The impact of checkpoint corruption on availability is much larger than that of permanent 
failures in high-frequency checkpointing. VM-µCheckpoint loses little with respect to permanent 
failures, but it gains much more by reducing checkpoint corruption: (a) it selects the 
checkpoint interval so as to cover the latency of 95% of errors, and (b) it applies a dual-checkpoint 
scheme to provide 100% coverage for these 95% of errors (a further 5% - 1.59% = 3.41% of 
errors, out of the remaining 5%, are also covered, according to our model).  
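Table 16: Availability Comparison without Checkpoint Corruption 
| λp | Technique | 1/rp = 15 min | 1/rp = 60 min | 1/rp = 240 min | 1/rp = 1440 min |
|---|---|---|---|---|---|
| 1/1500 min | VM-µCheckpoint | 99.99% | 99.98% | 99.94% | 99.7% |
| 1/1500 min | Remus | 99.997% | 99.997% | 99.996% | 99.992% |
| 1/1500 min | Baseline | 99.0% | 96.1% | 86.2% | 51.0% |
| 1/15000 min | VM-µCheckpoint | 99.993% | 99.99% | 99.96% | 99.8% |
| 1/15000 min | Remus | 99.998% | 99.998% | 99.998% | 99.997% |
| 1/15000 min | Baseline | 99.90% | 99.6% | 98.4% | 91.1% |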
7.2.3. Generalized Availability Model for VM-µCheckpoint 
We generalize the availability model of VM-µCheckpoint, as described in Figure 27, to cases 
when there are n VMs on top of a hypervisor. The state of the system at any time is denoted as a 
vector (k, j) in the generalized model, where k is the number of the n protected VMs that are 
available and j is the number of the n protected VMs whose checkpoints are kept in the 
checkpointing VM. The system stays in the state (n, n) in the failure-free situation.  
The generalized model consists of a number of submodels, illustrated in Figure 29.  
[Figure 29 diagram: state-transition diagrams of submodels (a), (b), and (c). States are written (k, j), (k, F), or FS, with m = 0, 1, …, n; j = 0, 1, …, n; k = 0, 1, …, j.]
Figure 29: The Generalized Availability Model for n Protected VMs, with (a)-(c) 
Illustrating Submodels 
The submodels are as follows: 
a) Those that capture the transient failure and recovery of protected VMs when the 
checkpointing VM is available, as well as the failure of the checkpointing VM. Figure 29 (a) 
illustrates such a submodel for those states where the checkpointing VM keeps checkpoints for m 
protected VMs. In this figure, we can see that the state (m, m) transitions to the state (m-1, m-1) 
when there is checkpoint corruption associated with the failure, while it transitions to the state 
(m-1, m) when there is no checkpoint corruption. There are n+1 such submodels for capturing the 
states with the m value ranging from 0 to n.   
    b) The submodel that captures the transient failure and recovery of protected VMs when the 
checkpointing VM is unavailable, as well as the recovery of the checkpointing VM (Figure 29 
(b)).  
    c) The submodel that captures permanent failures and hypervisor failures, as well as the 
corresponding recovery (Figure 29 (c)). 
So there are n+3 submodels in the generalized availability model in Figure 29. We employ the 
CLAPACK tool package to compute the equilibrium condition for all the model states and solve 
the model. 
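For reference, the equilibrium condition of a continuous-time Markov chain is the standard set of global balance equations together with a normalization constraint. The notation below (state set S, generator matrix Q with entries q, and steady-state vector π) is introduced only for this note and is not taken from the model figures.

```latex
\begin{align}
  \boldsymbol{\pi}\, Q &= \mathbf{0}, \\
  \sum_{s \in S} \pi_s &= 1, \\
  q_{ss} &= -\sum_{s' \neq s} q_{ss'} \quad \text{for every } s \in S .
\end{align}
```

Here the off-diagonal entry of Q for a pair of distinct states is the transition rate between them. For an irreducible chain, replacing any one balance equation with the normalization constraint gives a nonsingular linear system, which is the kind of dense system a linear algebra package can solve directly.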
7.2.4. Availability Model for Baseline 
To evaluate the availability improvements VM-µCheckpoint provides, we also model the baseline 
case for comparison purposes (shown in Figure 30). The model captures scenarios with n virtual 
machines on top of a hypervisor. There is no checkpointing of VMs. When a failure occurs to a 
VM, the VM is restarted, with all interrupted jobs started from the beginning.   
[Figure 30 diagram: state-transition diagram. State k denotes that k VMs are available (k = 0, 1, 2, …, n); FS denotes that the physical node has failed.]
Figure 30: The Availability Model for Baseline (n Protected VMs) 
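The baseline model is small enough to solve in a few lines, which makes it a convenient sanity check. The sketch below is one reading of Figure 30, not the original solver: states 0..n count the available VMs, a failure takes state k to k-1 at rate kλp, a restart takes k to k+1 at rate rp, every state moves to FS at rate λv, and FS is assumed to return to state 0 at rate rv (after a node restart the VMs still have to restart their jobs). The function names and the pure-Python Gaussian elimination are ours. With the parameter values of Section 7.2.2 and n = 1, this reading gives values that round to the Baseline rows of Table 15.

```python
def steady_state(rates, n_states):
    """Solve pi * Q = 0 with sum(pi) = 1; rates maps (src, dst) -> rate."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for (src, dst), rate in rates.items():
        a[dst][src] += rate  # inflow into dst from src (transposed generator)
        a[src][src] -= rate  # total outflow from src
    a[-1] = [1.0] * n_states  # replace one balance equation by normalization
    b = [0.0] * (n_states - 1) + [1.0]
    # Gaussian elimination with partial pivoting
    for col in range(n_states):
        pivot = max(range(col, n_states), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n_states):
            factor = a[row][col] / a[col][col]
            b[row] -= factor * b[col]
            for k in range(col, n_states):
                a[row][k] -= factor * a[col][k]
    # Back substitution
    pi = [0.0] * n_states
    for row in reversed(range(n_states)):
        tail = sum(a[row][k] * pi[k] for k in range(row + 1, n_states))
        pi[row] = (b[row] - tail) / a[row][row]
    return pi


def baseline_availability(lam_p, r_p, lam_v, r_v, n=1):
    """Expected fraction of available VMs; equals P(VM up) when n = 1."""
    fs = n + 1  # index of the node-failed state
    rates = {}
    for k in range(n + 1):
        if k > 0:
            rates[(k, k - 1)] = k * lam_p  # one of the k running VMs fails
        if k < n:
            rates[(k, k + 1)] = r_p        # a failed VM restarts and recomputes
        rates[(k, fs)] = lam_v             # hypervisor or permanent failure
    rates[(fs, 0)] = r_v                   # assumption: VMs are down after node restart
    pi = steady_state(rates, n + 2)
    return sum(k * pi[k] for k in range(n + 1)) / n


if __name__ == "__main__":
    for mean_recovery_min in (15, 60, 240, 1440):
        avail = baseline_availability(1 / 1500, 1 / mean_recovery_min,
                                      0.000001492, 1 / 15)
        print(mean_recovery_min, round(100 * avail, 1))
```

Only the n = 1 case has been compared against the table; for n > 1 the returned quantity is one possible availability measure among several.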
7.2.5. Availability Model for Remus 
Remus maintains a backup of a VM on a remote host by migrating the state from the primary to 
the backup periodically. When the VM fails, the backup VM begins to execute from the last 
checkpoint. We design a Markov model for Remus to compare the availability enhancements 
provided by VM-µCheckpoint and Remus. 
Due to the complexity of the Markov model for Remus, here we only describe the model for the 
cases in which there is only one VM on top of a hypervisor. Figure 31 presents this model and 
explains how the system state is represented in the model. The p’c in the figure denotes the 
probability of checkpoint corruption in Remus. The following paragraphs briefly describe part of 
the model (the highlighted states and transitions in the figure) to illustrate how the system 
behavior is modeled. 
[Figure 31 diagram: state-transition diagram. A system state is written (k, j), where k is the state of the primary machine and j is the state of the backup machine; each takes the value 1 (protected VM available), 0 (protected VM failed), or F (physical node failed).]
Figure 31: The Availability Model for Remus (1 Protected VM) 
The system stays in the state (1, 1) during normal behavior. When a failure occurs to the VM on 
the primary host and that failure does not cause checkpoint corruption, the backup host resumes 
the execution of the VM from the last checkpoint immediately. As a result, the system transitions 
to the state (0, 1). We assume that the previous primary host automatically sets up a backup VM, 
i.e., begins to collect a checkpoint of the current executing system after the failover, at the rate r’p. 
The previous primary host is the backup host now, and the state (0, 1) transitions to (1, 1) at the 
rate r’p. 
If a VM failure causes checkpoint corruption, the backup host cannot resume the execution of the 
VM successfully. So the state (1, 1) transitions to the state (0, 0). At this time, the VM is restarted 
on the primary host and all the jobs in the VM are restarted from the beginning at the rate rp. The 
system state becomes (1, 0). The backup host begins to collect a checkpoint of the VM from the 
primary host. So the state (1, 0) transitions to the state (1, 1).  
When a permanent failure occurs to the primary host, the system transitions from (1, 1) to (F, 1), 
in which state the VM is still available in the backup host. Note: an independent failure may occur 
to the backup host at the same time. So the state (1, 1) independently transitions to the state (1, F) 
at the rate λV. Remus tolerates a single permanent failure, as the VM is available in (F, 1) and (1, 
F).  
The probability of the system staying in each state of the model is obtained by solving the 
availability model. The system availability is computed as the sum of the probabilities of the 
system staying in all states (k, j) where either k or j is 1. 
 
CHAPTER 8 
RELATED WORK 
 
This dissertation consists of a number of topics including the Reliability MicroKernel framework, 
system hang detection, application/VM crash detection, and VM checkpoint, as well as 
experimental evaluation and formal model based analysis of these designs and implementations. 
Here we give the related work on the RMK, system hang detection, and VM checkpoint, which 
are the main contributions of this dissertation.  
8.1. Related Work on RMK and System Hang Detection 
Table 17 summarizes representative examples of studies (in academia and industry) and systems 
(commercial products and research prototypes) that address the provision of reliability services to 
applications using hardware and system support. Microkernel systems such as Mach [51] and 
Chorus [52], [53] provide basic resource management and communications rather than explicitly 
focusing on reliability. Reliability architectures in IBM AIX [1] and High-Availability Linux [2] are 
designed to be closely coupled with the systems. They mostly concern system reliability rather 
than exploiting application characteristics to improve application reliability. Hardware reliability 
frameworks, such as the Reliability and Security Engine (RSE) [54], provide application-aware 
mechanisms using programmable hardware modules to support the detection/recovery of runtime 
errors.  
Table 17: List of Work Related to RMK 
| Category | Study/System | Reliability Features | Comments |
|---|---|---|---|
| Microkernel Architecture | Mach [51] | Reliability is not the primary focus. | Architectural basis to build other OSs. |
| Microkernel Architecture | Chorus [52] [53] [55] | Reconfigurable microkernel; applies event/exception handling for reliability; predicates are defined in wrappers of functional services to guard against incorrect state and error propagation. | Architectural basis to build other OSs. Wrappers of microkernel services used to check system health rather than support application-aware techniques. |
| Reliability Architecture | IBM AIX [1] | Virtual-server-based system protection; application hang detection; a daemon polls the OS kernel to check if processes are being starved. | OS-level support focused on system reliability. |
| Reliability Architecture | SGI IRIX | Process checkpointing dumps process image to disk. | Needs OS-level support for preserving the entire process image. |
| Reliability Architecture | High-Availability Linux [2] | Heartbeat used for detection of node failures; membership protocol used for group communication. | Needs support for system reliability, especially cluster health. |
| Reliability Architecture | Sentry [56] | Additional layer placed on top of OS provides error masking, and system service guard. | Supports system reliability rather than application. |
| Reliability Architecture | ARMOR [35] | Self-checking middleware provides fault tolerance to applications. | Application instrumentation is required for implementing application-specific reliability techniques. |
| Reliability Architecture | RSE [57][54] | Processor-level framework provides application-aware error detection, e.g., detection of OS, and application hangs using hardware modules. | Reliability mechanisms are bound to hardware modules; needs OS support for application hang detection. |
| Hang Detection of OS or Applications | Watchdog Device [17] | System hang detection: an external PCI card is used as watchdog. | Extra hardware is required; long latency (timeout sometimes in minutes). |
| Hang Detection of OS or Applications | KHM [16] | System hang detection: a process periodically sets a mark, and the timer interrupt handler checks the mark to detect system hangs. | Fails when interrupt is disabled; long latency when system is heavy loaded; large overhead. |
| Hang Detection of OS or Applications | Linux NMI Watchdog [18] | System hang detection: if there is no timer interrupt within a few seconds, the system hang is detected. | Fails when the system hangs with timer interrupts arriving and handled. |
| Application Checkpointing | Libckpt [9], LibFT [14] | Libraries invoked by applications dump application states and/or critical data. | Not application-transparent; user-level knowledge is required for high performance. |
| Application Checkpointing | Zap [58] | Virtual machine solution transparently checkpoints applications. | Checkpoints the entire process image. |
| Application Checkpointing | Epckpt [59] | New system calls added to the kernel provide transparent checkpoint. | Checkpoints the entire process image. |
| Application Checkpointing | Incremental Checkpointing [60] | Application-transparent incremental checkpoints are provided for main memory database. | Application-used library is instrumented to support incremental checkpoints; custom solution for MMDB applications. |
| Application Checkpointing | Cache-based Checkpointing [61] | Checkpoint is triggered on cache misses; the checkpoint state is stored in memory and includes processor registers and cache. | Checkpoint interval is very short; error may propagate across checkpointing; large overhead; additional hardware. |
| Application Checkpointing | Coordinated Checkpointing [62] [63] | Checkpoint protocols are provided for distributed applications. | Checkpoints the entire process image; focuses on checkpoint protocol. |
| Application Checkpointing | Shadow Process Checkpointing [10] | Shadow process is forked to store the process image upon a checkpoint; copy-on-write is applied for dirty page copying. | Important process states like pid not preserved; inefficient copying of pages with only one/two writes; unnecessary allocation of process resources. |
| Application Checkpointing | Compiler-Assisted Checkpointing [64] | Compiler inserts checkpoint functions in programs according to user-provided directives for checkpointing critical data. | Not application-transparent; user-level directives are required. |
Most existing techniques for detecting hangs of the OS and/or applications are timer-based, and 
the timeouts are fixed values that are either preset or derived from profiling. As a result, the 
timeout values are often very conservative and cause large detection latency. Some existing 
detection mechanisms, e.g., KHM [16], cannot operate when interrupts are disabled due to system 
hangs. Others, like the Linux NMI watchdog timer [18], do not detect hangs if the timer 
interrupt continues to function despite the system hang. 
As there are many checkpoint techniques, we do not list all of them in the table. Some checkpoint 
mechanisms preserve the entire process image for failure recovery, or process migration [59], [58], 
while others provide incremental checkpointing [9], [14], [61], [10]. Libckpt [9] checkpoints dirty 
pages of a process. Mechanisms in [9] and [14] allow users to specify critical data for 
checkpointing; these require application instrumentation. While an enhanced compiler can be 
used to automate the application instrumentation, user-provided directives are still needed to 
identify the location (in the application code) where the instrumentation should be added [64]. 
Cache-based checkpointing [61] needs additional hardware to enable storing of 
application-relevant states. More importantly, none of these checkpointing schemes uses OS-level 
knowledge to avoid inconsistency between the application process image and the corresponding 
system state. For example, a checkpoint taken when the target application has pending I/O 
operations may be inconsistent with the I/O status at the time of recovery from a failure. Our 
checkpointing scheme resolves this inconsistency. 
8.2. Related Work on VM Checkpoint  
Checkpoint and rollback techniques have been extensively studied in the literature. Checkpoints 
can be taken in different levels (application, runtime library, compiler, operating system, virtual 
machine, or hardware). Here we focus on checkpoint techniques in the virtual machine level, as 
they are more relevant to our objective. 
Table 18: The Three Categories of Existing Mechanisms for VM Checkpoint 
| Mechanisms | Brief Description | Comments |
|---|---|---|
| Stop-and-save | Stops a VM completely, and saves its state to persistent storage | i) large system downtime; ii) provided by all major VMM systems |
| Low-freq (e.g. interval of hours or longer) based on live migration, e.g. CEVM, VNsnap, VM snapshots | Creates a VM replica on a remote node via live migration, then the remote node writes the replica to disk | i) Significant recomputation during recovery as checkpoint frequency is low; ii) Large overhead of maintaining a full replica for a protected VM |
| High-freq (interval of 10~1000 ms) based on live migration, e.g. Remus | Maintains a VM replica on a separate physical node via live migration, and fail-overs upon a failure | i) Large overhead while migrating latest updates to the remote node continuously (around 50% overhead for 50ms checkpoint interval); ii) maintaining a full replica for a protected VM; iii) fail-stop assumption |
 
VM checkpointing. Table 18 lists the existing mechanisms of VM checkpoint. Basically there are 
three categories of VM checkpoint: (i) the traditional stop-and-save way of VM checkpoint, (ii) 
low-frequency VM checkpoint based on live migration, and (iii) high-frequency VM checkpoint 
based on live migration.  
In the first category of VM checkpoint, a VM is stopped completely, its state is saved in persistent 
storage, and the VM is then resumed. This approach incurs a large system downtime during the 
checkpoint; live migration is leveraged to avoid it. 
Most existing VM checkpoint/replication techniques are based on live migration of VMs (e.g., 
VMware VMotion [65] and Xen Live Migration [3]), which continually transmits dirty pages of a 
VM from a source node to a destination node. These techniques exploit the live migration 
mechanism for the purposes of VM checkpointing, VM rejuvenation, load-balancing, and fast 
VM forking.  
CEVM [36], VNsnap [37], and VM Snapshots [66] are techniques of the second category of VM 
checkpoint in Table 18. These techniques employ VM live migration or copy-on-write to create a 
replica image of a VM with low downtime incurred; then the image is written to disk in the 
background or by the separate physical node. An ongoing project on VM checkpoint [66] aims to 
provide a generic API in Xen for saving a VM snapshot to disk on demand. The VM memory is 
scanned and saved to files while the VM continues to run, and copy-on-write is used to preserve 
the original contents of VM state that is modified during checkpointing. 
VM-µCheckpoint is different from these disk-based VM checkpointing schemes in that we aim at 
(i) providing high-frequency checkpointing and rapid recovery of VMs with low overhead, which 
allows VM failures to be masked to clients, and (ii) proposing a mechanism to alleviate 
checkpoint corruption in high-frequency checkpointing (checkpoint corruption has a greater impact 
on service availability than permanent failures, as shown in our model). Disk-based VM 
checkpointing is too costly and cannot keep up with the rate of high-frequency 
checkpointing (tens of checkpoints per second). 
The third category of VM checkpoint is the high-frequency VM checkpoint based on live 
migration. The existing approach in this category is Remus [38], which maintains a backup VM 
on a separate physical node by periodically transmitting the VM’s dirty pages to the backup. 
Similar to VM-µCheckpoint, Remus is a mechanism of high-frequency VM checkpointing and 
failover. But VM-µCheckpoint focuses on error behavior and reliability/availability improvement, 
while Remus focuses on migration overhead. No study of error behavior or reliability/availability 
is reported in [38]. As checkpoint corruption is not handled in Remus (fail-stop errors are 
assumed), our technique is better in improving service availability. Moreover, there is large 
overhead in high-frequency migration-based VM checkpoint because the latest state updates are 
migrated to the remote node continuously. For example, our checkpoint algorithm incurs around 
6.3% overhead at the checkpoint intervals of 50 ms, while Remus incurs around 50% overhead for 
the same checkpoint frequency. 
Other techniques that may be relevant to VM checkpointing are briefly described as follows. 
Bradford et al. [67] focus on migrating persistent state of a guest system across WAN so that the 
guest system can migrate to a node that does not share storage with the original node. [68] revises 
Xen live migration to fit in a self-migration scenario. Distler et al. [69] and Nagarajan et al. [70] 
implement proactive VM rejuvenation based on live migration, and [71] uses live migration for 
load-balancing. Potemkin [72] employs copy-on-write to share data between VMs for efficiently 
provisioning VMs. Another technique of fast VM forking is [73]. Though one may use these 
techniques to checkpoint a VM by periodically forking a shadow VM and tearing down out-dated 
shadow VMs, spawning a VM and tearing down a VM involve a lot of overhead that is not 
necessary for checkpointing. Moreover, error analysis and reliability/availability study is an 
integral part of a checkpointing technique for failure mitigation. 
Multi-checkpoint mechanisms. As far as we know, none of the existing checkpoint techniques 
considers handling checkpoint corruption by explicitly including error latency bound as a 
parameter, though multi-checkpoint mechanisms can be leveraged to deal with checkpoint 
corruption. IBM System Z [74] allows multiple checkpoints of an application to be recorded in 
persistent storage on demand. Ping-Pong checkpoint [75] maintains two checkpoints to deal with 
incomplete checkpoint due to errors during the checkpointing procedure, rather than checkpoint 
corruption due to latency of error detection.  
In-place restoration. Hardware-level checkpoint techniques [76], [77] use special hardware to 
take and store a checkpoint. When an error is detected, the checkpoint saved in the special 
hardware is restored into the architecture state of the physical machine, including register file and 
memory. For example, the first update of a memory word or a register during a checkpoint 
interval is preserved in special hardware in SafetyNet [77]. 
 
 
CHAPTER 9 
CONCLUSIONS 
 
The thesis describes the Reliability MicroKernel (RMK) framework, a loadable kernel module for 
providing application-aware reliability and configuring reliability mechanisms. The RMK 
prototype has been implemented in native systems of Linux and Windows on a Pentium 4 
processor without recompiling the kernel, and is also extended to the Xen VMM to provide highly 
available VMs (RMK is encapsulated into a hypercall in Xen, and the Xen hypervisor is 
recompiled but VMs are not recompiled). By combining the low-latency error detection and 
transparent checkpoint techniques into the RMK framework, we provide an automatic approach to 
error detection and recovery for real-world systems. 
The RMK on native systems supports the detection of application/system failures and transparent 
application checkpointing. The experimental evaluation of the RMK using real-world 
applications shows that system hang detection and application hang detection, which exploit 
characteristics of application and system behavior, achieve high coverage and low false-positive 
rates (1 out of 2000 experiments for application hang detection, and no false positives for system 
hang detection). Because OS-level knowledge of the applications and the system is used, the RMK 
prototype incurs low overhead in providing transparent application checkpointing (less than 0.1% 
in the experiments) and application failure detection (0.6% performance overhead).  
The hang detection of VMs is adapted from the system hang detection in native systems. 
Moreover, we proposed VM-µCheckpoint, a lightweight VM checkpointing technique, which 
minimizes overhead by placing checkpoints in memory and performing in-place recovery. 
VM-µCheckpoint also addresses the problem of checkpoints becoming corrupted in 
high-frequency checkpointing. We showed that it is important to take into account expected 
durations for errors to manifest themselves in order to determine checkpoint intervals.  
A model-based study was conducted to show that VM-µCheckpoint achieves better results than 
existing migration-based VM checkpointing. For example, for an average job duration of 8 hours 
in a 99%-available VM (on top of a hypervisor with the MTTF of 1.7 years), we achieve an 
availability of 99.7%, while the migration-based VM checkpointing achieves 97.7%.  
Experimental results show that the proposed technique achieves much better performance than 
existing techniques based on VM live migration. There is an average of 6.3% overhead in terms of 
program execution time for the SPEC CINT 2006 benchmark when VM-µCheckpoint is deployed 
at a checkpoint frequency of 20 times per second. (An approximately 50% overhead is reported in 
a previous technique [38] at the same checkpoint frequency.) Moreover, the checkpoint size is 
small in VM-µCheckpoint: an average of 2.6 MB in our experiments when the CoW-P algorithm 
is applied with 50 ms checkpoint intervals. 
 
APPENDIX A 
MODELING COORDINATED CHECKPOINTING FOR 
LARGE-SCALE SUPERCOMPUTERS 
 
A.1. Introduction 
The computational demands of emerging applications such as protein folding are giving rise to a 
new generation of supercomputers, consisting of several thousand processors (currently in 
planning). For example, the newly deployed IBM BlueGene/L [78] is expected to scale to 64K 
dual-processor nodes. Despite the huge computing power they provide, the large number of nodes 
makes the system significantly more vulnerable to errors. As a result, the effects of the larger 
number of failures due to errors can impair the system performance and limit its scalability.  
Although a hierarchy of error detection and recovery techniques, like ECC, CRC and message 
retransmission, can be used to correct some errors/failures, there are transient errors/failures that 
cannot be covered using these techniques, e.g. the corrupted states due to propagation of 
undetected errors. For these errors/failures, checkpointing and rollback may be used as the last 
option to recover the application before rebooting or reconfiguring the system. This work focuses 
on errors/failures that need checkpointing and rollback to recover.  
The most commonly used checkpointing scheme for supercomputing systems is coordinated 
checkpointing, due to its simplicity of implementation. In this approach, cooperative processors 
synchronize to ensure a globally consistent state before taking a checkpoint [79]. The main 
problem with coordinated checkpointing is its lack of scalability, as it requires all processors to 
take a checkpoint simultaneously.  
There are two main contributions made by this work. First, it builds a model of a large-scale 
system that uses coordinated checkpointing for recovery from failures with complex semantics. 
Second, it studies the scalability and performance of the system for several hundred thousand 
processors by simulating the model with realistic parameter values. 
An important issue considered in our model is the effect of scaling from several thousand 
processors to several hundred thousand processors, i.e. by two orders of magnitude. Issues such as 
failures during checkpointing and recovery, correlated failures within the system, and 
checkpointing overhead due to coordination, are of primary importance for the new generation of 
supercomputers. This is because their larger number of nodes and higher failure rates invalidate 
some of the assumptions about system behavior made by existing models [80], [81], [82], [83], 
[84], [85] and exacerbate some effects considered negligible before. These assumptions are:  
• The computation interval and the checkpoint overhead are much smaller compared to the 
mean time between failures (MTBF). However, large-scale supercomputers experience much 
smaller MTBFs and much larger checkpoint overheads, and hence failures during checkpointing 
and recovery can occur and must be taken into account [86].  
• Failures are independent of each other. This is not a valid assumption, as Tang and Iyer [87] 
showed that even a small number of correlated failures increase system unavailability 
considerably. 
• The overhead of inter-processor coordination for checkpointing is negligible. However, as the 
number of nodes increases the coordination overhead grows and it cannot be ignored. 
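The first assumption above can be illustrated with a minimal sketch. It assumes exponentially distributed times between failures, and the two MTBF values and the 30-minute exposure are hypothetical, chosen only to contrast a small system with a large one.

```python
import math

def prob_failure_within(duration_s: float, system_mtbf_s: float) -> float:
    """Probability of at least one failure during `duration_s`,
    assuming exponentially distributed times between failures."""
    # 1 - exp(-t / MTBF), written with expm1 for accuracy when t << MTBF.
    return -math.expm1(-duration_s / system_mtbf_s)

HOUR_S = 3600.0
# Hypothetical exposure: 30 minutes spent checkpointing or recovering.
exposure_s = 0.5 * HOUR_S
p_small_system = prob_failure_within(exposure_s, 1000 * HOUR_S)  # MTBF of 1000 hours
p_large_system = prob_failure_within(exposure_s, 1 * HOUR_S)     # MTBF of 1 hour
```

Under these assumed values the probability is negligible for the long-MTBF system and substantial for the short-MTBF one, which is why failures during checkpointing and recovery cannot be ignored at scale.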
A measure called useful work similar to accumulated reward [88] is used to evaluate system 
performance. It is defined as the computation that contributes to the ultimate completion of the job 
(see definition in Section A.8). If a failure occurs before the computation can be checkpointed, the 
computation since the last checkpoint needs to be repeated after the recovery, and is not counted 
as useful work. Accurate modeling of useful work requires knowledge of future behavior of the 
system and cannot be represented using simple Markov models. Instead, Stochastic Activity 
Networks (SAN) are used to model the system behavior. The modeling power of SANs allows us 
to concisely represent complex system phenomena such as checkpoint coordination, failures 
during checkpointing and recovery, and correlated failures. The SAN model is studied using 
simulation and the impact of system parameters on system performance and scalability is 
evaluated. 
A.2. Related Work 
Checkpointing Models. One of the earliest models for computing the optimal checkpointing 
interval is by Young [80]. This model assumes that the MTBF of the system is very large 
compared to the checkpoint and recovery time, and hence does not consider failures during 
checkpointing and recovery. Daly [81] presents a modification of Young’s model for large-scale 
systems. This model takes into account failures during checkpointing and recovery, and multiple 
failures in a single computation interval. However, it does not model the coordination overhead of 
the checkpointing protocol itself or consider correlated failures.  
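For orientation, the first-order approximation usually attributed to Young can be sketched as follows. The formula (the square root of twice the product of checkpoint cost and MTBF) is the commonly cited form and is not restated in this appendix; the numeric inputs are hypothetical.

```python
import math

def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * cost * MTBF).

    Valid only when the MTBF is much larger than the checkpoint cost,
    which is the assumption questioned in this appendix.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical inputs: 60 s checkpoint cost, 24 h system MTBF.
interval_s = young_interval_s(60.0, 24 * 3600.0)
```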
Kavanagh and Sanders [82] evaluate two time-based coordinated checkpointing protocols based 
on analytical and simulation models, which take the overhead of coordination into account. 
However they do not consider failures during checkpointing and recovery as they assume that the 
MTBF of the system is much greater than the checkpoint interval.  
Plank and Thomason [83] investigate the use of spare nodes to provide redundancy in the system 
to handle permanent failures. We do not consider permanent failures in our model and assume that 
all nodes can be recovered by restarting the system from the last-saved checkpoint. Plank and 
Thomason do not consider the overhead of coordination in the model or the effect of scaling their 
model to a large number of nodes. A recent paper by Elnozahy et al. [84] extends the work of 
Plank and Thomason to systems consisting of thousands of nodes. It considers the effects of 
failures during checkpoint and recovery and multiple failures in a single computation interval. 
However, it does not consider the effects of coordination among the nodes in the checkpointing 
protocol, nor does it consider correlated failures.  
Vaidya [85] derives an analytical expression for the optimal checkpointing frequency in a 
uniprocessor system. It distinguishes the checkpoint latency from the overhead of a checkpointing 
scheme. This model considers failures during checkpointing/recovery but does not take into 
account the scalability of the checkpointing protocol or the system. 
Large-Scale Systems. Bronevetsky et al. [89] present a compiler-based technique for 
asynchronous, coordinated checkpointing. Agarwal et al. [90] consider an adaptive, incremental 
checkpointing technique for scientific applications on large-scale systems. Finally, Zhang et al. 
[91] extensively study failure data analysis in large-scale supercomputing systems and show the 
existence of temporal and spatial correlation among failures in large-scale systems. We consider 
temporal correlations in our model (correlated failures) but not spatial correlations. 
A.3. Target System  
This study focuses on a typical abstract structure commonly shared by many supercomputers and 
a basic coordinated checkpointing protocol whose variants are applied in the supercomputing 
world.  
A.3.1. Architecture  
Each node of the supercomputing system is a tightly-integrated unit consisting of multiple 
processors. For example, Blue-Gene/L has 2 processors per node and ASCI Q has 4 processors 
per node. Future systems could have 8, 16 or 32 processors per node. 
Usually large-scale supercomputing systems have dedicated nodes for job computation (compute 
nodes) and for I/O operations (I/O nodes). A set of compute nodes shares the connections to an 
I/O node, and all the I/O nodes are connected to a parallel file system through a separate 
connection network. For example, IBM BG/L has 64K compute nodes and 1024 I/O nodes. The 
network bandwidth from 64 compute nodes to one I/O node is 350 MB/s, and the bandwidth from 
one I/O node to the file system is 1 Gb/s. 
Data writes from compute nodes to the file system are performed in two steps: first from compute 
nodes to I/O nodes, and then from I/O nodes to file system. The I/O nodes locally buffer the 
application data or checkpoint they receive from the compute nodes, and write it to the file system 
in the background while the compute nodes continue with the computation. The two steps are 
reversed for data reads with the exception that reads cannot be done in the background, as the 
application may have to wait for the data to be read before proceeding, depending on the nature of 
the read.24  
A.3.2. Checkpoint Protocol 
There are two checkpointing approaches used in supercomputing systems. One is 
application-based, where a global barrier is explicitly used in the application for saving a global 
consistent state. This places the burden of checkpointing on the application (e.g. in BlueGene/L 
[78]). The other approach is system-supported checkpointing (e.g. the algorithm used by Cray in 
the IRIX OS [92]). Our checkpointing protocol is a system-supported, synchronous scheme that 
follows the basic principles of coordinated checkpointing, e.g. Koo and Toueg’s protocol 
[93]. 
In our protocol, a single coordinator node, or master, periodically initiates the checkpointing as 
follows: 
(1) The master broadcasts a ‘quiesce’ request to all the compute nodes. 
(2) On receiving ‘quiesce’ each node quiesces its operations, i.e. stops all its activities at a 
consistent and interruptible state, and replies ‘ready’ to the master. 
(3) After receiving ‘ready’ from all the compute nodes, the master broadcasts ‘checkpoint’ to all 
the compute nodes. 
(4) On receiving ‘checkpoint’ each compute node dumps its state to an I/O node. 
(5) When all the compute nodes are done dumping their states, the master broadcasts ‘proceed’ 
to all the compute nodes, and the I/O nodes begin to write the checkpoint to the file system in 
the background. 
(6) On receiving ‘proceed’ each compute node continues its activity from the point where it was 
quiesced. 
Further, a timeout period is specified at the master to avoid waiting indefinitely for the ‘ready’ 
responses. Such an indefinite wait can occur when an erroneous or failed node does not 
respond to the quiesce request. If not all the responses are received within this time, the master 
times out and broadcasts an ‘abort’ message to all the compute nodes. The compute nodes then 
abandon the checkpointing and proceed with their computation. 
24 While current supercomputing systems may not have this capability, future systems might allow this two-step I/O. 
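The master's side of the protocol can be sketched as below. This is a simplified illustration, not the implementation studied here: it reduces each node to its quiesce time, represents a node that never replies by an infinite quiesce time, and ignores failures during the dump. The function name and the list-of-broadcasts return value are hypothetical.

```python
from typing import List

def master_round(quiesce_times_s: List[float], timeout_s: float) -> List[str]:
    """Return the sequence of messages the master broadcasts in one round."""
    broadcasts = ["quiesce"]                      # step (1)
    # Steps (2)-(3): the master needs 'ready' from every node, so it waits
    # for the slowest one, but never longer than the timeout.
    if max(quiesce_times_s) <= timeout_s:
        broadcasts += ["checkpoint", "proceed"]   # steps (3) and (5)
    else:
        broadcasts.append("abort")                # timeout path
    return broadcasts

ok_round = master_round([1.0, 2.5, 0.7], timeout_s=20.0)
failed_round = master_round([1.0, float("inf")], timeout_s=20.0)
```

A single unresponsive node is enough to abort the round, which is why the timeout value appears as a parameter in the experiments.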
A.3.3. Application 
The application is a parallel, scientific computing workload composed of multiple computation 
tasks. Each compute processor runs exactly one task of the parallel application and no other tasks.  
Application tasks may be performing computation, communication or I/O at any time. Since most 
parallel, scientific applications are written using the BSP (Bulk Synchronous Parallel) model [94], 
the multiple tasks more or less coordinate their actions and behave as one cohesive unit.  
The application is instrumented with a number of checkpoint primitives at its safe points, i.e. 
points where it can be safely quiesced, such as a global barrier or the end of a loop. For example, in IRIX, the 
programmer inserts checkpoint functions in the source code, and the OS calls these whenever it 
wants to take a checkpoint. 
A task that is performing an I/O write cannot quiesce until it finishes the I/O operation, as quiescing mid-write could 
leave the I/O in an inconsistent state and possibly corrupt the file system. While there are methods 
to address this, ensuring global coordination is complicated and the simple approach of 
non-preemptive I/O is preferred in practice. I/O reads of a task can be stopped for checkpointing 
at any time, and hence, are not specifically considered in our model. 
A.3.4. Failure and Recovery 
On the failure of a compute node, the entire application rolls back to the last saved checkpoint and 
recovers; i.e., we only consider failures that require recovery from a checkpoint. Therefore, 
permanent/persistent errors are not modeled. 
Failures of compute nodes and I/O nodes are always detected without any latency. The 
mechanism for failure detection is not modeled. 
When an I/O node fails, all the I/O nodes need to be restarted. This assumption is reasonable since 
in the BSP model, the application needs the I/O operations on all the I/O nodes to be completed 
before continuing the computation. 
Failures of the master are not treated differently from a compute node failure, as the master is just 
another compute node in a real system, and perfect detection is assumed. 
As nodes have multiple processors, the node failure rate is the product of the processor failure rate 
and the number of processors per node. The system parameter MTTF is used to refer to the 
per-node mean time to failure throughout this appendix unless specified otherwise. The 
per-processor MTTF is then MTTF × (number of processors per node). It is assumed that advanced design 
and error handling techniques are applied to maintain low node failure rates, e.g. use of multiple 
cores on a chip. 
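The relation between per-processor, per-node and system-wide failure rates can be made concrete with a short sketch. It assumes independent failures and a 365-day year; the helper names are hypothetical, and the numeric values (64K processors, 8 processors per node, a per-node MTTF of 1 year) are illustrative.

```python
HOURS_PER_YEAR = 8760.0  # assumption: 365-day year

def per_processor_mttf_h(node_mttf_h: float, procs_per_node: int) -> float:
    """Node failure rate = processor failure rate * processors per node,
    so the per-processor MTTF is the node MTTF times that count."""
    return node_mttf_h * procs_per_node

def system_mtbf_h(node_mttf_h: float, num_nodes: int) -> float:
    """With independent node failures, rates add up across nodes."""
    return node_mttf_h / num_nodes

node_mttf_h = 1.0 * HOURS_PER_YEAR
num_nodes = (64 * 1024) // 8              # 64K processors, 8 per node
proc_mttf_h = per_processor_mttf_h(node_mttf_h, 8)
mtbf_h = system_mtbf_h(node_mttf_h, num_nodes)
```

With these values the system as a whole sees a failure roughly once an hour, even though each node fails only once a year on average.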
As there is no consensus on MTTF in the literature, we assume an MTTF value from 1 year to 25 
years due to both hardware and software errors, as we note that (i) ASCI-Q has a per-node MTTF 
of 1 year [84]; (ii) IBM 380 X processor has an MTTF of 8 years [95]; (iii) IBM mainframes have 
an MTTF of 25 years; and (iv) IBM G5 processor is advertised with an MTTF of 45 years [96] 
(hardware failures only).  
A.3.5. Correlated Failure 
This appendix models two categories of correlated failures: (i) correlated failures due to error 
propagation only and (ii) generic correlated failures.  
For correlated failures due to error propagation, we assume that recovery fully restores the 
application/system state and propagated errors do not cross recovery boundaries. The error 
propagation is characterized by a short error burst, which typically impacts the recovery. The 
duration of the error burst is referred to as the correlated failure window. The system may need to 
recover several times before a successful recovery [97]. A typical value of the correlated failure 
rate is 600 times the normal failure rate [87] (see Section A.6).  
Correlated failures may also be caused by factors other than error propagation, e.g. common 
causes such as increases in the node temperature or some environmental phenomena. Usually a 
hyper-exponential distribution is assumed for modeling generic correlated failures, i.e. the system 
experiences an independent failure rate and a correlated failure rate alternately. Unlike 
correlated failures due to propagation, the semantics of generic correlated failures is not 
necessarily limited to a short duration, but forms a global view of the system for the entire system 
life. 
A.4. Overall Composition of the Model 
The system is decomposed into several sub-systems. Each subsystem is modeled as a separate 
SAN submodel, and the overall model is obtained by integrating these submodels. All the 
compute nodes are modeled as a single unit and all the I/O nodes are modeled as another unit. 
This allows the model to scale to a large number of nodes without requiring a large simulation 
time. Table 19 lists the submodels of the entire system and Figure 32 illustrates how these 
submodels (each oval in Figure 32 represents a SAN submodel) are integrated into an overall 
model. The arrows in the figure illustrate the logical interactions between the submodels. These 
interactions are implemented by state sharing between the SAN submodels. The dots in the 
submodels in Figure 32 indicate the initial position of the tokens in the corresponding SAN. It 
should be emphasized that Figure 32 is not a state diagram, in that the ovals are not 
representations of the states of the system at any particular time. The submodels are organized 
into four modules: computing & checkpointing, failure & recovery, correlated failure, and useful 
work computation.  
Computing & checkpointing module. The compute_nodes submodel depicts the computation 
and checkpointing behavior of the compute nodes in the failure-free mode. While the compute 
nodes are in execution, the application may be performing either computation or I/O operations 
and this is represented in the app_workload submodel. The master submodel represents the 
master node in the coordinated checkpointing protocol. It triggers and coordinates the 
checkpointing, as modeled in the compute_nodes submodel. The coordination among the compute 
nodes is modeled in the coordination submodel. The io_nodes submodel captures the I/O 
operations conducted by I/O nodes. It receives data from the compute_nodes submodel, 
writes/reads checkpoints to/from the file-system, and writes data on behalf of the application in 
the app_workload submodel. These five submodels form the computing & checkpointing module 
of the system model and are further described in Section A.5. 
Table 19: Submodel List 

| Module | Submodel | Comments |
|---|---|---|
| Computing & Checkpointing | app_workload | Application state: performing computation or I/O operations |
| Computing & Checkpointing | compute_nodes | Compute processor state in the checkpoint cycle: executing (including both application’s computation and I/O operations), quiescing, or checkpoint dumping |
| Computing & Checkpointing | coordination | Coordination procedure for checkpointing |
| Computing & Checkpointing | io_nodes | I/O processor state: idling (including data transmission between compute nodes), writing application data, writing checkpoint, or reading checkpoint; if checkpoint is locally buffered |
| Computing & Checkpointing | master | System checkpointing state: if checkpointing is started or not |
| Failure & Recovery | comp_node_failure | Failure behavior of compute nodes |
| Failure & Recovery | comp_node_recovery | Recovery behavior of compute nodes |
| Failure & Recovery | io_node_failure | Failure behavior of I/O nodes |
| Failure & Recovery | io_node_recovery | Recovery behavior of I/O nodes |
| Failure & Recovery | system_reboot | System reboot operation |
| Correlated Failure | correlated_failures | Correlated failure behavior |
| Useful Work | useful_work | Useful work computation |

[Figure 32 (diagram not reproduced): one oval per SAN submodel (app_workload, compute_nodes, coordination, master, io_nodes, comp_node_failure, comp_node_recovery, io_node_failure, io_node_recovery, system_reboot, correlated_failures, useful_work), grouped into the modules computing & checkpointing, failure & recovery, correlated failure, and useful work. Arrow labels: checkpointing control, detail expansion, I/O operation, checkpoint dump/read, failure, I/O failure, recovery starts, recovery completes, severe failures, reboot completes, failure rate control, useful work computation.]
Figure 32: The Overall Composition of the Model 
Failure & recovery module. A compute node or I/O node may fail in any of its states. The 
occurrence of failures in compute nodes is modeled in the comp_node_failure submodel. The 
recovery is initiated following the detection of the failure, and is modeled in the 
comp_node_recovery submodel. As failures may also occur during recovery, compute nodes may 
experience multiple failures and subsequent recoveries in the comp_node_recovery submodel 
before the final successful recovery, after which the system resumes the normal execution and 
checkpointing cycle. Failures of compute nodes do not affect the I/O nodes if error propagation is 
not considered. The behavior of I/O nodes is similar, except that when an I/O node fails while 
writing application data to the file system, the application results are lost and the system rolls back 
to the last checkpoint. This is represented in Figure 32 by an arrow from the io_node_failure to 
the comp_node_failure submodels.  
The recovery process occurs in two stages. First, the I/O nodes read the checkpoint from the 
filesystem and buffer it in their local memories. Then the compute nodes read the checkpoint from 
the I/O nodes and complete the recovery. The compute nodes then go back to the execution state, 
the master process gets reset and the system exits the correlated failure window if there was one. 
If the checkpoint is already locally buffered in the I/O nodes when a compute node fails, the first 
stage is skipped. If an I/O node fails while writing out a checkpoint, the checkpoint is aborted and 
the I/O nodes get restarted, but the compute nodes are not affected.  
If the number of unsuccessful recoveries in the comp_node_recovery and/or io_node_recovery 
submodel(s) exceeds a predefined threshold, the whole system, including the compute nodes and 
I/O nodes, is rebooted in system_reboot (“severe failures” transitions from comp_node_recovery 
and io_node_recovery to system_reboot in Figure 32). When the reboot completes, I/O processors 
are ready for execution, but compute nodes still need to read the last checkpoint and recover. So 
the arrows of “reboot completes” from the system_reboot submodel point to the io_nodes and 
comp_node_failure submodels, instead of the compute_nodes submodel in Figure 32.  
Correlated failure module. The correlated_failures submodel models the semantics of 
correlated failures separately from the compute and I/O nodes’ failure and recovery submodels. It 
controls the rates of all failures in the system. When a correlated failure occurs, the system enters 
a correlated failure window, in which it experiences failures with a higher rate than the 
independent failure rate. Note that independent failures can continue to occur when the system is 
within a correlated failure window. 
Useful work module. The useful_work submodel calculates the useful work completed by the 
system. A positive reward is accumulated when the compute nodes perform job computation or 
I/O operations, and a negative reward equal to the amount of the lost work is applied when a 
compute node fails. 
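The reward accounting described for the useful_work submodel can be sketched as a small accumulator. This is an illustrative simplification rather than the SAN reward definition itself; the class and method names are hypothetical, and time is used as the unit of work.

```python
class UsefulWork:
    """Accumulates useful work and discards unprotected work on failure."""

    def __init__(self) -> None:
        self.total = 0.0             # accumulated reward so far
        self.since_checkpoint = 0.0  # work not yet covered by a checkpoint

    def execute(self, duration: float) -> None:
        # Positive reward while computing or doing application I/O.
        self.total += duration
        self.since_checkpoint += duration

    def checkpoint_done(self) -> None:
        # Work up to this point survives later failures.
        self.since_checkpoint = 0.0

    def compute_node_failure(self) -> None:
        # Negative reward equal to the work lost by rolling back.
        self.total -= self.since_checkpoint
        self.since_checkpoint = 0.0

acc = UsefulWork()
acc.execute(30.0)
acc.checkpoint_done()
acc.execute(20.0)
acc.compute_node_failure()   # the 20 units since the checkpoint are lost
acc.execute(30.0)
```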
A.5. Modeling Computing and Coordinated Checkpointing 
In this section, we describe the details of modeling the computing and coordinated checkpointing 
module using SANs. Due to space limitations, detailed SAN models of the other three modules are 
not described in this appendix; the reader may refer to the report [98] for them. 
Figure 33 shows the SAN submodels for the computing and coordinated checkpointing module. 
States are shared among the submodels with the same names. Selected shared states are numbered 
in Figure 33 to help identify them.  
[Figure 33 (diagrams not reproduced): SAN submodels in five panels: (a) compute_nodes, (b) io_nodes, (c) app_workload, (d) master, (e) coordination. Shared states are labeled with the numbers 1 to 9.]
Figure 33: Submodels for Computing and Checkpointing 
When the application is started in the system, the compute nodes start out in the execution state 
and the master is in the master_sleep state. We assume the application starts doing computation 
and the app_workload is in the compute state. The I/O nodes are in the ionode_idle state. Initially, 
each of these states has a token, indicated by block arrows in Figure 33. In our model the 
non-random events are modeled as deterministic activities, and exponential distribution is 
assumed for random events. To simplify the model while still preserving its validity, message 
transmissions are not explicitly modeled in SAN, but the parameters of the corresponding events 
are appropriately set to include the message transmission latency. The following steps detail the 
behavior of the model. 
• First, assume that the checkpoint interval expires and the checkpoint activity is enabled. The 
master moves from the master_sleep state to the master_checkpointing state and starts a timer as 
shown by the start_timer gate (Figure 33d). 
• The compute nodes are initially in the state execution. When the master moves to 
master_checkpointing, the compute nodes move to the quiescing state after a latency of 
recv_quiesce_bcast_time (broadcast overhead) (Figure 33a). 
• Henceforth, the behavior depends on whether the application workload is performing 
computation or I/O. If the app_workload is in the compute state, the coordination for 
checkpointing is started, as shown in the to_coordination activity. If the app_workload is in the IO 
state, the compute nodes wait until the I/O completes before starting the coordination activity 
(Figure 33c). 
• After the coordination activity (coord) completes, a token is placed in the 
complete_coordination state, enabling the activity coordinate in compute_nodes and the compute 
nodes move from quiescing to checkpointing (Figure 33e, Figure 33a). 
• If the timer expires before the coordination is complete, it places a token in the timedout state. 
This activates the skip_chkpt2 activity in compute_nodes, causing the compute nodes to abort the 
checkpointing and move to the back_to_execution state (Figure 33d, Figure 33a and Figure 33e).  
• When the compute node is in the state checkpointing and the I/O node is in the state 
ionode_idle, the dump_chkpt activity is enabled. The checkpoint dump time depends on the 
checkpoint size and the bandwidth between the compute nodes and the I/O nodes (Figure 33a). 
• After storing the checkpoint, the compute nodes go back to the execution state. The 
completion of this activity also places tokens in the enable_chkpt state (Figure 33a). 
• When the I/O node is in ionode_idle, it sees the token in the enable_chkpt state and goes to the 
writing_chkpt state. This enables the write_chkpt activity, which models the writing of the 
checkpoint to the file system. The latency of the write depends on the checkpoint size and the 
bandwidth between the I/O node and the file system (Figure 33b). 
• If the I/O node is not in ionode_idle, the compute node has to wait for the I/O node to come to 
the ionode_idle state before sending the checkpoint to it. This prerequisite is enforced by the 
ionode_is_idle input gate (Figure 33a). 
• When the checkpointing is completed or aborted, tokens are placed in the two states 
chkpt_completed_or_aborted and to_reset_processor_state. The tokens cause the master to move 
back to the master_sleep state and the app_workload to reset at the compute state (Figure 33c). 
Since the model considers all the compute nodes as a single unit, it does not reflect the 
discrepancy in the quiesce times among the compute nodes and does not show how the variation 
in the quiesce time among the nodes can cause the master to timeout. This behavior is modeled 
separately in the coordination submodel (Figure 33e). It is assumed that each node has an 
identical, exponentially-distributed quiesce time with the mean of MTTQ. We use a random 
variable Y, representing the maximum of all the quiesce times, to model the coordination time as 
follows. 
Let $n$ and $X_i$ denote the number of compute nodes and the $i$th node’s quiesce time, respectively, 
and let $Y = \max\{X_i\}$ ($1 \le i \le n$). Then, the CDF of $Y$ is $F_Y(y) = (F_X(y))^n = (1 - e^{-\lambda y})^n$, 
where $\lambda$ is the quiesce rate of a single compute node. $Y$ can be generated from a uniform random 
variable $U$ between 0 and 1 by $Y = -\frac{1}{\lambda}\log(1 - U^{1/n})$. The value of $Y$ is used as the latency in 
the coord activity in the coordination submodel to represent the coordination process. 
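The inverse-transform step above can be sketched in a few lines. The function name is hypothetical; the only deviation from the formula as written is that 1 - U^(1/n) is computed through `expm1`, which avoids loss of precision when n is in the thousands and U^(1/n) is very close to 1.

```python
import math

def coordination_time(u: float, n: int, lam: float) -> float:
    """Map a uniform sample u in [0, 1) to Y = max of n Exp(lam) quiesce times."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    if u == 0.0:
        return 0.0
    # 1 - u**(1/n), computed as -expm1(log(u) / n) for numerical stability.
    one_minus_root = -math.expm1(math.log(u) / n)
    return -math.log(one_minus_root) / lam

# Median coordination time for 8192 nodes with MTTQ = 2 s (lam = 0.5 per second);
# both values are illustrative.
y_median = coordination_time(0.5, 8192, 0.5)
```

In a simulation, `u` would be drawn afresh for every checkpoint round, for example with `random.random()`.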
A.6. Modeling Correlated Failures  
Two categories of correlated failures are modeled in the appendix: (i) correlated failures due to 
error propagation, and (ii) generic correlated failures. Both are modeled by appropriately 
increasing the node/processor failure rates. This section describes how these increased rates are 
derived. 
Correlated failures due to error propagation. When an independent failure occurs in the 
system, there is a conditional probability $p_e$ that a second failure occurs as a consequence of the 
first. This results in an increased failure rate. We compute this failure rate increase for all nodes by 
multiplying the independent failure rate by a constant parameter called frate_correlated_factor.  
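[Figure 34 (diagram not reproduced): a chain of states $F_0$, $F_1$, $F_2$, ...; the transition from $F_0$ to $F_1$ has rate $\lambda_i$, each later transition from $F_i$ to $F_{i+1}$ has rate $\lambda_c$, and every state $F_i$ ($i > 0$) returns to $F_0$ with rate $\mu$.]
Figure 34: Birth-death Markov process of correlated failures 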
Figure 34 shows the birth-death Markov process of correlated failures due to error propagation. $\lambda_i$ 
and $\lambda_c$ denote the rates of the system-wide independent failures and successive correlated failures, 
respectively. $\lambda$ is the independent failure rate of a single node. $\mu$ denotes the recovery rate of the 
system. $F_i$ is the system state in which $i$ failures have occurred before a successful recovery. As we 
assume that any successful recovery wipes out all latent errors, all the $F_i$ states transition directly to 
$F_0$ with the recovery rate. It is also assumed that the failure rates at all the $F_i$ states ($i > 0$) are the 
same. So, the conditional probability of another failure occurrence, provided that a failure occurs, 
is $p = \lambda_c/(\lambda_c + \mu)$, and consequently, $\lambda_c = p\mu/(1-p)$. Let $n$ denote the number of nodes, and $r$ denote 
the multiplier frate_correlated_factor. Then according to the model (with $\lambda_i = n\lambda$), 
$\lambda_c = \lambda_i + rn\lambda = n\lambda(1+r)$, and hence, $r = p\mu/((1-p)n\lambda) - 1$. For a given set of $n$, $\lambda$ and $\mu$, 
$r$ (i.e. frate_correlated_factor) therefore represents the conditional probability $p$. As long as $\lambda_c > \lambda_i$, 
$r$ can be chosen independently to study a range of correlated failure effects. For example, when 
$n = 1024$, $p = 0.3$, MTTR = 10 min, and MTTF = 25 yrs, $r$ is about 600. 
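The example values can be checked with a short calculation. The sketch below assumes that all rates are expressed per minute and that a year has 365 days; the function name is hypothetical. Under these unit assumptions the expression evaluates to roughly 550, the same order of magnitude as the value of about 600 quoted above.

```python
MIN_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year

def correlated_factor(p: float, mttr_min: float, n: int, node_mttf_min: float) -> float:
    """r = p*mu / ((1 - p) * n * lam) - 1, with all rates per minute."""
    mu = 1.0 / mttr_min            # system recovery rate
    lam = 1.0 / node_mttf_min      # independent failure rate of one node
    lam_c = p * mu / (1.0 - p)     # rate of successive correlated failures
    return lam_c / (n * lam) - 1.0

r = correlated_factor(p=0.3, mttr_min=10.0, n=1024, node_mttf_min=25 * MIN_PER_YEAR)

# Consistency check: recover p from r through lam_c = n*lam*(1 + r).
lam_c_check = 1024 * (1.0 / (25 * MIN_PER_YEAR)) * (1.0 + r)
p_check = lam_c_check / (lam_c_check + 1.0 / 10.0)
```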
Generic correlated failures. The system may suffer from generic correlated failures at any 
instant of the system life. A correlated failure coefficient $\rho$, the unconditional probability of a 
correlated failure occurring at any time, is used to model generic correlated failures. Table 
20 lists the parameters used for modeling generic correlated failures. The failure rate of the 
system under generic correlated failures is then given by 
$\lambda_s = \lambda_{si} + \rho\lambda_{sc} = n\lambda + \rho r n\lambda = n\lambda(1 + \rho r)$. Note that $\lambda_{si}$, $\lambda_{sc}$, and 
$\rho$ are not the same as the $\lambda_i$, $\lambda_c$, and $p_e$ in the discussion of correlated failures due to error 
propagation, because they model different probabilities. The symbols $n$, $\lambda$ and $r$ have the same 
meanings in both models. 
Table 20: Parameters for Modeling Generic Correlated Failures 

| Symbol | Meaning |
|---|---|
| $\lambda_s$ | Failure rate of the entire system |
| $\lambda_{si}$ | Rate of independent failures in the system |
| $\lambda_{sc}$ | Rate of correlated failures in the system |
| $\lambda$ | Independent failure rate per node |
| $r$ | Increased failure rate due to correlated failures |
| $\rho$ | Correlated failure coefficient |
| $n$ | Number of nodes |
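The system failure rate under generic correlated failures can be evaluated directly from these parameters. The sketch below uses hypothetical values chosen only for illustration (1024 nodes, a 25-year per-node MTTF, a coefficient of 0.01 and a factor of 600) and assumes rates per hour with a 365-day year.

```python
def generic_system_failure_rate(n: int, lam: float, rho: float, r: float) -> float:
    """lambda_s = lambda_si + rho * lambda_sc = n*lam*(1 + rho*r)."""
    independent = n * lam       # lambda_si
    correlated = r * n * lam    # lambda_sc
    return independent + rho * correlated

lam_per_hour = 1.0 / (25 * 8760.0)   # per-node rate for a 25-year MTTF
baseline = 1024 * lam_per_hour       # system rate with independent failures only
rate = generic_system_failure_rate(n=1024, lam=lam_per_hour, rho=0.01, r=600.0)
```

With these assumed inputs the system failure rate is seven times the independent-only rate, even though the coefficient itself is small.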
A.7. Experiment Setup 
We use the Möbius modeling environment [99] for creating and simulating the SANs. 
Steady-state simulation is used with an initial transient period of 1000 hours to allow the system to 
enter the steady state. The confidence level is 95%. Unless otherwise specified, the parameter 
values used in the simulation are given in Table 21. The table also explains the rationale for 
choosing these parameter values based on field data or projections of future systems. 
Table 21: Experiment Parameters 

| Parameter | Value/Range | Comments |
|---|---|---|
| Checkpoint interval | 15 mins to 4 hrs | Derived from other studies [78], and private communication with vendors |
| MTTF (Mean Time To Failure per node) | 1 to 25 years | Including software and hardware failures recovered from checkpoint. 1 year for ASCI Q and 25 years for IBM mainframes. |
| MTTR (system-wide Mean Time To Recovery of compute nodes) | 10 minutes | Average time for all compute nodes to read checkpoint and reinitialize themselves. |
| MTTR of I/O nodes | 1 minute | Time to restart the I/O nodes |
| Number of compute processors | 8K to 256K | Projection of current and future supercomputers |
| MTTQ (per-node Mean Time to Quiesce) | 0.5 to 10 s | Time to close I/O and network file handles, clean up states, and perform computation until reaching a safe point |
| Broadcast overhead | 1 ms | E.g. data for hardware broadcast trees in Blue-Gene/L [78] |
| Software overhead for transmission | 1 ms | Measurement of message latency in TCP/IP and UDP |
| Period of I/O-compute cycle in application | 3 minutes | Experimental data on I/O characteristics of parallel applications [100] |
| Fraction of computation | 0.88 to 1.0 | Experimental data on I/O characteristics of parallel applications [78] |
| Timeout value | 20 secs to 2 min | The period for the master to timeout and cancel the checkpointing. |
| Probability of correlated failure | 0 to 0.2 | Experimental data on correlated failures, e.g. [87] |
| Correlated failure rate | (1/MTTF) × (100 to 1600) | Projections on error propagation within a locally-federated cluster of nodes in the supercomputer |
| Correlated failure window | 3 mins | Experimental data for persistence of correlated failures in the system due to error propagation |
| System reboot time | 1 hr | Anecdotal evidence for startup time of a large cluster |
| Aggregate bandwidth between compute nodes and one I/O node | 350 MBps | E.g. Blue-Gene/L field data [78] |
| Number of compute nodes per I/O node | 64 | E.g. Blue-Gene/L field data [78] |
| File system bandwidth per I/O node | 1 Gbps | E.g. Blue-Gene/L field data [78] |
| Checkpoint size per node | 256 MB | E.g. Blue-Gene/L field data [78] |
| Average size of I/O data per node | 10 MB | Experimental data on typical characteristics of parallel applications [100] |
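The bandwidth and checkpoint-size entries of Table 21 determine the two checkpoint latencies used in the model. The sketch below is a back-of-the-envelope estimate under stated assumptions: the 64 compute nodes attached to one I/O node share the 350 MBps aggregate link, 1 Gbps is taken as 125 MB/s (decimal units), and protocol overheads are ignored. The function names are hypothetical.

```python
def dump_time_s(nodes_per_io: int, ckpt_mb: float, link_mb_per_s: float) -> float:
    """Time for all compute nodes behind one I/O node to dump their state."""
    return nodes_per_io * ckpt_mb / link_mb_per_s

def write_time_s(nodes_per_io: int, ckpt_mb: float, fs_gbit_per_s: float) -> float:
    """Time for one I/O node to write the buffered checkpoint to the file system."""
    fs_mb_per_s = fs_gbit_per_s * 1000.0 / 8.0   # assumption: decimal gigabits
    return nodes_per_io * ckpt_mb / fs_mb_per_s

dump_s = dump_time_s(64, 256.0, 350.0)    # foreground: compute nodes are blocked
write_s = write_time_s(64, 256.0, 1.0)    # background: compute nodes keep running
```

Under these assumptions the foreground dump takes well under a minute while the background write takes a little over two minutes, so only the shorter of the two delays the computation.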
A.8. Experiment Results 
Simulations are conducted to study the behavior of the model. As the modeled system is 
complicated and there are multiple mechanisms/parameters present, we study the system by 
analyzing the effect of one feature at a time. Hence, the base model without coordination or 
correlated failures (but with failures during checkpointing and recovery) is first studied to 
understand the basic system behavior. Then we study the effects of coordination and correlated 
failures. The following two metrics are used to evaluate system performance.  
• Useful work fraction: fraction of time the system makes forward progress towards the 
completion of the job. It does not include work that is repeated due to failures. 
• Total useful work: the product of useful work fraction and the number of compute processors. 
It indicates how many processors of the same kind are required to achieve the same performance 
assuming failure-free computation. 
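The two metrics can be written down directly. The sketch below uses hypothetical function names and hypothetical input values; it only restates the definitions given above.

```python
def useful_work_fraction(useful_time: float, total_time: float) -> float:
    """Fraction of time spent making forward progress (repeated work excluded)."""
    return useful_time / total_time

def total_useful_work(fraction: float, num_processors: int) -> float:
    """Equivalent number of failure-free processors delivering the same work."""
    return fraction * num_processors

# Hypothetical example: 500 useful hours out of 1000 on 128K processors.
fraction = useful_work_fraction(500.0, 1000.0)
equivalent_processors = total_useful_work(fraction, 128 * 1024)
```

In this example, 128K processors at a useful work fraction of 0.5 deliver the work of 64K failure-free processors.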
A.8.1. Study of Base Model 
For the base model we assume independent failures and consider the coordination time to be a 
fixed ‘quiesce’ time. The system performance is analyzed for a range of parameters, including the 
number of processors, checkpoint interval, MTTF per node and MTTR of the system. The values 
assumed for these parameters are: 
• Number of processors per node: 8 
• MTTF per node: 1 year 
• MTTR of the system: 10 minutes 
• Number of processors: 64K 
• Checkpoint Interval: varied from 15 minutes to 4 hours 
We report results for the number of processors instead of the number of nodes in this study so that 
they can be easily scaled to different number of processors per node. The major results are as 
follows: 
• For a given checkpoint interval (30 mins), MTTR (10 mins) and MTTF (1 yr per node), 
there is an optimum number of processors (128 K) for which total useful work done by the system 
is maximized. Adding more processors than this optimum value will hurt the system performance 
due to failure effects.25 The range of processors considered in this analysis is from 8K to 256K, 
and the optimum value of the number of processors varies from 128K to 32K as the MTTR varies 
from 10 minutes to 80 minutes. 
 For the system to be scalable, checkpoints should be taken on the granularity of minutes 
(15-30 min), rather than hours as is the current practice. While in theory there is an optimal 
checkpoint interval, for any practical range there is no ‘optimal’ checkpoint interval for which the 
useful work is maximized, contrary to what several other studies have shown [80], [81]. This is 
because the overhead of checkpointing is relatively low in our system as the checkpoint writing is 
done in the background, and the effect of failures dominates the effect of taking checkpoints 
frequently.  
                                                        
25
 Elnozahy et al. [84] also make a conjecture that increasing the number of nodes beyond a certain extent hurts 
performance, but do not quantify the extent of the performance loss. 
• Even when the useful work is maximized, the overall useful work fraction is no more than 
50% for an MTTF per node of 1 year. Hence, at least 50% of the system’s resources are spent 
in checkpointing and recovering from failures. 
• If the number of processors per node is increased to 32 instead of 8, and the per-node MTTF 
is kept at 1 year, it is possible to increase the total useful work for the same 
number of nodes. This is because more compute power is provided per node, while maintaining 
the same failure rate. The optimum number of processors is in the range of 500K. However, the 
useful work fraction is unaltered, as the system failure rate, which depends only on the number of 
nodes and the per-node failure rate, is the same. 
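The dependence of the system failure rate on the number of nodes can be made concrete with a short calculation. The sketch below assumes independent node failures with constant rates, so that the system failure rate is the sum of the per-node rates; the function name and default values are illustrative and are not taken from the simulation model itself.

```python
HOURS_PER_YEAR = 365 * 24


def system_mttf_hours(num_processors, processors_per_node=8, node_mttf_years=1.0):
    """System MTTF under independent node failures (assumption).

    With N nodes failing independently at a constant rate, the system
    failure rate is N times the per-node rate, so the MTTF shrinks by N.
    """
    nodes = num_processors / processors_per_node
    return node_mttf_years * HOURS_PER_YEAR / nodes


for processors in (8 * 1024, 64 * 1024, 128 * 1024, 256 * 1024):
    mttf = system_mttf_hours(processors)
    print(f"{processors:>7} processors -> system MTTF of about {mttf * 60:.0f} minutes")
```

Under these assumptions the system MTTF at 64K processors is on the order of one hour, which is consistent with the observation that hour-scale checkpoint intervals lose too much work. The same function also shows why raising the number of processors per node at a fixed node count leaves the failure rate, and hence the useful work fraction, unchanged.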
Variation of total useful work with number of processors. Figure 35a, c, and e show the variation 
of total useful work with different number of processors. In all three figures, there is an optimum 
value of the number of processors for which total useful work is maximized. The rationale behind 
this is as follows: on one hand, more processors provide higher computing power for the job, 
while on the other hand more processors incur more frequent failures and hence more 
computation is wasted due to failures. For small numbers of processors the former factor 
dominates, while for sufficiently large numbers of processors the latter outweighs the former. 
Consider how the optimum number of processors varies with the MTTF, MTTR, and checkpoint 
interval.  
• The optimum value decreases with smaller MTTFs as shown in Figure 35a (from 128K 
processors for an MTTF of 1 year per node to 64K processors for an MTTF of 0.5 years per node).  
 
Figure 35: Study of Base Model 
(a) Useful work vs. number of processors for different MTTFs (MTTR = 10 mins, checkpoint interval = 30 mins); curves for MTTF = 0.125, 0.25, 0.5, 1, and 2 yrs 
(b) Useful work vs. checkpoint interval for different numbers of processors (MTTF per node = 1 yr, MTTR = 10 mins); curves for 8192 to 262144 processors 
(c) Useful work vs. number of processors for different MTTRs (MTTF per node = 1 yr, checkpoint interval = 30 mins); curves for MTTR = 10, 20, 40, and 80 mins 
(d) Useful work vs. checkpoint interval for different MTTRs (MTTF per node = 1 yr, number of processors = 65536); curves for MTTR = 10, 20, 40, and 80 mins 
(e) Useful work vs. number of processors for different checkpoint intervals (MTTF per node = 1 yr, MTTR = 10 mins); curves for intervals of 15, 30, 60, 120, and 240 mins 
(f) Useful work vs. checkpoint interval for different MTTFs (MTTR = 10 mins, number of processors = 65536); curves for MTTF per node = 1, 2, 4, 8, and 16 yrs 
(g) Variation of total useful work with number of nodes, number of processors per node = 32; curves for MTTF per node = 1 and 2 yrs 
(h) Variation of total useful work with number of nodes, number of processors per node = 16; curves for MTTF per node = 1 and 2 yrs 
• The optimum value decreases with larger MTTRs (from 128K processors for an MTTR of 
20 minutes to 64K processors for an MTTR of 40 minutes) as shown in Figure 35c.  
• The optimum value decreases with larger checkpoint intervals as shown in Figure 35e (from 
128K processors for a checkpoint interval of 30 minutes to 64K processors for a checkpoint 
interval of 60 minutes).  
This is because smaller MTTFs increase the failure rate, larger MTTRs increase the penalty of a 
failure, and larger checkpoint intervals cause more work to be lost upon a failure. All three 
aggravate the effects of failures, thus lowering the equilibrium point between the computing 
power and the failure effect.  
Variation of total useful work with checkpoint intervals. Figure 35b, d and f show the variation 
of total useful work for different checkpoint intervals. The results indicate that for a large-scale 
supercomputing system there is no optimum value of the checkpoint interval within the range of 
values considered (15 minutes to 4 hours). This contradicts previous studies [80], [81] which have 
shown the existence of an optimum value of the checkpoint interval. This is because the loss of 
job computation due to failures in large-scale systems outweighs the overhead of frequent 
checkpointing, as our checkpoint overhead is low. The theoretical optimum value of the 
checkpointing interval is less than 15 minutes. However, checkpoint intervals less than 15 minutes 
are not considered because checkpoints as frequent as these may overwhelm the I/O subsystem 
and the network and hence, are not practical. 
Further, the total useful work is approximately constant for checkpoint intervals between 15 and 30 
minutes, but decreases sharply as the checkpoint interval is increased beyond 30 minutes. For an 
MTTF of 8 years, the total useful work decreases only from 43,000 job units (see footnote 26) to 40,000 job units 
when the checkpoint interval is increased from 15 minutes to 30 minutes, but drops to 30,000 job 
units when the checkpoint interval is increased to 60 minutes. This suggests that current 
checkpoint intervals at the granularity of hours are not appropriate for large-scale systems because 
of the high system failure rate. The checkpoint intervals should be between 15 and 30 minutes.  
Useful Work Fraction. The discussion above only uses total useful work as the performance 
metric. The useful work fraction steadily decreases as the number of processors increases. This is 
because the greater number of processors does not contribute to the useful work fraction, and the 
failure effect degrades the useful work fraction. So, even when the maximum total useful work is 
achieved at the optimum number of processors, the useful work fraction is still small. For example, 
for an MTTF of 1 year per node in Figure 35a, the peak of total useful work is obtained with 128K 
processors, for which the useful work fraction is only about 56000/131072 = 42.7%, i.e., over 50% 
of system time is spent in handling failures. Thus, the overall failure rate of the system must 
substantially decrease for the useful work fraction to improve significantly. 
Effect of Increasing Number of Processors per Node. So far, we have assumed that each node 
has 8 processors, and that the MTTF of a node is 1 year. In the future, advances in semiconductor 
and processor technology may allow 16 or 32 processor cores to be integrated on a single node, 
while maintaining the same MTTF per node of 1 year. We studied the variation of total useful 
work with the number of nodes when each node has 32 and 16 processors, respectively, for a 
per-node MTTF of 1 and 2 years. For fair comparison, the number of processors is fixed at 1000K. 
The results are shown in Figure 35g and Figure 35h and are summarized as follows: 
                                                        
26
 One job unit is the amount of work done by a failure-free processor without checkpointing in unit time. 
• The optimum number of processors is obtained by multiplying the number of nodes by the 
number of processors per node. The optimum number of processors is now in the range of 500K 
to 1000K processors. 
• For a given MTTF, the optimum number of nodes increases with the number of processors 
per node, as more compute power is provided at the same failure rate. 
• For a given number of processors per node, the optimum number of nodes increases as the 
MTTF increases, because the failure effect is less dominant. 
This reinforces the earlier observation that integrating more processors per node and maintaining 
the same node failure rate increases the total useful work. However, the useful work fraction 
remains the same (still less than 50%) as it depends only on system failure rate, which in turn 
depends only on number of nodes and the MTTF per node. 
Effect of failures during checkpointing/recovery. We also studied the effects of failures during 
checkpointing/recovery on system performance. We observed that they do not exert as significant 
an effect as failures during computation on useful work fraction, because the duration of 
checkpointing/recovery is much smaller than that of computation, and hence, incurs less loss of 
useful work. Detailed analysis of failures during checkpointing/recovery is not presented. 
For the remainder of Section A.8, we assume an increased per-node MTTF of 3 years as otherwise 
the failure effects dominate the system performance for large numbers of nodes. An MTTF of 3 
years corresponds to a per-processor MTTF of 24 years for our system consisting of 8 processors 
per node, which is close to the 25-year MTTF of IBM mainframes reported in the literature [96]. 
 
A.8.2. Effect of Coordination 
The coordinated checkpointing protocol requires all the compute processors to arrive at a safe 
point to take the checkpoint, and a timeout is used to avoid waiting indefinitely. This is not 
considered in the base model. This section first investigates the pure coordination effect without 
the timeout mechanism or failures, and then combines them into the study. Three main points are 
observed from the results: 
• Coordination does not affect the system performance significantly, as the coordination effect 
is logarithmic in the number of compute processors (Figure 36), because we assume the 
processors have identical exponentially-distributed quiesce times. So coordination scales well for 
practical systems. 
• Combination of timeout and coordination behaves like a probabilistic checkpoint-abort. 
Small timeouts (80 s or less in Figure 37) hurt useful work fraction, whereas large timeouts (100 s 
or larger) do not significantly degrade useful work fraction.  
• As long as the coordination timeout is equal to or larger than a threshold value, the system 
performance is insensitive to the timeout value. The threshold value is fairly small for practical 
systems (100 s in our experiment).  
Coordination only. We assume that all the processors have identical, exponentially-distributed 
quiesce times with a mean of MTTQ (mean time to quiesce per processor). Figure 36 illustrates 
the pure coordination effect on useful work fraction for different MTTQs. Failures and timeouts 
are not considered. According to the figure, the coordination effects are logarithmic in the number 
of compute processors. This is because an identical exponential distribution is assumed for each 
processor. Moreover, the rate of increase of coordination time (or overall quiesce time) is 
proportional to the MTTQ, and the coordination effect is also proportional to the checkpoint 
frequency (figures not shown here). 
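The logarithmic growth follows from a standard property of the exponential distribution: if the coordination time is taken to be the maximum of n independent exponential quiesce times with mean MTTQ, its expectation is MTTQ times the n-th harmonic number, which is approximately MTTQ (ln n + 0.5772). The sketch below illustrates this under that assumption; the function names are hypothetical and the code is not the simulation model used in this appendix.

```python
import math

EULER_GAMMA = 0.5772156649015329


def expected_coordination_time(mttq, n):
    """Expected maximum of n i.i.d. exponential quiesce times with mean mttq.

    E[max] = mttq * H_n, where H_n = 1 + 1/2 + ... + 1/n.
    """
    return mttq * sum(1.0 / k for k in range(1, n + 1))


def log_approximation(mttq, n):
    """Large-n approximation: mttq * (ln n + Euler's constant)."""
    return mttq * (math.log(n) + EULER_GAMMA)


for n in (1024, 8192, 65536):
    exact = expected_coordination_time(10.0, n)
    approx = log_approximation(10.0, n)
    print(f"n = {n:>6}: exact = {exact:.2f} s, logarithmic approximation = {approx:.2f} s")
```

Doubling the number of processors adds only about MTTQ × ln 2 to the expected coordination time, and the whole curve scales linearly with MTTQ, which matches the two trends described above.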
 
Effects of failures and timeouts. Figure 37 shows the system performance in the presence of 
failures with an MTTF of 3 years per node, checkpoint interval of 30 minutes, and MTTQ of 10 
seconds. We use “no coordination” to indicate the case when no variation in the quiesce times 
among the compute processors is assumed and the quiesce time of the system as a whole is 
exponentially distributed with a mean of 10 seconds. 
Figure 37 shows that the coordination without a timeout mechanism does not significantly 
degrade system performance, because the only additional overhead is the small coordination time. 
If a timeout is applied, the master may time out before the coordination is completed and abort the 
checkpointing. Then if a failure occurs in the next computation interval, it causes the computation 
completed in the last interval to be lost. So the combination of coordination and timeout actually 
behaves like a probabilistic checkpoint-abort mechanism. The probability depends on the 
coordination time (MTTQ and number of compute processors) and the timeout. Small timeouts 
incur large probabilities of checkpoint abortion, and the benefit of limiting the processors’ waiting 
time is offset by the loss of work due to frequent checkpoint abortions. The drastic curve drops for 
timeouts of 20-100 seconds in Figure 37 clearly show the performance degradation.  
Figure 36: Effects of coordination on system performance and scalability (no timeouts or failures) 
[Plot: useful work fraction vs. number of processors, checkpoint interval = 30 min; curves for MTTQ = 10 s, 2 s, and 0.5 s] 
Figure 37: Effects of coordination timeout on system performance and scalability (with failures) 
[Plot: useful work fraction vs. number of processors, MTTF per node = 3 yrs, checkpoint interval = 30 min; curves for no coordination, no timeout, and timeouts of 120 s, 100 s, 80 s, 60 s, 40 s, and 20 s] 
Figure 37 also shows that the system is insensitive to timeouts provided they are large enough 
because the overall coordination time increases slowly with the number of processors. For 
example, the 8192-processor system’s performance with a timeout of 100s is only slightly better 
than a timeout of 120 s and no timeout. 
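The probabilistic checkpoint-abort behavior can be illustrated with a simplified calculation. The sketch assumes that a checkpoint is aborted exactly when the slowest of n processors, each with an independent exponential quiesce time of mean MTTQ, has not quiesced by the timeout. This is an illustrative reading of the mechanism, not the simulation model itself, and the function name is hypothetical.

```python
import math


def abort_probability(timeout, mttq, n):
    """P(max of n i.i.d. exponential quiesce times exceeds the timeout).

    P(all n quiesce in time) = (1 - exp(-timeout / mttq)) ** n; log1p and
    expm1 keep the result accurate when the per-processor tail is tiny.
    """
    log_all_in_time = n * math.log1p(-math.exp(-timeout / mttq))
    return -math.expm1(log_all_in_time)


for timeout in (20, 40, 60, 80, 100, 120):
    p = abort_probability(timeout, mttq=10.0, n=8192)
    print(f"timeout = {timeout:>3} s -> abort probability of about {p:.3f}")
```

Under these assumptions the abort probability falls off very steeply once the timeout exceeds the expected coordination time, which gives one intuition for why performance becomes insensitive to the timeout beyond a threshold.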
A.8.3. Effect of Correlated Failures 
We recall that there are two categories of correlated failures considered in the appendix.  
Correlated failures due to error propagation only. Correlated failures due to error propagation 
are modeled with three parameters: probability of correlated failure (pe), frate_correlated_factor 
(r) and correlated failure window. As shown in Section A.6, a typical value of r in real systems is 
of the order of a few hundred. In our experiments, r values of 400, 800 and 1600 are used for 
various pe values, with a correlated failure window of 3 minutes.  
The results of correlated failures in Figure 38 show that useful work fraction is not susceptible to 
correlated failures due to error propagation (ranging between 0.51 and 0.56 in the figure), because 
we assume these failures only occur during recovery, and we observed that failures during 
recovery do not exert a significant effect on useful work fraction.  
Generic correlated failures. Generic correlated failures are modeled with two parameters: 
correlated failure factor (r) and correlated failure coefficient (ρ). An r value of 400 and a ρ value of 
0.0025 are used in our experiment, so the entire system failure rate is doubled by generic 
correlated failures. The results illustrated in Figure 39 show that, unlike correlated failures due to 
error propagation, there is a large performance degradation when generic correlated failures are 
present, and the performance degradation prevents the system from scaling well. For a system 
comprising 256K processors with an MTTF of 3 years per node, the useful work fraction is reduced 
by 0.24 (51%). 
Figure 38: Impact of correlated failures due to error propagation 
[Plot: useful work fraction vs. probability of correlated failure, MTTF per node = 3 yrs, number of processors = 256K, correlated failure window = 3 min; curves for frate_correlated_times = 400, 800, and 1600] 
Figure 39: Impact of generic correlated failures 
[Plot: useful work fraction vs. number of processors, MTTF per node = 3 yrs, correlated failure coefficient = 0.0025, correlated failure factor = 400, checkpoint interval = 30 min; curves with and without correlated failure] 
A.9. Conclusions  
This appendix models a large-scale supercomputing system with coordinated checkpointing and 
rollback recovery. Unlike existing models in the literature, failures during checkpointing/recovery, 
coordination for checkpointing and correlated failures are included in the model. The impact of 
these factors on system performance (measured as useful work fraction and total useful work) as 
well as the scalability of systems with several hundred thousand processors is studied by 
simulating the model. The major conclusions from this study include: 
i. For a given checkpoint interval, MTTR and MTTF, there is an optimum number of 
processors for which total useful work done by the system is maximized, e.g., for an MTTF 
per node of 1 year and an MTTR of 10 minutes, it is around 128K. 
ii. The overall useful work fraction is relatively low because of the effect of failures in 
large-scale systems. 
iii. Correlated failures must be taken into account as they degrade the performance and limit the 
system scalability. 
 
APPENDIX B 
CHECKPOINTING OF CONTROL STRUCTURES IN MAIN 
MEMORY DATABASE SYSTEMS  
 
B.1. Introduction 
Main memory database (MMDB) systems store data permanently in main memory, and 
applications can access the data directly [101]. This offers high-speed access to shared data for 
applications such as real-time billing, high-performance web servers, etc. However, it also makes 
database systems highly vulnerable to application errors/failures, as the database is directly 
mapped into the application’s address space. 
In addition to user data, the database maintains the control structures (e.g., lock/mutex tables and 
file tables) necessary for data operation. A database management system (DBMS) maintains data 
integrity, including recovery in the case of an error/failure of either applications or database 
services. However, the integrity of control structures is often not well maintained due to the less 
uniform interfaces to control structures (compared with those to user data). As a result, errors in 
control structures can become a major cause of system downtime and, hence, an availability 
bottleneck. 
In this appendix, we propose and evaluate an application-transparent, low-overhead checkpointing 
strategy for maintaining the consistency of control structures in a commercial MMDB. The 
proposed solution is based on the ARMOR architecture and an ARMOR runtime infrastructure 
[102], [103]. It eliminates (or significantly reduces) cases requiring database major recovery, a 
lengthy process that can take tens of seconds and adversely impact availability. Importantly, the 
approach can be adapted relatively easily to other applications, and the ARMOR runtime support 
creates a foundation for providing system-wide error detection and recovery. This work makes the 
following contributions: 
• Introduction of a framework to provide support for checkpointing of MMDB control 
structures. 
• Design and implementation of two checkpointing algorithms. (i) Incremental checkpointing: 
(a) At runtime, the post-transaction state (upon transaction completion) of the control structure(s) 
accessed by each write transaction (one that updates the control structures) is collected and merged 
with the current checkpoint. (b) At recovery time, the checkpoint is used directly to restore the 
correct state. (ii) Delta checkpointing: (a) At runtime, the pre-transaction state (before any updates occur) 
of the control structure(s) accessed by a given transaction (either write or read-only) is 
preserved as the current checkpoint (delta). (b) At recovery time, the current state of the control 
structures in the shared memory is merged with the delta checkpoint to restore the state. 
• Performance evaluation of the proposed checkpointing algorithms. The data show that for a 
rather harsh workload of 60% write transactions, the performance overhead varies in the range of 
1% to 10%, depending on the frequency of transactions. 
• Database availability estimation under different frequencies of crashes that require major 
recovery. The data show that under the error rate of one crash per week, a checkpointing-based 
solution provides about five nines of availability, one nine more than the baseline system. 
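To give a feel for how a shorter recovery translates into an extra nine, the following back-of-the-envelope sketch computes steady-state availability from a crash rate and a per-crash downtime. The downtime values used here (50 s for a major recovery and 5 s for a checkpoint-based recovery) are hypothetical placeholders chosen only for illustration; they are not measurements from this study.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600


def availability(crashes_per_week, downtime_per_crash_s):
    """Fraction of time the database is up, given a crash rate and downtime."""
    return 1.0 - crashes_per_week * downtime_per_crash_s / SECONDS_PER_WEEK


def nines(avail):
    """Number of leading nines in an availability figure (0.9999x -> 4)."""
    return math.floor(-math.log10(1.0 - avail))


baseline = availability(1, 50.0)         # hypothetical major-recovery downtime
with_checkpoint = availability(1, 5.0)   # hypothetical checkpoint-based recovery
print(f"baseline: {baseline:.6f} ({nines(baseline)} nines)")
print(f"with checkpointing: {with_checkpoint:.6f} ({nines(with_checkpoint)} nines)")
```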
B.2. Target System Overview 
Target system. The target system in this study is a commercial relational MMDB intended to 
support development of high-performance, fault-resilient applications requiring concurrent access 
to shared data [104], [105]. The process accessing the shared data can be either a client or a 
database service. A service is a process that performs functions to assist the proper processing of 
transactions. For example, the cleanup service detects failures of connected clients/services and 
performs recovery (including launching major recoveries). 
In addition to supporting user data, the database supports control structures (SysDB) necessary for 
correct operation. Figure 40 depicts the example architecture of SysDB containing three tables: (i) 
the process table, which maintains process ids and mutex lists for each process as well as 
information on database mapping into the process address space, (ii) the transaction table, which 
maintains logs and locks for active transactions, and (iii) the file table, which keeps user database 
files. Each client/service process maps SysDB into its own address space before accessing the 
database. 
Reliability problem. The error model we address in this appendix is the inconsistency of control 
structures due to the abnormal termination (crash) of one of the clients or services (see footnote 27). Upon such 
a crash, the target system denies services to all other user processes and restarts the entire database 
system; this is a major recovery. It may take tens of seconds, depending on the size of the data 
files, and can significantly degrade system availability (not acceptable for services provided to 
critical applications). A major reason for these problems is the way the database handles access to 
control structures in SysDB. The system employs multiple mutexes to guarantee the mutual 
                                                        
27
 In the current implementation, we do not detect silent corruption of data, i.e., incorrect data being written to the 
database. 
exclusion semantics in accessing control structures by user processes. When a client or service 
crashes while still holding a mutex, the database may remain in an inconsistent state. Since there 
is no way for the system to identify which updates the crashed client has made, the cleanup 
process restarts the database to bring the system back into a consistent state. Because major 
recovery imposes significant system downtime, an approach is needed to eliminate or reduce 
cases in which it is needed (see footnote 28). 
 
Figure 40: Example Control Structures (SysDB) 
B.3. ARMOR High-Availability Infrastructure 
The ARMOR infrastructure is designed to manage redundant resources across interconnected 
nodes, detect errors in both the user applications and the infrastructure components, and recover 
quickly from failures when they occur. ARMORs (Adaptive Reconfigurable Mobile Objects of 
Reliability) are multithreaded processes internally structured around objects called elements that 
contain their own private data and provide elementary functions or services. All ARMORs 
                                                        
28
 In this discussion, we consider only preserving the consistent state of control structures. Any inconsistency brought 
to the user data is handled/recovered by the default database services, such as two-phase commit, checkpointing, and 
logging. As long as SysDB consistency is preserved, the system can operate correctly and recover user data. 
contain a basic set of elements that provide a core functionality, including the abilities to (i) 
implement reliable point-to-point message communication between ARMORs, (ii) respond to 
“Are-you-alive?” messages from the local daemon, and (iii) capture ARMOR state. 
ARMORs communicate solely through message-passing. The ARMOR microkernel (present in 
each ARMOR process) is in charge of distributing messages between elements within the 
ARMOR and between the ARMORs present in the system. A message consists of sequential 
operations that trigger element actions. This modular, event-driven architecture permits the 
ARMOR’s functionality and fault tolerance services to be customized by choosing the particular 
set of elements that make up the ARMOR. Several ARMOR processes constitute the runtime 
environment, and each ARMOR plays a specific role in the detection and recovery hierarchy 
offered to the system and the application. The Fault Tolerance Manager (FTM), Heart Beat 
ARMOR (HB), and Daemons are fundamental components of an ARMOR-based infrastructure. 
For more details on ARMOR architecture the reader is referred to [102], [106], [103]. 
B.4. ARMOR-Based Checkpointing 
Embedded ARMORs. In most cases, an ARMOR launches the application and monitors its 
behavior. The application is treated as a black box, and only limited services can be provided by 
the ARMOR infrastructure, e.g., restart of the application process. In the embedded ARMOR 
solution, an application links the core structure of the ARMOR architecture (the ARMOR 
microkernel) and uses the ARMOR API to invoke/interface with the underlying element structure 
of an embedded ARMOR. The embedded ARMOR process appears (i) as a full-fledged ARMOR 
to other ARMORs and (ii) as a native application process to non-ARMOR processes (e.g., 
database clients). As a result, the application can take advantage of all services provided by 
ARMORs (e.g., adding or removing elements to customize ARMOR functionality) without 
having to change the original application’s organization. In this way, the application does not 
need to be rewritten, and only lightweight instrumentation with a few ARMOR APIs is needed to 
embed the ARMOR-stub into the application. 
ARMOR-based checkpointing. In order to expose ARMOR services, the database is 
instrumented in two ways: (i) ARMOR stubs are embedded in the database processes, facilitating 
communication channel(s) between the database server and the ARMOR infrastructure. (ii) 
Functionalities are embedded for checkpointing SysDB data structures; this modifies selected 
library functions of the database but preserves function interfaces and, hence, is transparent to 
clients.  
Figure 41 illustrates the basic architecture of the ARMOR infrastructure integrated with the target 
database. The FTM, FTM daemon, HB, and Daemon constitute the skeleton of the ARMOR 
infrastructure. The solid lines are the ARMOR communication channels. An ARMOR element 
called the image keeper is embedded into the Daemon ARMOR to maintain the image 
(checkpoint) of the data structures in SysDB. When a client or service process opens the database, 
the database kernel library creates an Embedded ARMOR (EA) stub within the process and 
establishes the communication channel between the EA and the Daemon. From then on, the 
checkpoint data can be transmitted directly from the source process (with its EA stub) to the 
destination ARMOR, which maintains the image in memory. The image is then stored on disk by 
ARMOR’s checkpoint mechanism. 
 
Figure 41: Basic Checkpointing Architecture 
Arrows in Figure 41 depict the data flow during system operation. Each client or service, when it 
acquires a mutex (or releases a mutex, depending on the checkpointing strategy applied), sends 
the related checkpoint through the ARMOR communication channel to the image keeper. The 
image keeper processes the message according to the checkpointing strategy. If there is no error, 
the checkpoint reflects the latest consistent state of the SysDB. When a client/service crashes 
while holding a mutex, the cleanup service requests from the image keeper the saved correct copy 
of the relevant data structure(s). On the successful restoration of the data, the cleanup service 
allows the system to continue normal execution without invoking a major recovery. 
B.5. Checkpointing Algorithms 
This section discusses two algorithms for checkpointing control structures of the target database 
system: incremental checkpointing and delta checkpointing. We begin with a brief description, 
summarized in Table 22, of similarities and differences between the two proposed alternatives. 
Table 22: Comparison of Incremental and Delta Checkpointing Algorithms 
 
B.5.1. Incremental Checkpointing 
In the incremental checkpointing scheme, only updates (incremental changes) to data are sent to 
the image keeper. The basic algorithm is as follows: 
• After system initialization, the database server sends the image of all the control structures to 
the image keeper. 
• In the following processing, a client/service acquires a mutex and then performs operations 
on the control structures. On each write operation, any changes to the data are stored in the local 
buffer. 
• After all updates are successfully finished, the mutex is released, and the client/service 
delivers the buffered increments to the image keeper for maintaining the up-to-date checkpoint of 
the control structures. 
• Upon a crash while the mutex is held, the cleanup service requests from the image keeper the 
latest checkpoint data and restores the corrupted control structures. 
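The four steps above can be sketched in a few lines. This is a single-process illustration only: the class and method names (ImageKeeper, Client, release_mutex, cleanup_after_crash) are hypothetical and are not the target database's or ARMOR's API, the control structures are modeled as a plain dictionary, and real mutex acquisition and message passing are omitted.

```python
import copy


class ImageKeeper:
    """Keeps the checkpoint image of the control structures (hypothetical)."""

    def __init__(self, initial_image):
        # Step 1: the full image is sent once after system initialization.
        self.image = copy.deepcopy(initial_image)

    def merge(self, increments):
        # Step 3: buffered increments are merged into the current checkpoint.
        for section, contents in increments.items():
            self.image[section] = copy.deepcopy(contents)

    def latest_checkpoint(self):
        return copy.deepcopy(self.image)


class Client:
    """A client/service that updates control structures inside a mutex block."""

    def __init__(self, sysdb, keeper):
        self.sysdb = sysdb
        self.keeper = keeper
        self.buffer = {}

    def write(self, section, contents):
        # Step 2: apply the update and record the change in the local buffer.
        self.sysdb[section] = contents
        self.buffer[section] = contents

    def release_mutex(self):
        # Step 3: on mutex release, deliver the increments to the image keeper.
        self.keeper.merge(self.buffer)
        self.buffer = {}


def cleanup_after_crash(sysdb, keeper):
    # Step 4: restore the control structures from the latest checkpoint.
    sysdb.clear()
    sysdb.update(keeper.latest_checkpoint())
```

In this sketch, an update that was never followed by release_mutex (that is, the client crashed while holding the mutex) never reaches the image keeper, so restoring from the checkpoint discards it.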
Handling mutex overlaps. In some cases, a single section of control structures is protected by 
multiple mutexes. To properly handle this scenario, the image keeper maintains, in addition to the 
checkpoint, the mapping between a mutex and the data section(s) protected by this mutex. To 
assist in the mapping, the checkpoint increments sent to the image keeper piggyback the mutex id 
and the information necessary to identify the correct sections in the control structures.  
Figure 42 depicts an example configuration of control structure images kept in the image keeper 
and illustrates the mapping of mutexes to control structures. With this mapping, overlapped data 
sections can be protected and their consistency with the corresponding copies in SysDB can be 
preserved, even when multiple mutexes are acquired at the same time. 
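One way to picture the mutex-to-section mapping is the sketch below. It is an assumption-laden illustration: the class MappedImageKeeper, the string mutex ids, and the section ids "A", "B", "C" are hypothetical, and the real increments would carry whatever information the control structures need to identify a section.

```python
from collections import defaultdict


class MappedImageKeeper:
    """Hypothetical keeper that tracks which sections each mutex protects."""

    def __init__(self, full_image):
        self.image = dict(full_image)        # section id -> checkpointed value
        self.sections_of = defaultdict(set)  # mutex id -> protected section ids

    def apply_increment(self, mutex_id, section_id, value):
        # Each increment piggybacks the mutex id and the section it touches.
        self.sections_of[mutex_id].add(section_id)
        self.image[section_id] = value

    def checkpoint_for(self, mutex_id):
        # Sections to restore when the holder of mutex_id crashes.
        return {s: self.image[s] for s in self.sections_of[mutex_id]}


keeper = MappedImageKeeper({"A": 0, "B": 0, "C": 0})
keeper.apply_increment("m1", "A", 1)
keeper.apply_increment("m1", "B", 1)
keeper.apply_increment("m2", "B", 2)  # section B is protected by both m1 and m2
keeper.apply_increment("m2", "C", 2)

restore_m1 = keeper.checkpoint_for("m1")
print(restore_m1)
```

The overlapped section B is stored once and shared by both mappings, so a recovery triggered through m1 sees the value last committed under m2 rather than a stale per-mutex copy.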
Handling data access without mutex protection. The proposed algorithm works correctly as long 
as all updates to control structures are performed within the mutex blocks. There are, however, 
cases in which control structures are updated directly, without mutex protection, e.g., during 
database initialization (when it is assumed that no processes try to access the database). While this 
example is a rather benign case, practice shows that application developers often make somewhat 
arbitrary decisions and allow control structures to be accessed without mutex protection.29 
Handling such scenarios would require (i) locating, in the application code, all the places of 
potential updates outside mutex blocks and (ii) augmenting the implementation to ensure the 
checkpoint in the image keeper is up-to-date. This can be difficult given the size and complexity 
of real-world applications, such as our target database system. Delta checkpointing, discussed 
next, is an attempt to alleviate this problem. 
                                                        
29
 Identification of all cases of updates outside mutex blocks would require reviewing/profiling the entire code base. 
We could not do this due to limited access to the database. 
 
Figure 42: Control Structure Images 
B.5.2. Delta Checkpointing 
The delta checkpointing algorithm is based on the following assumption: In a correctly 
implemented system, any access to control structures outside the mutex blocks, after system 
initialization, does not violate data consistency. Consequently, crashes outside the mutex blocks 
do not cause data inconsistency, and sections of control structures not updated by any currently 
executing mutex block are always consistent.30 As a result, the image keeper does not need to 
maintain a copy of all control structures (as in incremental checkpointing, discussed in the 
previous section). It is sufficient to preserve data sections modified (plus information on the type 
and parameters of the update operation) while executing a given mutex block. In other words, the 
algorithm only needs to recognize, collect, and send to the image keeper the modified data section, 
delta (delta is a before-image). Upon a failure of a client/service while executing a mutex block, 
the primary copy of the control structures still exists in the shared memory. The entire image of 
the related structures (base) can then be delivered to the image keeper. Using delta and base, the 
image keeper computes the original (at the time of entering the mutex block) control structures 
                                                        
30
 Under this assumption, incremental checkpointing (Section B.5.1) still needs to determine all locations in the code 
where SysDB is updated. 
image (orig = base + delta) and sends it back to the cleanup process for recovering the data. The 
possible updates include data insertion, deletion, and replacement. In summary, the algorithm 
includes two basic steps: 
(1) When a client/service acquires a mutex, it sends the delta to the image keeper. The delta's 
content depends on the update to be performed in the mutex block. 
(2) Upon a crash while the mutex is held, the cleanup service sends the base to the image keeper. 
The image keeper merges the base with the saved delta and generates the valid image of control 
structures. The cleanup service requests from the image keeper the regenerated image and 
restores the corrupted control structures. 
Observe that the image keeper merges the base with the latest delta it receives. To avoid using the 
wrong delta in the case of recovery, it is important to send out a delta at each mutex acquisition, 
even if the delta is empty, i.e., no changes to the control structures were performed during the 
current mutex block. 
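The two steps, and the reason an empty delta must still be sent, can be sketched as follows. This is an illustration under stated assumptions rather than the actual design: DeltaKeeper, acquire, and the ABSENT sentinel are hypothetical names, the client is assumed to know in advance which sections it will modify, and the "+" in orig = base + delta is modeled as overlaying the saved before-images on the base.

```python
ABSENT = object()  # hypothetical marker: the section did not exist before the update


class DeltaKeeper:
    """Hypothetical keeper that stores only the latest delta (before-images)."""

    def __init__(self):
        self.latest_delta = {}

    def receive_delta(self, delta):
        # Always overwrite: an empty delta must replace a stale one.
        self.latest_delta = dict(delta)

    def regenerate(self, base):
        # orig = base + delta: overlay the before-images on the base.
        orig = dict(base)
        for section, before in self.latest_delta.items():
            if before is ABSENT:
                orig.pop(section, None)  # undo an insertion
            else:
                orig[section] = before   # undo a deletion or a replacement
        return orig


def acquire(shared, keeper, sections_to_update):
    # Step 1: on mutex acquisition, send before-images of the sections to be modified.
    delta = {s: shared.get(s, ABSENT) for s in sections_to_update}
    keeper.receive_delta(delta)


shared = {"A": 1, "B": 2}
keeper = DeltaKeeper()

acquire(shared, keeper, ["A", "B", "C"])
shared["A"] = 99   # replacement
del shared["B"]    # deletion
shared["C"] = 3    # insertion; the client crashes here with the mutex held

# Step 2: the cleanup service sends the base (current shared-memory image).
recovered = keeper.regenerate(shared)
print(recovered)

# A later block that changes nothing still sends an (empty) delta.
acquire(recovered, keeper, [])
unchanged = keeper.regenerate(recovered)
```

If the last call to acquire were skipped, the keeper would still hold the delta of the earlier block, and a crash in the later block would roll sections A, B, and C back to an outdated state.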
B.5.3. Image Keeper 
The image keeper is a separate element within the ARMOR process that collects and maintains 
checkpoint data representing the correct state of control structures. It is a passive component, 
which means that it is only invoked by the incoming messages (checkpoint updates) and that it 
performs proper actions according to the received messages. Figure 43 illustrates the basic 
structure of the image keeper, which consists of (i) a set of memory blocks for preserving images 
of control data structures (control structure images) and (ii) management support (manager) for 
updates of the checkpoint and recovery actions in response to client failures. The image keeper 
communicates with the database processes by means of the ARMOR communication channel. 
The control structure images in Figure 43 represent a memory pool that stores the images of 
control structures in SysDB. Different mutexes map their corresponding data sections into the 
copy of control structures in the image keeper (Figure 42). 
 
Figure 43: Structure of the Image Keeper 
B.6. Performance Evaluation 
This section presents performance measurements of the prototype implementation of the 
ARMOR-based incremental checkpointing scheme applied to the target database system.31 The 
testbed consists of a Sun Blade 100 workstation running the Solaris 8 operating system on top of a 
500 MHz UltraSPARC-II CPU with 128 MB of memory. The measurements are conducted in (i) 
error-free scenarios, in which normal operation of the database under a synthetic workload 
mimics actual database activity, and (ii) error-recovery scenarios, in which the database recovers 
from the checkpoint after a failure while executing transactions issued by synthetic clients. While 
                                                        
31
 Due to the limited time for accessing the target system, we provide measurements only for incremental 
checkpointing. Because checkpoint data transmission time dominates the performance of both schemes in error-free 
scenarios and both algorithms transmit a similar amount of data, it is expected that the performance of the two 
schemes is similar. 
checkpointing is applied in the context of the file table mutex, the proposed solution applies to 
other mutexes as well. 
B.6.1. Performance of ARMOR-based Incremental Checkpointing in Error-Free Scenarios 
Workload. Each workload invocation involves execution of a sequence of transactions, which 
arrive with a predefined frequency, and each transaction is represented as a set of operations of 
variable execution time. Some of the operations need to acquire the file table mutex (mutex) to 
preserve mutual exclusion in accessing shared data by multiple clients. 
Operations associated with mutex acquisition can be either data write (write) or data read (read). 
The operation pattern within a transaction is a sequence of alternate reads and writes, e.g., read – 
write – read – write – …. The sleep function is used to emulate the execution time of operations 
that (i) do not require mutex acquisition (mutex-free operations), (ii) occur when the mutex is held 
(mutex operations), or (iii) represent idle time, i.e., the period after completion of a current 
transaction and before arrival of the next. 
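A single synthetic transaction of this kind might look like the sketch below. It is only an approximation of the workload described here: run_transaction and deliver are hypothetical names, the parameter names num_acq, mutex_op, and other_op follow the workload parameters defined for the experiments, the mutex-free work is lumped at the end of the transaction, and idle time between transactions is left out.

```python
import threading
import time


def run_transaction(mutex, num_acq, mutex_op, other_op, deliver, sleep=time.sleep):
    """Run one synthetic transaction and return the operations performed."""
    ops = []
    for i in range(num_acq):
        kind = "read" if i % 2 == 0 else "write"  # read, write, read, write, ...
        with mutex:
            sleep(mutex_op)   # emulate processing while the mutex is held
        if kind == "write":
            deliver()         # checkpoint data is delivered after a write block
        ops.append(kind)
    sleep(other_op)           # emulate mutex-free operations
    return ops


slept = []
delivered = []
ops = run_transaction(
    threading.Lock(),
    num_acq=5,
    mutex_op=0.002,
    other_op=0.03,
    deliver=lambda: delivered.append(1),
    sleep=slept.append,       # record durations instead of actually sleeping
)
print(ops, len(delivered), sum(slept))
```

Passing the sleep function as a parameter keeps the sketch testable without real delays; with the default time.sleep the same function emulates the timing behavior.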
Parameters. The workload is flexible and can be configured to mimic actual execution scenarios. 
The tunable workload parameters (and other experiment settings) are as follows: 
• Transaction frequency (freq) – number of transactions arriving within one second. 
• Number of mutex acquisitions per transaction (num_acq). 
• Percentage of read-only operations (read_per) – fraction of mutex acquisitions for read-only 
operations. 
• Mutex operation time (mutex_op) – processing time while holding the mutex, i.e., the interval 
between the time the mutex is granted and the time it is released.  
• Mutex-free operation time (other_op) – processing time for mutex-free operations within the 
transaction. 
• Delivered data – the amount of checkpointed data. In the case of the file table mutex, the 
typical data to be checkpointed is a single file entry in the file table (approx. 3700 bytes). In all 
measurements, the size of the checkpointed data is assumed to be 4000 bytes. 
• Experiment duration – time duration of the experiment. 
The transaction frequency should satisfy the following requirement for experiments to run correctly: 
1/freq >= mutex_op * num_acq + other_op + time{get/release mutexes + checkpointing} 
Results. The workload configuration parameters for performance measurements are as follows: 
num_acq=5, mutex_op=0.002 s, other_op=0.03 s, delivered data=4000 bytes, experiment 
duration=20 s. 
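As a quick check of the frequency requirement against these parameter values, the sketch below evaluates the right-hand side of the inequality. The function names and the overhead argument are illustrative assumptions; overhead stands for the time{get/release mutexes + checkpointing} term and is set to zero here, so the result is only an upper bound on the admissible frequency.

```python
def min_period(num_acq, mutex_op, other_op, overhead):
    """Right-hand side of the requirement: minimum time per transaction [s]."""
    return mutex_op * num_acq + other_op + overhead


def freq_is_feasible(freq, num_acq, mutex_op, other_op, overhead):
    """True if 1/freq leaves enough time for one transaction."""
    return 1.0 / freq >= min_period(num_acq, mutex_op, other_op, overhead)


# Values from the configuration above; overhead=0.0 is a placeholder.
params = dict(num_acq=5, mutex_op=0.002, other_op=0.03, overhead=0.0)
period = min_period(**params)
print(f"minimum period: {period:.3f} s, upper bound on freq: {1 / period:.1f}/s")
print(freq_is_feasible(20, **params), freq_is_feasible(30, **params))
```

With these values the mutex and mutex-free operations alone take 0.04 s per transaction, so the transaction frequency must stay below 25 per second, and lower still once mutex handling and checkpointing time are included.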
Table 23 shows the time per transaction (with 95% confidence intervals) for four transaction 
frequencies and with read-only percentages (read_per) ranging from 0% to 100%. The transaction 
time includes mutex-free operations, mutex operations, mutex acquisition/release, and 
checkpointing time. Table 24 depicts the performance overhead of checkpointing per transaction. 
The case of read_per=100% is the one without checkpointing, and hence, it is the baseline against 
which the overheads in other scenarios are computed and compared. 
Table 23: Transaction Time [s] 
 
Table 24: Performance Overhead of Checkpointing per Transaction [s] 
 
From the results, one can see that as the transaction arrival rate increases, the performance overhead 
and, hence, the transaction time increase. This is because when there are more requests, 
ARMORs take more time to process each individual request. However, the frequency increase 
does not significantly degrade the performance; the overhead ranges from 1 ms at 80% read-only 
acquisitions to 28 ms at 0% (all mutex acquisitions are writes). If more than half of the mutex 
acquisitions are read-only, the performance overhead is very small. Note that the variation of the 
run time without checkpointing (last column in Table 23) can dominate the performance overhead 
(columns 5 and 6 in Table 23). In real applications, more than 50% of mutex acquisitions are for 
read-only operations; the measurement data indicate that under such workloads the overhead due 
to checkpointing is negligible. 
B.6.2. Performance of ARMOR-based Incremental Checkpointing in Error-Recovery Scenarios 
The database used in the test consists of five db files. Each file contains 100 tables, and each table 
contains two thousand 200-byte records. So the total size of the user database is 200 MB, a typical 
size for the database the target system processes in practice. Three clients are used in the error 
recovery scenario: (i) testsc, which updates the table records one after another without acquiring a 
file table mutex, (ii) testsc_ftmutex, which repeatedly acquires a file table mutex, updates the table 
records, and releases the mutex, and (iii) ftmutextest, which gets a file table mutex and emulates a 
crash while still holding the mutex. Table 25 presents measurements comparing the performances 
of both major recovery and ARMOR-based incremental checkpoint recovery. The time listed in 
Table 25 represents the recovery time, i.e., the time from the crash of the failed client (ftmutextest) 
to the first successful acquisition of a file table mutex by the waiting client (testsc_ftmutex). 
Table 25: Performance of Major Recovery and ARMOR-based Incremental Checkpointing 
 
Four experiments (three trials each) are conducted with different clients and 
recovery policies. The results reported in Table 25 indicate: 
• Major recovery can cause significant system downtime (11 to 31 seconds in our experiments). 
The downtime depends on how much data is loaded into memory when major recovery occurs. 
(Testsc_ftmutex updates a small fraction of table data, so recovery time is small. When 
testsc+testsc_ftmutex is used, since testsc updates a whole table, the loaded data is much larger, 
and recovery time is greater.) 
• ARMOR-based incremental checkpointing eliminates or significantly reduces the downtime 
due to client crashes: (i) the crash of a client does not impact other processes as long as they do 
not acquire the same mutex as the terminated client, (ii) an overhead of 2 to 6 seconds (in our 
measurements) is encountered by any process that attempts to acquire the same mutex as the 
terminated client, and (iii) recovery using checkpointing does not depend on the amount of loaded 
data, as there is no need to reload data. 
Availability. Availability of the database is estimated based on the data on recovery time, 
assuming different frequencies of crashes that require major recovery. (We consider database 
availability using Experiment 2 as an example.) The measured average recovery times for major 
recovery and checkpointing-based recovery are 28.3 s and 4.7 s, respectively. Table 26 shows the 
system’s availability for various error frequencies. One can see that under an error rate of one per 
week, checkpointing-based recovery provides about five nines of availability, which is one nine 
of improvement compared with the major-recovery-based solution. 
Table 26: Availability (Expr. 2) 
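The one-error-per-week case can be reproduced with a short calculation. The sketch assumes a simple model that the text does not spell out: every error costs exactly the measured average recovery time as downtime, and availability is the remaining fraction of the week. The function names are illustrative.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600


def availability(recovery_time_s, failures_per_week=1.0):
    """Fraction of a week the system is up, assuming downtime equals recovery time."""
    downtime = recovery_time_s * failures_per_week
    return 1.0 - downtime / SECONDS_PER_WEEK


def nines(a):
    """Number of leading nines in an availability figure."""
    return math.floor(-math.log10(1.0 - a))


major = availability(28.3)  # major recovery, Experiment 2 average
ckpt = availability(4.7)    # checkpointing-based recovery, Experiment 2 average
print(f"major recovery: {major:.7f} ({nines(major)} nines)")
print(f"checkpointing:  {ckpt:.7f} ({nines(ckpt)} nines)")
```

Under this model, major recovery gives roughly 0.99995 (four nines) and checkpointing-based recovery roughly 0.999992 (five nines), which matches the one-nine improvement stated above.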
 
B.7. Related Work 
A number of checkpoint techniques have been proposed to ensure the durability of MMDBs. In 
Hagmann’s fuzzy checkpointing [107], the checkpoint is taken while the transaction is in progress. 
An improved variant of fuzzy checkpointing is proposed in [108] and [109]. Non-fuzzy 
checkpointing algorithms are introduced in [110], [111] and [112].  
Levy and Silberschatz [113] design an incremental checkpointing scheme that decouples 
transaction processing and checkpointing. The propagator component observes the log at all 
times and propagates the updates of the primary copy in memory to the backup copy on disk. 
While these traditional techniques rely on control structures to checkpoint user data, we address 
checkpointing of control structures themselves. 
Sullivan and Stonebraker [114] investigate the use of hardware memory protection to prevent 
erroneous (due to addressing errors) writes to the data structures. In [115], Bohannon et al. 
achieve such protection by computing a codeword over a region in the data structures. Upon a 
write, the data region and the associated codeword are updated. A wild write results in an 
incorrect codeword, which triggers recovery of the corrupted data region. These schemes protect 
the critical control structures against erroneous writes. Our checkpointing algorithms defend 
against client crashes and data inconsistency, which is a different failure model. Another 
technique that addresses this type of failure is process duplication. For example, Tandem’s 
process-pair mechanism [116] provides a spare process for the primary one. The primary executes 
transactions and sends checkpoint messages to the spare. If the primary fails, the spare 
reconstructs the consistent state from the checkpoint messages. The idea of lightweight, 
recoverable virtual memory in the context of providing transactional guarantees to applications is 
explored in [117]. A Rio Vista system for building high-performance recoverable memory for 
transactional systems is proposed in [118]. 
B.8. Conclusions 
This appendix presents ARMOR-based, transparent, and performance-efficient recovery of 
control structures in a commercial MMDB. The proposed generic solution allows eliminating or 
significantly reducing cases requiring major recovery and, hence, significantly improves 
availability. The solution can be easily adapted to provide system-wide detection and recovery. 
Performance measurements and availability estimates show that the proposed ARMOR-based 
checkpointing scheme enhances database availability while keeping performance overhead quite 
small (less than 2% in a typical workload of real applications). 
 
 
REFERENCES 
[1] AIX 5L for POWER Version 5.3 datasheet, IBM Corporation, 2004.  
[2] A. Robertson, “The evolution of the Linux-HA project,” presented at UKUUG 
LISA/Winter Conference on High-Availability and Reliability, Bournemouth, UK, 2004.  
[3] C. Clark et al., “Live migration of virtual machines,” in Proceedings of USENIX 
Symposium on Networked Systems Design and Implementation, 2005, pp. 273-286. 
[4] G. Hoglund and J. Butler, Rootkits: Subverting the Windows Kernel. Stoughton, MA: 
Addison-Wesley Professional, 2005. 
[5] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, “An empirical study of operating 
systems errors,” in Proceedings of ACM Symposium on Operating Systems Principles, 
2001, pp. 73-88. 
[6] IA-32 Intel® Architecture Software Developer’s Manual, Volume 3: System Programming 
Guide, Intel Corp., 2006. 
[7] P. Koopman and J. DeVale, “The exception handling effectiveness of POSIX operating 
systems,” IEEE Transactions on Software Engineering, vol. 26, no. 9, pp. 837-848, 2000. 
[8] Y. S. Li, S. Malik, and A. Wolfe, “Performance estimation of embedded software with 
instruction cache modeling,” ACM Transactions on Design Automation of Electronic 
Systems, vol. 4, no. 3, pp. 257-279, 1999. 
[9] J. S. Plank, M. Beck, G. Kingsley, and K. Li, “Libckpt: transparent checkpointing under 
Unix,” in Proceedings of the Usenix Winter 1995 Technical Conference, New Orleans, LA, 
January 1995, pp. 213-223. 
[10] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou, “Flashback: a lightweight 
extension for rollback and deterministic replay for software debugging,” in Proceedings of 
USENIX Annual Technical Conference, General Track, 2004, pp. 29-44. 
[11] L. Stein and D. MacEachern, Writing Apache Modules with Perl and C. Sebastopol, CA: 
O'Reilly & Associates, Inc., 1999.  
[12] D. T. Stott, B. Floering, Z. Kalbarczyk, and R. K. Iyer, “Dependability assessment in 
distributed systems with lightweight fault injectors in NFTAPE,” in Proceedings of IEEE 
International Computer Performance and Dependability Symposium, pp. 91-100, 2000. 
[13] G. Trent and M. Sake, WebSTONE: The First Generation in HTTP Server Benchmarking, 
white paper, MTS Silicon Graphics, 1995.  
[14] Y. Huang and C. Kintala, “Software-implemented fault tolerance: technologies and 
experience,” in Proceedings of IEEE International Symposium on Fault-Tolerant 
Computing, 1993, pp. 2-9. 
[15] L. Wang, Z. Kalbarczyk, W. Gu, and R. K. Iyer, “Reliability MicroKernel: providing 
application-aware reliability in the OS,” IEEE Transactions on Reliability, vol. 56, no. 4, 
pp. 597-614, 2007. 
[16] D. Beauregard, “Error injection-based failure profile of the IEEE 1394 bus,” M.S. thesis, 
University of Illinois at Urbana-Champaign, 2003. 
[17] PCI Watch Dog Timer Adapter Operation Manual, Decision Computer International, 
Germany, Nov. 2010. [Online]. Available:  
http://www.decision-computer.de/downld/Handbuch-Archiv/Watchdog/PCI%20WATCH
%20DOG%20Timer%20Adapter.pdf 
[18] D. Bovet and M. Cesati, “Checking the NMI watchdogs,” in Understanding the Linux 
Kernel, 3rd ed. A. Oram, editor. Sebastopol, CA: O'Reilly & Associates, Inc., 2005, pp. 
243-244. 
[19] M. Clavel et al., “The Maude 2.0 system,” in Proceedings of Rewriting Techniques and 
Applications, Springer-Verlag LNCS 2706, June 2003, pp. 76-87. 
[20] D. Cyrluk, “Microprocessor verification in PVS: a methodology and simple example,” 
SRI International, Menlo Park, CA, Technical Report SRI-CSL-93-12, 1993. 
[21] E. Börger, G. Fruja, V. Gervasi, and R. F. Stark, “A high-level modular definition of the 
semantics of C#,” Theoretical Computer Science, vol. 336, no. 2, pp. 235-284, 2005. 
[22] M. Barnett, E. Börger, Y. Gurevich, W. Schulte, and M. Veanes, “Using abstract state 
machines at Microsoft: a case study,” Lecture Notes in Computer Science of Abstract State 
Machines - Theory and Applications, vol. 1912, Springer-Verlag, 2000, pp. 367-379. 
[23] G. J. Holzmann, “The model checker SPIN,” IEEE Transactions on Software Engineering, 
vol. 23, no. 5, pp. 279-295, 1997. 
[24] D. Zhou and P. E. Black, “Formal specification of operating system operations,” in 
Proceedings of IEEE TC-ECBS and IFIP WG10.1 Joint Workshop on Formal 
Specifications of Computer Based Systems, 2001, pp. 69-73.  
[25] L. Luo and M.-Y. Zhu, “Partitioning-based operating system: a formal model,” Operating 
Systems Review, vol. 37, no. 3, pp. 23-35, 2003. 
[26] T. Ball and S. K. Rajamani, “The SLAM project: debugging system software via static 
analysis,” in Proceedings of ACM Symposium on Principles of Programming Languages, 
2002, pp. 1-3. 
[27] R. Kolanski and G. Klein, “Formalising the L4 microkernel API,” in Proceedings of 
Computing: The Australasian Theory Symposium, 2006, pp. 53-68. 
[28] K. Elphinstone, G. Klein, and R. Kolanski, “Formalising a high-performance 
microkernel,” in R. Leino, editor, Workshop on Verified Software: Theories, Tools, and 
Experiments, Microsoft Research, Redmond, WA, Technical Report MSR-TR-2006-117, 
pp. 1–7, 2006. 
[29] F. Chen and G. Rosu, “MOP: an efficient and generic runtime verification framework,” in 
Proceedings of ACM Object-Oriented Programming, Systems, Languages & Applications, 
2007, pp. 568-588. 
[30] A. Farzan, F. Chen, J. Meseguer, and G. Rosu, “Formal analysis of Java programs in 
JavaFAN,” in Lecture Notes in Computer Science of Computer Aided Verification, vol. 
3114, Springer-Verlag, 2004, pp. 501-505. 
[31] S. Chen, J. Xu, N. Nakka, Z. Kalbarczyk, and R. K. Iyer, “Defeating memory corruption 
attacks via pointer taintedness detection,” in Proceedings of IEEE/IFIP International 
Conference on Dependable Systems and Networks, 2005, pp. 378-387. 
[32] A. Cimatti, E. Clarke, F. Giunchiglia, and M. Roveri, “NuSMV: a new symbolic model 
verifier,” in Lecture Notes in Computer Science of Computer Aided Verification, vol. 1633, 
Springer-Verlag, 1999, pp. 495-499. 
[33] S. Owre and N. Shankar, “The formal semantics of PVS,” SRI International, Menlo Park, 
CA, Technical Report NASA/CR-1999-209321, 1997. 
[34] K. Hristova, “Improved algorithm complexities for linear temporal logic model checking 
of pushdown systems,” Lecture Notes in Computer Science of Verification, Model 
Checking, and Abstract Interpretation, vol. 3855, Springer-Verlag, 2006, pp. 190-206. 
[35] Z. Kalbarczyk, R. K. Iyer, and L. Wang, “Application fault tolerance with ARMOR 
middleware,” IEEE Internet Computing, vol. 9, no. 2, pp. 28-37, 2005. 
[36] K. Chanchio, C. Leangsuksun, H. Ong, V. Ratanasamoot, and A. Shafi, “An efficient 
virtual machine checkpointing mechanism for hypervisor-based HPC systems,” in 
Proceedings of High Availability and Performance Computing Workshop, 2008. 
[37] A. Kangarlou, P. Eugster, and D. Xu, “VNsnap: taking snapshots of virtual networked 
environments with minimal downtime,” in Proceedings of IEEE/IFIP International 
Conference on Dependable Systems and Networks, 2009, pp. 524-533. 
[38] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, “Remus: 
high availability via asynchronous virtual machine replication,” in Proceedings of 
USENIX Symposium on Networked Systems Design and Implementation, 2008, pp. 
161-174. 
[39] W. Gu, Z. Kalbarczyk, and R. K. Iyer, “Error sensitivity of the Linux kernel executing on 
PowerPC G4 and Pentium 4 processors,” in Proceedings of IEEE/IFIP International 
Conference on Dependable Systems and Networks, 2004, pp. 887-896. 
[40] M.-L. Li, P. Ramachandran, S. K. Sahoo, S. V. Adve, V. S. Adve, and Y. Zhou, 
“Understanding the propagation of hard errors to software and implications for resilient 
system design,” in Proceedings of ACM International Conference on Architectural 
Support for Programming Languages and Operating Systems, 2008, pp. 265-276. 
[41] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August, “SWIFT: software 
implemented fault tolerance,” in Proceedings of the International Symposium on Code 
Generation and Optimization, 2005, pp. 243-254. 
[42] K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, “Automated derivation of 
application-aware error detectors using static analysis,” in Proceedings of IEEE 
International On-Line Testing Symposium, 2007, pp. 211-216. 
[43] J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, 
“Fingerprinting: bounding soft-error detection latency and bandwidth,” in Proceedings of 
ACM International Conference on Architectural Support for Programming Languages 
and Operating Systems, 2004, pp. 224-234. 
[44] J. Gray, “Why do computers stop and what can be done about it?” in Proceedings of the 
Symposium on Reliability in Distributed Software and Database Systems, 1986, pp. 3-12. 
[45] J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas, “ReViveI/O: efficient 
handling of I/O in highly-available rollback-recovery servers,” in Proceedings of IEEE 
International Symposium on High Performance Computer Architecture, 2006, pp. 
200-211. 
[46] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science 
Applications. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc., 2002. 
[47] J. Demmel, X. Li, C. Puscasiu, and S. Timson. (1993, Sep.). The CLAPACK free software 
package. [Online]. Available: http://www.netlib.org/clapack/ 
[48] J. Xu, Z. Kalbarczyk, and R. K. Iyer, “Networked Windows NT system field failure data 
analysis,” in Proceedings of IEEE Pacific Rim International Symposium on Dependable 
Computing, 1999, pp. 178-185.  
[49] H. P. Reiser, F. J. Hauck, R. Kapitza, and W. Schroder-Preikschat, “Hypervisor-based 
redundant execution on a single physical host,” in Proceedings of IEEE European 
Dependable Computing Conference, supplemental volume, 2006, pp. 2-7. 
[50] S. Chandra and P. M. Chen, “The impact of recovery mechanisms on the likelihood of 
saving corrupted state,” in Proceedings of the International Symposium on Software 
Reliability Engineering, 2002, pp. 91-101. 
[51] R. Rashid et al., “Mach: a foundation for open systems,” in Proceedings of the 2nd 
Workshop on Workstation Operating System, 1989, pp. 109-113. 
[52] M. Rozier et al., “Overview of the CHORUS distributed operating systems,” in 
Proceedings of USENIX Workshop on Microkernels and Other Kernel-Architectures, 1992, 
pp. 39-69. 
[53] F. Salles, M. Rodriguez, J.-C. Fabre, and J. Arlat, “MetaKernels and fault containment 
wrappers,” in Proceedings of the International Symposium on Fault-Tolerant Computing, 
1999, pp. 22-29. 
[54] N. Nakka, Z. Kalbarczyk, R. K. Iyer, and J. Xu, “An architectural framework for 
providing reliability and security support,” in Proceedings of IEEE/IFIP International 
Conference on Dependable Systems and Networks, 2004, pp. 585-594. 
[55] J. Arlat, J.-C. Fabre, M. Rodriguez, and F. Salles, “Dependability of COTS 
microkernel-based systems,” IEEE Transactions on Computers, vol. 51, no. 2, 2002, pp. 
138-163. 
[56] M. Russinovich and Z. Segall, “Fault-tolerance for off-the-shelf applications and 
hardware,” in Proceedings of the International Symposium on Fault-Tolerant Computing, 
1995, pp. 67-71. 
[57] N. Nakka, G. P. Saggese, Z. Kalbarczyk, and R. K. Iyer, “An architectural framework for 
detecting process hangs/crashes,” in Proceedings of IEEE European Dependable 
Computing Conference, 2005, pp. 103-121. 
[58] S. Osman, D. Subhraveti, G. Su, and J. Nieh, “The design and implementation of Zap: a 
system for migrating computing environments,” in Proceedings of USENIX Symposium on 
Operating Systems Design and Implementation, 2002, pp. 361-376. 
[59] E. Pinheiro. (1997, Dec.). “Truly transparent checkpointing of parallel applications,” 
Federal University of Rio de Janeiro, Brazil, Internal Report. [Online]. Available: 
http://www.research.rutgers.edu/~edpin/epckpt/epckpt.ps.gz. 
[60] L. Wang, Z. Kalbarczyk, R. K. Iyer, H. Vora, and T. Chahande, “Checkpointing of control 
structures in main memory database systems,” in Proceedings of IEEE/IFIP International 
Conference on Dependable Systems and Networks, 2004, pp. 687-692. 
[61] P. Bernstein, “Sequoia: a fault-tolerant tightly coupled multiprocessor for transaction 
processing,” IEEE Computer, vol. 21, no. 2, pp. 37-45, 1988. 
[62] N. Neves and W. K. Fuchs, “RENEW: a tool for fast and efficient implementation of 
checkpoint protocols,” in Proceedings of the International Symposium on Fault-Tolerant 
Computing (FTCS), 1998, pp. 58-67. 
[63] L. Wang, K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, “Modeling coordinated 
checkpointing for large-scale supercomputers,” in Proceedings of IEEE/IFIP 
International Conference on Dependable Systems and Networks, 2005, pp. 812-821. 
[64] M. Beck, J. S. Plank, and G. Kingsley, “Compiler-assisted checkpointing,” University of 
Tennessee, TN, Technical Report UT-CS-94-269, 1994. 
[65] M. Nelson, B.-H. Lim, and G. Hutchins, “Fast transparent migration for virtual machines,” in 
Proceedings of USENIX Annual Technical Conference, General Track, 2005, pp. 
391-394. 
[66] P. Colp, “VM snapshots,” presented at Xen Summit North America at Oracle, Redwood 
City, CA, Feb. 2009. [Online]. Available: 
http://www.xen.org/files/xensummit_oracle09/VMSnapshots.pdf 
[67] R. Bradford, E. Kotsovinos, A. Feldmann, and H. Schioberg, “Live wide-area migration of 
virtual machines including local persistent state,” in Proceedings of ACM International 
Conference on Virtual Execution Environments, 2007, pp. 169-179. 
[68] J. G. Hansen and E. Jul, “Self-migration of operating systems,” in Proceedings of the 11th 
Workshop on ACM SIGOPS European Workshop, 2004, Article No. 23. 
[69] T. Distler, R. Kapitza, and H. P. Reiser, “Efficient state transfer for hypervisor-based 
proactive recovery,” in Proceedings of the 2nd Workshop on Recent Advances on 
Intrusion-Tolerant Systems, 2008, Article No. 4. 
[70] A. B. Nagarajan and F. Mueller, “Proactive fault tolerance for HPC with Xen 
virtualization,” in Proceedings of International Conference on Supercomputing, 2007, pp. 
23-32. 
[71] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif, “Black-box and gray-box 
strategies for virtual machine migration,” in Proceedings of USENIX Symposium on 
Networked Systems Design and Implementation, 2007, pp. 229-242. 
[72] M. Vrable et al., “Scalability, fidelity, and containment in the Potemkin virtual 
honeyfarm,” in Proceedings of ACM Symposium on Operating Systems Principles, 2005, 
pp. 148-162. 
[73] H. A. Lagar-Cavilla, J. A. Whitney, A. Scannell, P. Patchin, and S. M. Rumble, 
“SnowFlock: rapid virtual machine cloning for cloud computing,” in Proceedings of ACM 
European Conference on Computer Systems, 2009, pp. 1-12. 
[74] IBM Technical Staff. (2006). Developer for System z, Version 7.0, Enterprise COBOL for 
z/OS, Version 3.4, Programming Guide. IBM Corporation, New York, NY. [Online]. 
Available: 
http://publib.boulder.ibm.com/infocenter/ratdevz/v7r1m1/index.jsp?topic=/com.ibm.ent.c
bl.zos.doc/topics/tpchk04.htm 
[75] L. Gruenwald, J. Huang, M. H. Dunham, J.-L. Lin, and A. C. Peltier, “Survey of recovery 
in main memory databases,” Engineering Intelligent Systems, vol. 4, no. 3, pp. 177-184, 
1996. 
[76] M. Prvulovic, Z. Zhang, and J. Torrellas, “ReVive: cost-effective architectural support for 
rollback recovery in shared-memory multiprocessors,” in Proceedings of ACM IEEE 
International Symposium on Computer Architecture, 2002, pp. 111-122. 
[77] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood, “SafetyNet: improving the 
availability of shared memory multiprocessors with global checkpoint/recovery,” in 
Proceedings of ACM IEEE International Symposium on Computer Architecture, 2002, pp. 
123-134. 
[78] N. R. Adiga et al., “An overview of the Blue Gene/L supercomputer,” in Proceedings of 
IEEE International Conference on Supercomputing, November 2002, Article No. 60. 
[79] K. M. Chandy and L. Lamport, “Distributed snapshots: determining global states of 
distributed systems,” ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63-75, 
1985. 
[80] J. W. Young, “A first order approximation to the optimum checkpoint interval,” 
Communications of the ACM, vol. 17, no. 9, pp. 530-531, 1974. 
[81] J. Daly, “A model for predicting the optimum checkpoint interval for restart dumps,” 
Lecture Notes in Computer Science of International Conference on Computational 
Science, vol. 2660, Springer-Verlag, 2003, pp. 3-12. 
[82] G. P. Kavanaugh and W. H. Sanders, “Performance analysis of two time-based 
coordinated checkpointing protocols,” in Proceedings of IEEE Pacific Rim International 
Symposium on Fault Tolerant Systems, 1997, pp. 194-201. 
[83] J. S. Plank and M. G. Thomason, “The average availability of parallel checkpointing 
systems and its importance in selecting runtime parameters,” in Proceedings of IEEE 
International Symposium on Fault-Tolerant Computing, 1999, pp. 250-257. 
[84] E. N. Elnozahy, J. S. Plank, and W. K. Fuchs, “Checkpointing for peta-scale systems: a 
look into the future of practical rollback-recovery,” IEEE Transactions on Dependable 
and Secure Computing, vol. 1, no. 2, pp. 97-108, 2004. 
[85] N. H. Vaidya, “On checkpoint latency,” in Proceedings of IEEE Pacific Rim International 
Symposium on Fault-Tolerant Systems, 1995, pp. 60-65. 
[86] F. Petrini, K. Davis, and J. C. Sancho, “System level fault tolerance in large-scale parallel 
machines,” in Proceedings of IEEE International Parallel and Distributed Processing 
Symposium, 2004, pp. 209-216. 
[87] D. Tang and R. K. Iyer, “Analysis and modeling of correlated failures in multicomputer 
systems,” IEEE Transactions on Computers, vol. 41, no. 5, pp. 567-577, 1992. 
[88] G. Kulkarni, V. F. Nicola, and K. S. Trivedi, “The completion time of a job on multimode 
systems,” Advances in Applied Probability, vol. 19, no. 4, pp. 932-954, 1987. 
[89] G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill, “Automated application-level 
checkpointing of MPI programs,” in Proceedings of ACM SIGPLAN Symposium on 
Principles and Practice of Parallel Programming, 2003, pp. 84-94. 
[90] S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira, “Adaptive incremental 
checkpointing for massively parallel systems,” in Proceedings of IEEE International 
Conference on Supercomputing, 2004, pp. 277-286. 
[91] Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo, “Performance 
implications of failures in large-scale cluster scheduling,” in Lecture Notes in Computer 
Science of the 10th Workshop on Job Scheduling Strategies for Parallel Processing, vol. 
3277, Springer-Verlag, 2004, pp. 233-252. 
[92] B. Tuthill, K. Johnson, and T. Schultz, IRIX Checkpoint and Restart Operation Guide, 
Silicon Graphics, Inc., Mountain View, CA, 1999. 
[93] R. Koo and S. Toueg, “Checkpointing and rollback-recovery for distributed systems,” 
IEEE Transactions on Software Engineering, vol. 13, no. 1, pp. 23-31, 1987. 
[94] L. G. Valiant, “A bridging model for parallel computation,” Communications of the ACM, 
vol. 33, no. 8, pp. 103-111, 1990. 
[95] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation. 2nd 
ed. Newton, MA: Digital Press, 1992. 
[96] L. Spainhower and T. A. Gregg, “IBM S/390 parallel enterprise server G5 fault tolerance: 
a historical perspective,” IBM Journal of Research and Development, vol. 43, no. 5, pp. 
863-873, 1999. 
[97] R. K. Iyer and D. Rossetti, “A measurement-based model for workload dependence of 
CPU errors,” IEEE Transactions on Computers, vol. C-35, no. 6, pp. 511-519, 1986. 
[98] L. Wang, K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, “Modeling coordinated 
checkpointing for large-scale supercomputers,” University of Illinois at 
Urbana-Champaign, IL, Internal Report, 2005. 
[99] T. Courtney, D. Daly, S. Derisavi, V. Lam, and W. H. Sanders, “The Möbius modeling 
environment,” presented at Tools of the 2003 Illinois International Multiconference on 
Measurement, Modelling, and Evaluation of Computer Communication Systems, 
Universität Dortmund Fachbereich Informatik, 2003. 
[100] E. Rosti, G. Serazzi, E. Smirni, and M. S. Squillante, “Models of parallel applications with 
large computation and I/O requirements,” IEEE Transactions on Software Engineering, 
vol. 28, no. 3, pp. 286-307, 2002. 
[101] H. Garcia-Molina and K. Salem, “Main memory database systems: an overview,” IEEE 
Transactions on Knowledge and Data Engineering, vol. 4, no. 6, pp. 509-516, 1992. 
[102] Z. Kalbarczyk, R. K. Iyer, S. Bagchi, and K. Whisnant, “Chameleon: a software 
infrastructure for adaptive fault tolerance,” IEEE Transactions on Parallel and 
Distributed Systems, vol. 10, no. 6, pp. 560-579, 1999. 
[103] K. Whisnant, Z. Kalbarczyk, and R. K. Iyer, “A system model for dynamically 
reconfigurable software,” IBM Systems Journal, vol. 42, no. 1, pp. 45-59, 2003. 
[104] P. Bohannon, R. Rastogi, A. Silberschatz, and S. Sudarshan, “The architecture of the Dali 
main memory storage manager,” Journal of Multimedia Tools and Applications, vol. 4, no. 
2, pp. 115-151, 1997. 
[105] P. Bohannon, D. Lieuwen, R. Rastogi, A. Silberschatz, S. Seshadri, and S. Sudarshan, 
“Distributed multi-level recovery in main memory databases,” in Proceedings of the 
International Conference on Parallel and Distributed Information Systems, 1996, pp. 
41-71. 
[106] K. Whisnant, R. K. Iyer, Z. Kalbarczyk, and P. Jones, “An experimental evaluation of the 
REE SIFT environment for spaceborne applications,” in Proceedings of IEEE/IFIP 
International Conference on Dependable Systems and Networks, 2002, pp. 585-594. 
[107] R. B. Hagmann, “A crash recovery scheme for a memory-resident database system,” IEEE 
Transactions on Computers, vol. 35, no. 9, pp. 839-843, 1986. 
[108] X. Li et al., “Checkpointing and recovery in partitioned main memory databases,” in 
Proceedings of IASTED/ISMM International Conference on Intelligent Information 
Management Systems, 1995, pp. 59-63. 
[109] J. Lin and M. Dunham, “A performance study of dynamic segmented fuzzy checkpointing 
in memory resident databases,” Southern Methodist University, Dallas, TX, Technical 
Report TR96-CSE-14, 1996. 
[110] J. Huang and L. Gruenwald, “An update-frequency-valid interval partition checkpoint 
technique for realtime main memory databases,” in Proceedings of Workshop on 
Real-Time Databases, 1996, pp. 135-143. 
[111] T. Lehman and M. Carey, “A recovery algorithm for a high-performance, 
memory-resident database system,” in Proceedings of ACM SIGMOD International 
Conference on Management of Data, 1987, pp. 104-117. 
[112] K. Salem and H. Garcia-Molina, “Checkpointing memory resident databases,” in 
Proceedings of International Conference on Data Engineering, 1989, pp. 452-462. 
[113] E. Levy and A. Silberschatz, “Incremental recovery in main memory database systems,” 
IEEE Transactions on Knowledge and Data Engineering, vol. 4, no. 6, pp. 529-540, 1992. 
[114] M. Sullivan and M. Stonebraker, “Using write-protected data structures to improve 
software fault tolerance in highly available database management systems,” in 
Proceedings of International Conference on Very Large Databases, 1991, pp. 171-180. 
[115] P. Bohannon, R. Rastogi, S. Seshadri, A. Silberschatz, and S. Sudarshan, “Detection and 
recovery techniques for database corruption,” IEEE Transactions on Knowledge and Data 
Engineering, vol. 15, no. 5, pp. 1120-1136, 2003. 
[116] J. Bartlett, “A nonstop kernel,” in Proceedings of ACM Symposium on Operating Systems 
Principles, 1981, pp. 22-29. 
[117] M. Satyanarayanan, H. H. Mashburn, P. Kumar, D. C. Steere, and J. J. Kistler, 
“Lightweight recoverable virtual memory,” in Proceedings of ACM Symposium on 
Operating Systems Principles, 1993, pp. 146-160. 
[118] D. Lowell and P. Chen, “Free transactions with Rio Vista,” in Proceedings of ACM 
Symposium on Operating Systems Principles, 1997, pp. 92-101. 
 
