The piecewise deterministic execution model is a fundamental assumption in many log-based rollback-recovery protocols. Process execution in this model consists of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic and messages are the only source of nondeterminism that affects the computation. This simple model excludes many forms of nondeterminism that exist in practice. In particular, applications may experience nondeterminism due to receiving asynchronous signals or interrupts at arbitrary execution points. Also, multithreaded applications experience nondeterminism due to interleaved shared memory access among threads. We present a solution that removes these restrictions and enables support for these two forms of nondeterminism in log-based rollback-recovery.
Introduction

Problem Description and Motivations
Many log-based rollback-recovery protocols have been proposed in the literature [1, 2, 5, 6, 11, 14, 15, 17-22, 24, 26, 27, 29, 30, 34-41]. These protocols use a combination of process checkpointing and message logging to recover from failures. A checkpoint typically contains a snapshot of the process's state and sufficient information to restart the computation from the execution point at which the state was saved. The log typically contains the messages that a process receives during normal operation. Should a failure occur, a process restarts from a saved checkpoint and reconstructs its pre-failure execution by replaying the messages from the log.
Different flavors of logging have been suggested with different performance and resilience characteristics [2]. Despite these differences, most protocols rely on the piecewise deterministic execution model [13]. Process execution in this model consists of a sequence of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic, and thus the outcome depends on the state at the beginning of the interval and the message that starts it. Starting from a given initial state, a process will always reach the same final state if it receives the same messages in the same order in each run.
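The model can be illustrated with a minimal sketch. The transition function `deliver`, the message strings, and the initial state are all hypothetical; the only point is that the final state is a deterministic function of the initial state and the ordered message log.

```python
from functools import reduce

def deliver(state, message):
    """One deterministic interval: the next state depends only on the
    current state and the message that starts the interval."""
    return (state * 33 + sum(message.encode())) % 65521

def run(initial_state, messages):
    """Fold the message sequence over the initial state."""
    return reduce(deliver, messages, initial_state)

log = ["open", "write:42", "close"]            # hypothetical message log
before_failure = run(7, log)
after_replay = run(7, log)                     # same state, same messages, same order
reordered = run(7, ["write:42", "open", "close"])
```

Replaying the log in order reproduces the pre-failure state, while the reordered run ends in a different state, which is why the order of receipt must be part of the log.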
Messages in this model are the only form of nondeterminism that affects the computation. Previous research has proposed that to realize this execution model, an implementation should convert all sources of nondeterminism found in practice into messages that processes receive at application-defined execution points [34]. In practice, however, this realization has proved difficult, as there are several forms of nondeterminism that cannot be logged efficiently or converted into synchronous messages [6, 13]. We consider two such forms, namely nondeterminism that results from asynchronous software signals, and nondeterminism that results from shared memory access in a multithreaded application. The difficulty of logging these two forms of nondeterminism lies in their asynchronous nature. Software signals, for instance, must be applied at the same execution points during an execution replay to reproduce the same result. Figure 1 shows an example of how a signal could affect the flow of execution. The effect of the signal is to change the value of the variable i. During recovery, if the signal is not replayed at the same location, the execution will differ from the pre-failure run. Moreover, it is not sufficient to replay this signal at the same address within the program's code. It is also necessary to replay it in the same iteration of the loop, so that the loop count is consistent with its value in the pre-failure execution.
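The loop scenario can be mimicked with a small simulation. This is a sketch only: the function `run_loop`, the loop body, and the handler effect (overwriting `i` with 100) are assumptions chosen for illustration and are not the code of Figure 1.

```python
def run_loop(iterations, signal_at_iteration):
    """Run a toy loop and deliver a 'signal' at the top of one iteration.

    The signal is always delivered at the same code address (the top of
    the loop body); only the iteration in which it arrives varies.
    """
    i = 0
    total = 0
    for iteration in range(iterations):
        if iteration == signal_at_iteration:
            i = 100          # assumed handler effect: the signal changes i
        total += i
        i += 1
    return total

original = run_loop(10, signal_at_iteration=3)
faithful_replay = run_loop(10, signal_at_iteration=3)
wrong_iteration = run_loop(10, signal_at_iteration=7)   # same address, other iteration
```

The faithful replay matches the original run, whereas delivering the signal at the same address in a different iteration produces a different result.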
Nondeterminism that results from shared memory manipulation in multithreaded applications is also difficult to track. Reconstructing the same execution during replay requires the same interleaving of shared memory accesses by the various threads as in the pre-failure execution. Consider for example the scenario in Figure 2. Three threads, A, B, and C, run in the same process within a single application, and modify and use a variable i. The solid lines indicate one possible flow of execution (A; B; C), the dashed lines indicate another (A; C; B). If the solid line represents the flow of execution before failure, then the dashed line could represent a divergence during execution replay if not handled correctly. Not only does the thread execution order have to be maintained, but the context switches between threads must also occur at the same address in the instruction stream. Otherwise, the variable i might not be set to the appropriate value.
Previous systems have dealt with these two forms of nondeterminism in different ways. Some systems tolerated them by forcing processes to take checkpoints before sending a message or producing an output to an end user [11]. The system need not reproduce the nondeterministic events, as it would start from a checkpoint that occurred after the events affected the computation. This solution, however, has a very large overhead, especially when a checkpoint has to involve several processes. Other systems converted these forms of nondeterminism into synchronous messages that can be logged efficiently [6, 16]. This solution is not satisfactory because it entails substantial changes to applications and operating systems. Additionally, it cannot support existing applications that rely on asynchronous notification through software signals.
Our Solution
We present in this paper an efficient solution to the problem of logging and replaying asynchronous signals in log-based rollback-recovery. This solution is applicable to any log-based protocol, and does not require hardware support. It relies on the following simple observation: The outcome of a program that receives software interrupts depends on the points within the execution at which the interrupts occur. Thus, if the program always starts from a given state and receives the same interrupts at the same execution points, it will produce the same output. We use instruction counters to enable this form of repeatable execution [8]. Using these counters, it is possible to log the number of instructions executed between consecutive interrupts during normal operation. If a failure occurs, the instruction counts in the log and the counters are used to reapply the signals at the same execution points. Thus, a rollback-recovery system can efficiently track nondeterminism due to asynchronous software interrupts.
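The observation can be sketched with a toy deterministic program. Everything here is an assumption made for illustration: `step` stands in for one machine instruction, `handler` for a signal handler, and signals are modeled as arriving just before a given instruction index.

```python
def step(state):
    """One deterministic 'instruction' of a toy program."""
    return (state * 31 + 7) % 1_000_003

def handler(state):
    """Assumed signal handler: perturbs the program state."""
    return state ^ 0x5A5A

def run(initial, n_instructions, signal_points):
    """Execute n instructions, applying the handler just before each listed
    instruction index. Returns the final state and a log holding the number
    of instructions executed between consecutive signals."""
    state, log, last = initial, [], 0
    pending = sorted(signal_points)
    for executed in range(n_instructions):
        while pending and pending[0] == executed:
            pending.pop(0)
            log.append(executed - last)   # instructions since the previous signal
            last = executed
            state = handler(state)
        state = step(state)
    return state, log

def replay(initial, n_instructions, log):
    """Rebuild the delivery points from the logged counts and re-execute."""
    points, position = [], 0
    for gap in log:
        position += gap
        points.append(position)
    return run(initial, n_instructions, points)[0]

final_state, log = run(1, 1000, [17, 420, 421])
replayed_state = replay(1, 1000, log)
```

Starting from the same initial state and reapplying the handler after the logged instruction counts yields the same final state as the original run.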
Based on this technique, we have built a user-level thread package that allows efficient logging and exact replay of shared memory manipulations by multithreaded applications in a single-process environment. This package relies on a timer signal to schedule threads within an application. Using our solution for tracking signals, we can log and replay these timer signals to reproduce the same thread scheduling decisions that occur while an application is running. This way, the interleaving of shared memory access during normal operation is replayed and the same state is reproduced should a failure occur. Thus, we can use the same solution to provide a uniform treatment for nondeterminism due to software signals and multithreading.
Instruction counters can be implemented in hardware or emulated in software. We have chosen to use software emulation to avoid the need for specialized hardware support that may not exist on many systems. However, this software emulation is necessarily processor-dependent. Performance measurements taken from implementations on two different systems show a reasonable overhead during failure-free operation. Thus, the restrictions of the piecewise deterministic execution model can be removed at a reasonable cost.
The solution in this paper does not handle other types of nondeterminism, such as results from system calls and values read from an external input source or message. These have been handled by previous work [6, 11, 16, 17]. The performance study is also independent of any particular checkpointing or logging protocol to prevent any interference with the measurements. The rest of the paper is organized as follows. Section 2 describes the method proposed for tracking interrupts in modern RISC and CISC architectures. Section 3 discusses implementation issues in a rollback-recovery system. Section 4 contains an evaluation of the performance of the implementations presented in the previous section. A comparison with related research efforts takes place in Section 5, and Section 6 concludes the paper.
Efficient Interrupt Logging and Replay
We use instruction counters to record the execution points at which interrupts occur during normal operation. This information is stored in a log and used after a failure to force the interrupts to occur at the same execution points. Thus, the pre-failure effect of an interrupt is reproduced during execution replay, and the system reconstructs the pre-failure execution. We first describe instruction counters and then show how we emulated them in software on two different architectures.
Instruction Counters
An instruction counter is a register that is decremented upon the execution of each instruction [8]. The hardware generates an exception when the register content becomes zero. An instruction counter can be used in two modes. In one, the register is loaded with the number of instructions to be executed. After the CPU executes the specified number of instructions, an exception is generated and propagated to a pre-specified handler. This mode is useful in setting breakpoints efficiently, such as during debugging. In the second mode, the instruction counter is loaded with the maximum value it can hold. Execution proceeds until an event of interest occurs, at which time the content of the counter is sampled, and the number of instructions executed since the time the counter has been set is recorded. In both modes care must be taken to avoid problems where the register might underflow. For example, a 32-bit register would underflow roughly every 20 seconds of execution on a 200 MHz machine. Reinitializing the register at well-defined points resolves this issue. The use of instruction counters has been suggested for debugging shared memory parallel programs [8, 25, 28].
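The two modes can be modeled with a small sketch. The class `InstructionCounter` and its method names are hypothetical, and the underflow estimate assumes one instruction retired per clock cycle.

```python
class InstructionCounter:
    """Toy model of a decrementing instruction counter."""

    def __init__(self, bits=32):
        self.max_value = (1 << bits) - 1
        self.loaded = self.value = self.max_value

    def load(self, count):
        """Load the counter and remember the loaded value."""
        self.loaded = self.value = count

    def tick(self):
        """Account for one instruction; True signals the zero exception."""
        self.value -= 1
        return self.value == 0

    def elapsed(self):
        """Instructions executed since the last load (sampling mode)."""
        return self.loaded - self.value

counter = InstructionCounter()

# Mode 1: raise an exception after a specified number of instructions.
counter.load(5)
ticks_until_exception = 1
while not counter.tick():
    ticks_until_exception += 1

# Mode 2: load the maximum value, then sample when an event of interest occurs.
counter.load(counter.max_value)
for _ in range(1234):
    counter.tick()
sampled = counter.elapsed()

# Time for a 32-bit counter to underflow at 200 million instructions per second.
underflow_seconds = 2**32 / 200e6
```

Under the one-instruction-per-cycle assumption the 32-bit counter wraps after about 21.5 seconds, in line with the rough figure of 20 seconds quoted above.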
Instruction counters may be available in hardware, as in the HP PA-RISC architecture. They can also be emulated in software [25]. The emulation uses the observation that a count of instructions is not strictly necessary for execution replay. Instead, a count of the branches taken by the program during execution, together with the program counter, is sufficient to determine the exact location at which an event of interest occurs. Specifically, it is sufficient to count only the backward branches, jumps and subroutine calls. Thus, a general purpose register is set aside to count branches, and the code is instrumented such that the following code sequence is inserted before each branch instruction:
    Decrement Register
    Branch to handler if register = 0
Depending on the architecture, this code sequence is implemented using one to five machine instructions. We implemented support for instruction counter emulation on both a modern RISC and a CISC architecture, namely the DEC Alpha and the Intel Pentium, respectively. These two architectures are arguably the best RISC and CISC processors available. The first represents the RISC approach, which argues for an abundance of registers and very simple and fast instructions. The second represents the CISC design philosophy, with a small number of registers and powerful, complex instructions that result in small code footprints in the instruction cache.
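The two-step sequence inserted before each counted branch can be modeled as follows. This is only a sketch: `BranchCounter` and `before_branch` are hypothetical names, and a Python `while` loop stands in for a loop closed by a backward branch.

```python
class BranchCounter:
    """Software stand-in for the register reserved to count branches."""

    def __init__(self, initial, handler):
        self.register = initial
        self.handler = handler

    def before_branch(self):
        self.register -= 1            # Decrement Register
        if self.register == 0:        # Branch to handler if register = 0
            self.handler()

events = []
counter = BranchCounter(initial=3, handler=lambda: events.append("handler"))

i = 0
while True:
    i += 1
    if i >= 5:
        break                         # forward exit from the loop: not counted
    counter.before_branch()           # instrumentation before the backward branch
```

The handler fires exactly once, on the third backward branch, which is the point at which a replay system would take control.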
Instruction Counter Emulation for the DEC Alpha
A straightforward application of the technique by Mellor-Crummey and LeBlanc would have resulted in about 5 instructions per basic block [25]. The resulting overhead would be large on a RISC machine like the DEC Alpha [25]. Our solution, however, dramatically reduces the overhead by exploiting the DEC architecture features and the fact that the purpose of emulation is to support rollback-recovery. Using these features, we limit the instrumentation overhead to a single instruction added to every basic block in a program. Additionally, our technique produces two different versions of the instrumented program, one for normal operation and one for execution replay after a failure. Our approach thus differs from other instrumentation techniques, which produce the same program for normal operation and execution replay [25]. The goal is to allow better optimization for the normal operation of the program, even if it comes at the expense of a higher performance penalty in replay. This is consistent with the expected mode of operation in a rollback-recovery system, where execution replay is the exception rather than the norm.
Instrumentation for Normal Operation
To instrument a program for normal operation, the compiler sets aside two general purpose registers to emulate an instruction counter. These registers are s5 and s6, which are the last two callee-saved registers on the DEC Alpha processor. One register is used as the branch counter; the other is not used during normal operation but must be reserved to maintain the one-to-one mapping with the replay binary (as explained in Section 2.2.2). The assembly language output of the compiler is scanned for backward branches, jumps and subroutine calls. Before each such branch or subroutine call, an instruction is inserted to decrement the branch counter. The instrumentation exploits features of the DEC Alpha architecture to reduce to one the number of instructions necessary to emulate instruction counters. Specifically, no checks are made for register underflow, exploiting the 64-bit wide registers of the DEC Alpha. Even at an unrealistic rate of one backward branch every nanosecond, assuming a perfect 1 GHz machine, it would take 585 years for the register to wrap around. Furthermore, since the DEC Alpha has no condition codes, the decrement instruction can be added at the end of the basic block without causing any change in the processor's state. On other architectures, it may be necessary to compensate for the effects of the additional instrumentation instructions on the condition codes [25]. This compensation typically requires additional instructions and therefore additional overhead.
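The 585-year figure can be checked with a few lines of arithmetic. The sketch assumes a 365-day year and one counted branch per nanosecond, as in the text.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600      # assumption: 365-day year
branches_per_second = 1e9               # one backward branch every nanosecond

# A 64-bit register can hold 2**64 distinct values before wrapping around.
years_to_wrap = 2**64 / branches_per_second / SECONDS_PER_YEAR
```

The result rounds to 585 years, which is why the normal-operation binary can omit the wrap-around check entirely.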
An initialization routine is linked with every user program to set the branch counter to 0 at the beginning of the program. The user program and system libraries are prepared in this manner and can then be linked with an implementation of checkpointing and logging support.
Instrumentation for Replay
Execution replay starts by loading the branch counter with the count read from the log, and the program executes this specific number of branches. After these branches have been executed, a handler is called to set a breakpoint at the program counter (PC) indicated in the log. This breakpoint stops the execution of the program at the desired execution point, where the signal is applied to the application. Unfortunately, the DEC Alpha architecture does not include a single instruction which decrements a register and branches to a specific handler should the register's content become 0. Since, for efficiency, only one instruction is added during normal operation before each backward branch, jump and subroutine call, only one instruction should be added before each branch during replay to preserve the one-to-one mapping of addresses between the two binaries. This one-to-one mapping is necessary for correct replay, since the replay program uses logged PC values, which would be meaningless if the addresses were different between the two versions of the program.
Our solution involves inserting a subroutine jump instruction before each backward branch, jump and subroutine call. The instruction replaces the one that decrements the counter in the instrumented code for normal operation, maintaining the one-to-one mapping of the addresses between the replay and the traced versions of the program. The jump is to a simple routine that decrements the branch counter, checks for underflow and inserts the breakpoint if necessary, as described before. One register is set aside to hold the return address from this subroutine. The bsr instruction of the Alpha is used instead of jsr, since the former can be executed in one instruction while the latter requires an additional instruction to manipulate the global base register. One consequence of this choice is that our subroutine has to be within four Megabytes of the jump location. This is not a serious restriction, since this subroutine could be duplicated for each four Megabytes of code.
The simple routine for decrementing and branching consists of only four instructions for the normal case. These consist of the branch to the routine, a decrement instruction, a conditional branch on the counter if it has reached zero, and the return from the handler. When the counter reaches zero, fourteen additional instructions are executed (not including the instructions executed as part of the PALcode, the Privileged Architecture Library, to synchronize the instruction and data caches), which save the instruction at the PC read from the log and then set the breakpoint. When the breakpoint is reached, the handler for the signal recorded in the log file is called. After the signal handler returns, the next entry is read from the log file and the process repeats itself.
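The interplay between the branch counter, the logged PC and the breakpoint can be sketched on a toy machine. All names are hypothetical, and the model is a simplification: a single loop of `body_len` instructions closed by one backward branch, with signals arriving at given global instruction indices during normal operation.

```python
def execute(iterations, body_len, on_instruction, on_branch):
    """Toy machine: a loop body followed by one counted backward branch."""
    for _ in range(iterations):
        for pc in range(body_len):
            on_instruction(pc)
        on_branch()

def record(iterations, body_len, arrivals):
    """Normal operation: log (branches since last signal, pc, signal)."""
    log, trace = [], []
    st = {"branches": 0, "executed": 0}

    def on_instruction(pc):
        signal = arrivals.get(st["executed"])
        if signal is not None:
            log.append((st["branches"], pc, signal))
            trace.append((st["executed"], signal))
            st["branches"] = 0
        st["executed"] += 1

    def on_branch():
        st["branches"] += 1

    execute(iterations, body_len, on_instruction, on_branch)
    return log, trace

def replay(iterations, body_len, log):
    """Replay: count branches down, then break at the logged pc."""
    pending, trace = list(log), []
    st = {"entry": None, "counter": 0, "breakpoint": None, "executed": 0}

    def load_next():
        st["entry"] = pending.pop(0) if pending else None
        st["breakpoint"] = None
        if st["entry"] is not None:
            st["counter"] = st["entry"][0]
            if st["counter"] == 0:              # signal falls in the current interval
                st["breakpoint"] = st["entry"][1]

    def on_branch():
        if st["entry"] is None or st["breakpoint"] is not None:
            return
        st["counter"] -= 1
        if st["counter"] == 0:                  # set breakpoint at the logged pc
            st["breakpoint"] = st["entry"][1]

    def on_instruction(pc):
        if st["breakpoint"] == pc:              # breakpoint hit: apply the signal
            trace.append((st["executed"], st["entry"][2]))
            load_next()
        st["executed"] += 1

    load_next()
    execute(iterations, body_len, on_instruction, on_branch)
    return trace

arrivals = {3: "SIGALRM", 4: "SIGUSR1", 17: "SIGALRM"}
log, original_trace = record(6, 5, arrivals)
replayed_trace = replay(6, 5, log)
```

In this model the replayed signals land on the same global instruction indices as in the recorded run, using only the branch counts and PC values stored in the log.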
This solution thus consumes two registers, one to hold the branch counter and one to hold the return address from the "decrement-and-branch" subroutine. Alternatively, a single register could have been reserved and the counter kept in memory. However, since tests showed no appreciable difference in performance from reserving an additional register, we improved replay performance by keeping the counter in a register.
Instrumentation for the Intel Pentium PC
The arcane Pentium architecture proved difficult to instrument. First, there are no general purpose registers as in RISC or even other CISC processors. Each register in the Intel architecture has a certain role and instructions use the registers according to their role. This feature makes it difficult to pick a particular register as a holder for the branch count. Such a register and all instructions that use it will not be available to the application. Moreover, there are only eight registers available to the application, which increases the penalty of reserving one of them for instrumentation purposes. Furthermore, the Pentium architecture contains condition codes. Therefore, instructions cannot be arbitrarily placed in front of branches with no consideration for their possible effect on condition codes. Instead, expensive compensation instructions must be added to the instrumentation code to prevent the instrumentation instructions from altering the semantics of the application [25]. Finally, registers are 32 bits wide, which requires checking for underflow of the branch counter at each basic block.
Given all these limitations, there was no apparent benefit from preparing two different versions of the instrumented program as on the DEC Alpha. Thus, the same binary is used for both normal operation and replay on the Pentium. A command line switch or an environment variable can be used to determine which mode the program should run in.
Instrumentation for Normal Operation
We considered three alternatives for instrumenting the application code on the Intel Pentium. The alternatives differ in the selection of the placeholder of the branch counter and in the number of instructions that have to be added in each basic block. A key decision was whether or not to keep the counter in a register. Holding the counter in a register takes away from the application one of the processor's scarce registers. On the other hand, keeping the counter in memory would add the overhead of two additional memory accesses to every backward branch. Both options also required calling a subroutine to handle underflow of the counter. Yet a third option is to add a single instruction to each basic block, containing a jump to a routine that would decrement the counter and handle underflow. To ensure the best choice was made, the three solutions were implemented. Section 4 contains an analysis of the performance numbers gathered for each solution. The solution with the least impact on performance was the first one proposed, with the counter residing in a register. The chosen solution called for reserving register EBX to hold the counter. Fewer instructions use this register specifically, and it is also a callee-saved register. Both these reasons make this choice the one with the least impact on compilation. If the compiler uses the special instructions which require specific registers, then reserving this register has minimal impact. Also, callee-saved registers are the least used by most compilers.
Manipulating the EBX register, however, modifies the condition code register. To account for these effects, each backward branch must be transformed to a series of instructions that preserve the semantics of the program. This series of instructions converts a conditional branch instruction to its opposite (i.e., a jump on equal would become a jump on not equal) and makes it branch past the added code. The net effect of this is that only those backward branches which would be taken in uninstrumented code are counted. Figure 5 shows the instrumentation of a branch which requires the label:
. . . which do not rely on the condition codes and therefore do not require as much complexity in instrumentation. Finally, a subroutine call is made to handle the case of counter underflow, when the counter reaches zero. The solution for normal operation also applies to replay.
Implementation Issues
In addition to the instrumentation support, an implementation must also include support for signal handling, and inclusion of the instrumentation routine within a logging and replay system. Additionally, we have used the instrumentation to support a user-level thread package that provides repeatable thread scheduling across failures. We describe each of these issues.
Modification to Signal Handlers
When an instrumented program receives a signal, it must record the branch counter value in a log. It is therefore necessary to modify application-level handlers to perform this task while preserving the application semantics. In our implementation, the instrumentation code scans the application program for system calls that could be used by the application to register signal handlers (e.g., signal, sigaction, sigvec in both OSF/1 and NetBSD). It replaces these system calls with function calls that register the application-specified signal handler in a table, bypassing the operating system. Moreover, an initialization routine sets a special handler for all software signals that can be received by the application program. When the application receives a signal, this handler records the instruction count and the program counter in the log along with the signal type. Additionally, on the Pentium the log also contains the underflow count. The handler then searches the table containing the addresses of the application-specified handlers to locate the appropriate handler. The latter is called as a subroutine from within our handler to preserve the semantics of the application. Care is taken to track the application flow and account for the cases where a certain signal is prevented from affecting (or enabled to affect) the application flow.
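The handler table and the logging wrapper can be sketched as follows. This is a model rather than the implementation described above: `SignalInterposer`, `register` and `dispatch` are hypothetical names, no operating system call is made, and the branch counter and PC are supplied by caller-provided functions.

```python
class SignalInterposer:
    """Table of application handlers plus a logging dispatcher."""

    def __init__(self, read_branch_counter, read_pc):
        self.handlers = {}                 # signal number -> application handler
        self.log = []                      # (branch count, pc, signal number)
        self.read_branch_counter = read_branch_counter
        self.read_pc = read_pc

    def register(self, signum, handler):
        """Stand-in for the replaced signal/sigaction call: the handler is
        recorded in the table and the previous one is returned."""
        previous = self.handlers.get(signum)
        self.handlers[signum] = handler
        return previous

    def dispatch(self, signum):
        """Special handler set for every signal: log first, then call the
        application handler as a subroutine."""
        self.log.append((self.read_branch_counter(), self.read_pc(), signum))
        handler = self.handlers.get(signum)
        if handler is not None:
            handler(signum)

calls = []
interposer = SignalInterposer(read_branch_counter=lambda: 42, read_pc=lambda: 0x1000)
interposer.register(14, calls.append)      # application registers a handler
interposer.dispatch(14)                    # signal with a registered handler
interposer.dispatch(10)                    # signal with no application handler
```

Both signals are logged, but only the one with a registered application handler reaches application code.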
Support for Logging
The instrumentation techniques are independent of any particular checkpointing or logging protocol. We outline here the necessary support for including the instrumented program in a rollback-recovery system. The signal handler that we described in Section 3.1 produces a log entry each time a signal occurs. Each log entry consumes 24 bytes on the Alpha (two eight-byte fields, the PC and the branch counter, and two four-byte fields) and 20 bytes on the Pentium (five four-byte fields). The handler should simply be integrated with the message logging scheme used by the rollback-recovery protocol.
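The stated entry sizes can be reproduced with fixed-width records. The field widths follow the text; the field order, the names `signal` and `spare`, the placement of the underflow count, and the little-endian byte order are assumptions made for this sketch.

```python
import struct

# "<" selects little-endian with no padding between fields.
ALPHA_ENTRY = struct.Struct("<QQII")     # pc, branch_count, signal, spare
PENTIUM_ENTRY = struct.Struct("<IIIII")  # pc, branch_count, underflow_count, signal, spare

def pack_alpha(pc, branch_count, signal, spare=0):
    """Build one 24-byte Alpha-style log entry."""
    return ALPHA_ENTRY.pack(pc, branch_count, signal, spare)

def pack_pentium(pc, branch_count, underflow_count, signal, spare=0):
    """Build one 20-byte Pentium-style log entry."""
    return PENTIUM_ENTRY.pack(pc, branch_count, underflow_count, signal, spare)

alpha_record = pack_alpha(0x120001F40, 123456789, 14)
pentium_record = pack_pentium(0x08048ABC, 4096, 2, 14)
decoded = ALPHA_ENTRY.unpack(alpha_record)
```

Records of this size are small enough to be appended to whatever message log the rollback-recovery protocol already maintains.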
Support for Replay
If a process fails, the restart system will load the replay version of the code. This code is loaded along with the data that was stored in the checkpoint from which the restart begins. The replay version of the code is used to bring the process to the pre-failure state as recorded by the last entry in the log. Once execution replay is complete, the replay version of the code is replaced by the normal operation version and execution resumes.
The replay version of the code contains an initialization routine that runs after the state is restored to set a special handler for breakpoint exceptions. All other signals are masked until replay is complete. Execution replay proceeds under control as explained in Section 2, forcing interrupts to occur at the defined points as specified in the log. This support should be integrated, of course, with other log replay functions depending on the rollback-recovery protocol used. Once the entire log is run, a checkpoint is taken and the replay version of the program is stopped. The normal version of the program is restarted from the reconstructed checkpoint. This avoids extending the performance penalty of execution replay beyond failure recovery.
Support for Multithreaded Applications
A pre-emptive thread package was implemented which supports tracking of thread switching decisions on a uniprocessor. The thread package relies on the operating system to generate an alarm signal to emulate a timer. Thread scheduling decisions occur inside the timer interrupt handler. Additionally, the scheduler is aware of the instrumentation. Therefore, it takes special care not to restore an invalid value from a previously saved context. With the tracking support of the timer signal, the scheduling decisions can be repeated during execution replay to reproduce the same thread execution patterns. Thus, access to shared memory by threads can be repeated, lifting an important restriction of the piecewise deterministic execution model.
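The effect of replaying timer signals on thread interleaving can be illustrated with generator-based threads. This is a sketch under simplifying assumptions: the names are hypothetical, each `yield` marks one instruction boundary, scheduling is round-robin, and timer signals are modeled as a set of global instruction counts.

```python
def worker(name, shared, steps):
    """Toy thread: one shared-memory access per 'instruction'."""
    for _ in range(steps):
        shared.append(name)
        yield

def run_threads(switch_points, steps=4):
    """Round-robin scheduler that switches threads when a timer signal
    arrives, i.e. when the instruction count matches a switch point."""
    shared = []
    threads = [worker(name, shared, steps) for name in "ABC"]
    pending = set(switch_points)
    current, executed = 0, 0
    while threads:
        if executed in pending:              # timer signal: scheduling decision
            pending.remove(executed)
            current = (current + 1) % len(threads)
        try:
            next(threads[current])
            executed += 1
        except StopIteration:                # thread finished: drop it
            threads.pop(current)
            if threads:
                current %= len(threads)
    return shared

timer_log = [2, 5, 7]                        # instruction counts logged at each timer signal
before_failure = run_threads(timer_log)
during_replay = run_threads(timer_log)
mistimed = run_threads([3, 5, 7])            # first timer signal one instruction late
```

Replaying the logged timer points reproduces the order of shared-memory accesses exactly, while delivering even one timer signal an instruction late changes the interleaving.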
Evaluation and Experience
We measured the performance of several instrumented microbenchmarks and standard benchmarks. A full report is available elsewhere [31], and all results support the conclusions made here. We report here the performance of three long-running, compute-intensive programs from the SPEC95 benchmark suite. The three applications were instrumented as described in Section 2 for normal operation and replay. Additionally, we have instrumented the standard system libraries that are commonly linked with executables. The experimental environment consists of two systems. The first is a DEC Alpha 3000/400 workstation running OSF/1 V2 and equipped with a 133 MHz Alpha 21064 processor and 64 MBytes of memory. The second system is a personal computer running NetBSD 1.1 and equipped with a 90 MHz Intel Pentium processor and 64 MBytes of memory. Both experimental setups used the GNU compiler version 2.7.2.
All measurements were taken with the machines running in single-user mode with no interference due to other processes or network activity. We report these measurements and also describe our experience using this instrumentation to support rollback-recovery in a multithreaded, long-running simulation program. Note that the measurements do not reflect any checkpointing or message logging activity, and logs were flushed asynchronously to disk as in optimistic or causal logging systems [13]. Excluding overhead due to checkpointing or pessimistic logging is consistent with the purpose of these experiments, which is to identify the overhead due only to code instrumentation. By studying the instrumentation in isolation, we identify its costs independent of any particular message logging and checkpointing protocol. Otherwise, we would have to resort to a complicated analysis to isolate the effects of instrumentation from other overhead sources.
Evaluation Using SPEC 95 Benchmarks
We measured the effects of adding the instrumentation to three compute-intensive application programs from the SPEC95 benchmark suite. These programs are representative of long-running applications that are likely to benefit from rollback-recovery in practice. Each contains a substantial program that serves as a stress test for the instrumentation code. They were also used to dissect several aspects of the instrumentation, and therefore the results reported here show the typical overhead that real applications may experience. Furthermore, the wide availability of these programs should help others in the research community reproduce our results. The applications are:
go: A program that simulates the game of go.
compress: A file compression program similar to many popular compression utilities.
jpeg: An image compression program that complies with the JPEG standard.
DEC Alpha Evaluation
The overhead due to instrumentation in the DEC Alpha architecture consists of two components. The first is due to taking away two general purpose registers that otherwise are available to the application. With fewer registers to work with, the compiler generates additional "spill" code to load and store variables from and to main memory, decreasing performance. The second source of overhead is due to adding one instruction at the end of each basic block to decrement the register in normal operation, and to adding a subroutine call at the end of each basic block in the replay version.
During the initial phase of the implementation, we were particularly concerned about the effect of taking away two registers from the application program. Therefore, we quantified this effect by measuring the running time of each program with the two registers unavailable to the compiler, but without any instrumentation code. Then, we quantified the effect of instrumentation by measuring the running time with the instrumentation code enabled in both the normal operation and replay versions. Thus, there are three variations for each application as follows:
Replay: The application and libraries are compiled with 2 registers reserved and instrumented for replay operation.
Table 1 shows the execution times for each of these variations. Additionally, the first row in the table shows the measured running time for the unmodified SPEC95 benchmarks. The results show that the effect of instrumentation on the running time is acceptable for the normal operation version. For the three programs, the maximum reported overhead is 5.7% for the compress benchmark. The measurements also show that the effect of taking away registers from the application program is negligible. For the applications reported, the maximum overhead attributed to the reduction in the number of registers is 2.5%. The overhead of the instrumentation, however, is more pronounced for the replay versions, reaching 25% for compress. This figure is not surprising, given that adding a subroutine call to each basic block can have a serious performance impact on RISC processors. However, since this version runs only when a failure occurs and only until the state is restored to be consistent with the log and the system, such overhead is not a serious concern.
Table 2: Alpha compute-intensive benchmarks: program sizes.
The results show an anomaly for the benchmark go. The instrumented code for this benchmark runs faster than the uninstrumented version. Furthermore, the results for this benchmark show that taking away two registers speeds it up by about 1%. All of these anomalies are due to the sensitivity of this benchmark to cache behavior. By altering the executable code, the instruction cache may show better locality in some applications. These anomalies are not uncommon in modern RISC processors.
To conclude, the measurements show that the instrumentation code adds very low overhead during normal operation. These results attest to the viability of the proposed technique for tracking nondeterminism. Overhead for the replay version is higher, but it is acceptable to pay an extra overhead during infrequent replays in return for optimizing performance for normal operation. Finally, included for completeness, Table 2 shows the effects of the instrumentation on the size of the programs and Table 3 shows the number of backward branches, jumps and subroutine calls in the code.
Version go compress jpeg
3661 1038 3073
Table 4: Pentium compute-intensive benchmarks: running time.
Intel Pentium Evaluation
As explained in Section 2, there are three alternatives for implementing the instrumentation on the Intel architecture. All three were implemented to determine which gives the best performance: each benchmark was instrumented with each of the three proposed methods of counting instructions. Additionally, we wanted to isolate the overhead component due to instrumentation when the counter is in a register. Thus, each benchmark was also compiled with EBX reserved and no other modifications. The variations on the benchmarks are as follows:
1-Reg: The application and libraries are compiled with the EBX register reserved. No instrumentation code is added.
Chosen: The application and libraries are compiled with the EBX register reserved and the code instrumented for normal operation. Underflow of the branch counter is handled in a subroutine.
Mem: The application and libraries are instrumented such that the branch counter is stored in main memory.
Sub: The application and libraries are instrumented such that a branch to a subroutine is added to every basic block to decrement the branch counter and check for overflow. The branch counter is kept in main memory.
The first row also shows the run times for each benchmark with no modifications. The results attest to the superiority of the approach where a register is reserved for holding the branch counts. The overhead of the other two methods is almost unacceptable.
The results also show that while the penalty for logging interrupts on the Intel Pentium is reasonable, it is still higher than on the DEC Alpha. This result is due to three factors. First, there is the effect of taking one register out of the eight available to the compiler (versus taking away two registers out of thirty-one available on the Alpha). Second, adding five instructions to each basic block certainly adds considerable overhead. Third, the version of the Intel Pentium used in the experiment did not have a superscalar pipeline like the more recent versions do. Therefore, the instrumentation did not benefit from pipelining or out-of-order execution, which may reduce the overhead. The results also verify that replay has approximately the same penalty as normal operation, as it is the same code (results not shown for brevity).
Table 5 shows the effect instrumentation had on the size of the programs. As expected, there was a significant increase in the size of the binaries for the chosen instrumentation method and for the one that stored the branch counter in main memory. The third one, Sub, has a smaller increase because it did not include code to decrement a counter on every branch, just the instruction to call a subroutine to do it. Lastly, for completeness, Table 6 shows the number of backward branches in each benchmark.
Table 6: Pentium number of backward branches, jumps and subroutine calls.
Multithreaded Network Simulator
We used the thread package described in Section 3 to support a partial port of Rice's YACSIM public domain simulator [23]. YACSIM is a multithreaded, process-driven simulator that has proved useful in simulating networks and computer architectures. The implementation was used to simulate an Ethernet-style network bus. A thread was used to represent a simulation process (in this example, an Ethernet packet). On average, four threads were active simultaneously while the simulator was running. The measured running time for this program was 30.12 seconds on the DEC Alpha system. Instrumentation for normal operation reduced the running time to about 28.71 seconds on average (the program included 661 backward branches, jumps and subroutine calls). To check whether this anomaly was due to instrumentation, the running time for the simulation with two registers taken away was also measured. The time in this case was about 28.90 seconds, essentially the same as the instrumented version. These anomalies are again attributed to changes in the cache behavior of the processor. The replay version of the code ran for about 29.93 seconds, showing no substantial overhead compared to the normal operation version. These results again support the claim that the proposed technique is viable for tracking nondeterminism.
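The relative differences implied by these timings can be worked out directly. The short sketch below only does the arithmetic on the four running times quoted above; the function name relative_change_pct and the constant names are ours and purely illustrative.

```python
def relative_change_pct(baseline, measured):
    """Percentage change of `measured` relative to `baseline` (negative means faster)."""
    return (measured - baseline) / baseline * 100.0


# Running times in seconds, as reported for the YACSIM simulation on the DEC Alpha.
UNMODIFIED = 30.12
TWO_REGS_RESERVED = 28.90
NORMAL_OPERATION = 28.71
REPLAY = 29.93

if __name__ == "__main__":
    for label, seconds in [
        ("two registers reserved", TWO_REGS_RESERVED),
        ("instrumented, normal operation", NORMAL_OPERATION),
        ("instrumented, replay", REPLAY),
    ]:
        change = relative_change_pct(UNMODIFIED, seconds)
        print(f"{label}: {change:+.2f}% vs. unmodified")
    replay_vs_normal = relative_change_pct(NORMAL_OPERATION, REPLAY)
    print(f"replay vs. normal operation: {replay_vs_normal:+.2f}%")
```

With these numbers the instrumented run is about 4.7% faster than the unmodified program, and the replay run is about 4.2% slower than the instrumented normal-operation run.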
Comparisons to Related Work
Previous fault-tolerant systems used different techniques to handle nondeterminism. The Hypervisor system uses hardware instruction counters to support primary-backup replication [7]. It exploits the hardware instruction counter available in the HP PA-RISC architecture. A virtual machine layer is inserted beneath the operating system. This layer uses the hardware counter to count instructions between hardware interrupts. It also records information about each interrupt at the primary machine. This information is gathered for a specified period called an epoch, typically tens of microseconds. Then it is shipped over the network to the backup. The backup uses the instruction counts to imitate the effects of the hardware interrupts that occurred at the primary. The information about each interrupt is also available to reproduce the same state changes. Thus, the execution of the backup will be identical to that of the primary machine, only lagging by an epoch. The researchers in this project had to contend with various complexities related to nondeterministic cache line replacement and the nondeterministic behavior of some input devices. They report a performance overhead of about 20% on average.
Our solution differs from this system in two respects. First, we implement all support in software and on a per-application basis, compared to the hardware-supported, system-wide Hypervisor implementation. Second, our solution is optimized for rollback-recovery systems, where the normal mode of operation is made faster at the expense of possibly slower execution during playback. This design assumes that failures are not frequent and therefore the replay code will not be run very often. This solution, however, would not work for a primary-backup system because the backup would be unable to keep up with the primary in execution speed.
Other systems handled nondeterminism due to signals by converting them into synchronous messages that the application can receive at well-defined points. The Targon/32 system [6] relied on hardware for efficient logging of messages exchanged between processes. To support nondeterministic execution, the operating system was rewritten to transform all asynchronous events into synchronous messages. Such a solution requires a substantial implementation effort and is not easy to port to other platforms because of its reliance on special hardware. The Manetho system supported limited forms of nondeterminism that could be tracked efficiently by the operating system [11]. The support included logging and replaying the results of certain system calls, and tracking interprocess synchronization based on system-visible semaphores. This limited support was adequate on the V-System kernel, which does not support asynchronous event notification [10]. Additionally, the V kernel's small context switching time allowed the use of system-provided semaphores with an acceptable penalty. These limitations, however, could make this technique impractical in other systems. The Delta-4 system handled nondeterministic execution in active process replication [4, 9]. Their technique, certainly applicable to rollback-recovery, relied on modifying the application program to use polling. Interrupts are queued until the application program executes a polling routine, in which all replicas synchronize and agree on which interrupts were received and the order in which they are processed. Our technique differs from this approach in that it does not convert the asynchronous event notification model into a synchronous one, and does not require applications to perform periodic polling. On the other hand, the Delta-4 approach is easier to implement and maintain than ours, and it is architecture and compiler independent.
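The polling model described for Delta-4 can be illustrated with a small sketch. It is not the Delta-4 implementation: the class Replica, the function poll, and the agreement rule (deliver only interrupts that every replica has queued, sorted by name) are assumptions made here for illustration, and interrupt names are assumed to be unique.

```python
class Replica:
    """One replica of the application (hypothetical model)."""

    def __init__(self):
        self.pending = []   # interrupts queued on arrival, not yet handled
        self.handled = []   # interrupts delivered to the application, in order

    def interrupt(self, name):
        # Asynchronous arrival: the interrupt is only queued.
        self.pending.append(name)


def poll(replicas):
    """Polling point: all replicas deliver the same interrupts in the same order."""
    common = set(replicas[0].pending)
    for replica in replicas[1:]:
        common &= set(replica.pending)
    agreed = sorted(common)  # assumed deterministic ordering rule
    for replica in replicas:
        replica.handled.extend(agreed)
        replica.pending = [n for n in replica.pending if n not in common]
    return agreed
```

An interrupt that has reached only some replicas stays queued until a later polling point, so every replica handles the same sequence even though arrival order differed.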
Systems that support nondeterminism due to thread interactions supply their own sets of locking primitives, and require applications to use them for protecting access to shared memory [16]. The primitives are instrumented to insert an entry in the log identifying the calling thread and the nature of the synchronization operation [16]. However, this technique has several problems. It makes shared memory access expensive, and may generate a large volume of data in the log. Furthermore, if the application does not adhere to the synchronization model (due to a programmer's error, for instance), execution replay may not be possible. The solution presented in this paper works with a user-level thread package using signals to initiate a context switch. It does not rely on special synchronization primitives and thus requires less overhead and eliminates the possibility of the application programmer erring when using such primitives.
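For contrast, the instrumented-primitive approach used by those other systems can be sketched as follows. This is a minimal illustration, not the code of any cited system: LoggedLock, run_workers, and acquire_order are hypothetical names, and the replay rule (a thread may acquire only when it is next in the recorded order) is our assumption about how such a log would be enforced.

```python
import threading


class LoggedLock:
    """Mutex that logs every operation and can re-impose a recorded order."""

    def __init__(self, replay_order=None):
        self._lock = threading.Lock()
        self._turn = threading.Condition()
        self.log = []  # (thread id, operation) entries, appended while holding the lock
        # In replay mode: the recorded sequence of acquiring thread ids.
        self._order = list(replay_order) if replay_order is not None else None

    def acquire(self, tid):
        if self._order is not None:
            with self._turn:
                # Block until this thread is the next recorded acquirer.
                self._turn.wait_for(
                    lambda: bool(self._order) and self._order[0] == tid
                )
        self._lock.acquire()
        self.log.append((tid, "acquire"))

    def release(self, tid):
        self.log.append((tid, "release"))
        if self._order is not None:
            with self._turn:
                self._order.pop(0)
                self._turn.notify_all()
        self._lock.release()


def run_workers(lock, tids=("A", "B"), rounds=3):
    """Each thread appends its id to shared memory `rounds` times under the lock."""
    shared = []

    def worker(tid):
        for _ in range(rounds):
            lock.acquire(tid)
            shared.append(tid)
            lock.release(tid)

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in tids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared


def acquire_order(log):
    """Extract the sequence of acquiring thread ids from a recorded log."""
    return [tid for tid, op in log if op == "acquire"]
```

Every protected access costs two log entries here, which illustrates the log volume concern, and an access made outside the lock would not appear in the log at all.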
Conclusions
The applicability of the piecewise deterministic execution model in practice has long been hampered by the inability to efficiently log nondeterminism due to software signals or shared memory accesses in multithreaded applications. We presented a technique for efficient logging of nondeterminism due to software signals that are applied at arbitrary execution points in a log-based rollback-recovery protocol. The technique relies on emulating instruction counters in software. The counter gives an accurate number of the instructions that are executed between interrupts. The counter also allows controlled execution during replay, so that the same number of instructions is executed between interrupts as during normal operation. This technique was implemented with special care to optimize performance during normal operation, albeit at the expense of slower execution during replay on one of the supported architectures. We then used the implementation to support repeatable execution in multithreaded applications where nondeterminism arises due to interleaved shared memory access among threads. Thus, using a single technique we were able to lift two serious limitations on the piecewise deterministic execution model. The price of this instrumentation has been studied on two systems using the SPEC95 benchmarks and a multithreaded application. The performance study showed that this support has a reasonable cost on modern architectures.
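The counter-based logging and replay summarized above can be illustrated with a toy model. This is a simplified sketch under our own assumptions, not the instrumentation described in the paper: the Process class, the record and replay functions, and the use of a Python counter bumped once per simulated basic block are all hypothetical stand-ins for the compiled-in counting code.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process whose only nondeterminism is the point of signal delivery."""
    state: int = 0
    branches: int = 0                       # software counter, one tick per basic block
    log: list = field(default_factory=list)

    def basic_block(self):
        self.state = (self.state * 31 + 7) % 1_000_003  # deterministic work
        self.branches += 1                  # stands in for the inserted counting code

    def handle_signal(self, signo):
        self.state ^= signo                 # the handler perturbs the state


def record(steps, arrivals):
    """Normal operation. `arrivals` maps a counter value to the signal arriving then."""
    p = Process()
    for _ in range(steps):
        if p.branches in arrivals:
            signo = arrivals[p.branches]
            p.log.append((p.branches, signo))  # log where the signal was applied
            p.handle_signal(signo)
        p.basic_block()
    return p.state, p.log


def replay(steps, log):
    """Recovery: deliver each logged signal when the counter reaches the logged value."""
    p = Process()
    pending = list(log)
    for _ in range(steps):
        while pending and pending[0][0] == p.branches:
            _, signo = pending.pop(0)
            p.handle_signal(signo)
        p.basic_block()
    return p.state
```

Replaying with the recorded log reproduces the final state of the original run, because each signal is applied after the same number of counted blocks.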
