Assembly instruction level reverse execution provides a programmer with the ability to return a program to a previous state in its execution history via execution of a "reverse program." The ability to execute a program in reverse is advantageous for shortening software development time. Conventional techniques for recovering a state rely on saving the state into a record before the state is destroyed. However, state-saving causes significant memory and time overheads during forward execution.
This work was supported by the State of Georgia under the Yamacraw initiative and by the National Science Foundation (NSF) under grants INT-9973120, CCR-9984808, and CCR-0082164.
A preliminary version of this article appeared in Proceedings of the ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering 2002 (PASTE'02).
Authors' address: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250; email: {tankut,mooney}@cse.gatech.edu.
© 2004 ACM 1049-331X/04/0400-0149 $5.00
INTRODUCTION
As human beings are quite prone to making mistakes, it is very difficult for a programmer to write an error-free program without going through a debugging cycle. For this reason, debugging is an important and inevitable part of software development.
A typical debugging cycle that many programmers go through is shown in Figure 1. Unfortunately, many of the bugs in programs do not cause errors immediately; instead, they show their effects much later in program execution. For this reason, even the most careful programmer, equipped with a state-of-the-art debugger, might well miss the first occurrence of a bug and thus might not be able to determine the bug location immediately after an error occurs. In this case, the programmer might have to restart the program as shown in Figure 1. Once the programmer has determined the bug location, the programmer can attempt to remove the bug from the program and then recompile the program for further verification.
For difficult-to-find bugs, the programmer might have to go through the above steps (i.e., the loop shown in Figure 1) multiple times until a program which does not exhibit any errors is obtained. However, every time a restart occurs, parts of the program that already executed without errors have to be reexecuted unnecessarily. These unnecessary reexecutions may constitute a significant portion of the debugging time.
Reverse execution provides the programmer with the ability to return to a particular previous state in program execution. When the programmer misses a bug location by executing a program too far, he or she can roll back the program to a point where the program state is known to be correct and then reexecute from that point without having to restart the program. This potentially reduces the overall debugging time significantly.
Traditional methods for bringing a program to a previous state in execution history heavily employ state-saving. In state-saving, program states are saved into a record during forward execution of a program. Then, a particular previous state is simply restored from the saved record. State-saving can be performed periodically, where the program state is saved at certain time intervals, and/or incrementally, where only the modified program state is saved instead of the whole program state. However, in whichever way it is performed, state-saving requires extra memory space to save the program state and also results in a slower forward execution due to the time spent saving the states.
This article presents a new approach that, for the first time ever (to the best of the authors' knowledge), achieves reverse execution at the assembly instruction level on general purpose processors via execution of a "reverse program." A reverse program almost always regenerates destroyed states rather than restoring them from a record, and provides assembly instruction by assembly instruction execution in the backward direction. This significantly reduces state-saving and thus decreases the associated memory and time costs of reverse execution. The following example introduces the notion of a reverse program used in our methodology.
Figure 2(d) shows the original program instrumented with state-saving instructions by a typical state-saving technique to provide reverse execution over each assembly instruction ("save 0" and "save 1" are used to store the outcome of the conditional branch in Figure 2(d)). As seen in Figure 2(d), the use of a typical state-saving technique for reverse executing this sample program results in more than twice the number of state-saving instructions generated by the technique presented in this article. This much larger number of state-saving instructions results in significantly larger memory and forward execution time overheads.
The outline for the rest of this article is as follows. Section 2 explains the main challenges of reverse execution and states the motivation behind this research. Section 3 presents the related work. Section 4 states our assumptions, and Section 5 gives an outline of our approach. Then, we explain the three major steps of our technique in three consecutive sections: Section 6 explains the program partitioning process, Section 7 describes how an assembly instruction is reversed, and Section 8 explains how we generate a complete reverse program from individual reverse instructions. Next, Section 9 describes special details about how loops and memory references are handled. After explaining the various details, Section 10 gives a summary of the whole technique presented. Section 11 discusses the experimental results. Finally, Section 12 concludes the article.
Note that in the rest of this article, the word "instruction" is used exclusively to refer to an assembly instruction.
BACKGROUND AND MOTIVATION
An execution of a program T on a processor P can be represented by a transition among a series of processor states S = (S_0, S_1, S_2, ...). From this representation, instruction-level reverse execution of a program can be defined as follows:
Definition 2.1 (Instruction Level Reverse Execution). Reverse execution of a program T running on a processor P can be defined as taking P from its current state S_i to a previous state S_j (0 ≤ j < i) by executing a set of instructions that reverses the effect of instructions in T. The closest achievable distance between S_i and S_j without any forward execution determines the granularity of reverse execution. If state S_j is allowed to be as early as one instruction before state S_i, then the reverse execution is said to be instruction-level reverse execution.
The simplest approach for obtaining a previously attained state is saving the state before it is destroyed. However, saving a state during execution of a program introduces two overheads: memory and time. A solution to reduce memory and time overheads would be to decrease the frequency of state-saving during program execution. However, this prevents an immediate return (i.e., a return without any forward execution) to an arbitrary point in execution history where state is not saved. Therefore, in applying state-saving, there usually exists a trade-off between the closest previous state that can be restored without any forward execution and memory/time overheads due to state-saving.
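The trade-off described above can be made concrete with a minimal Python sketch of periodic state-saving. The names `step`, `run_with_checkpoints`, and `restore`, as well as the toy update rule, are hypothetical and serve only as an illustration; they are not part of the technique presented in this article.

```python
def step(state):
    """One hypothetical forward 'instruction' acting on a single integer state."""
    return (state * 3 + 1) % 17


def run_with_checkpoints(initial, n_steps, interval):
    """Run forward n_steps, saving the state only every `interval` steps."""
    checkpoints = {0: initial}
    state = initial
    for i in range(1, n_steps + 1):
        state = step(state)
        if i % interval == 0:
            checkpoints[i] = state
    return state, checkpoints


def restore(checkpoints, interval, j):
    """Return state S_j and the number of forward steps replayed to reach it."""
    base = (j // interval) * interval   # nearest checkpoint at or before j
    state = checkpoints[base]
    replayed = j - base
    for _ in range(replayed):
        state = step(state)
    return state, replayed
```

With an interval of 1, every state is restored without any forward execution, but one record is stored per instruction. With a larger interval, fewer records are stored, but reaching a state between checkpoints requires replaying some instructions in the forward direction.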
Performance and memory constraints or lack of compiler support usually force assembly language programming of some software components such as small-scale embedded applications, firmware for consumer electronics, DSP libraries, and operating system modules like schedulers, high-performance I/O routines, or device drivers. For instance, the majority of boot code for the computer system of the Pathfinder Spacecraft was written in assembly language because it was critical for the computer to boot up very quickly in case of a failure [Stolper 1997]. Therefore, during debugging of such software components, programmers have to be involved in instruction-level program execution. Furthermore, in implementing a language construct such as a pointer to an integer, the compiler sometimes generates assembly different from what the programmer expected. For these reasons, most software debugger tools contain assembly-level execution views. Thus, reverse execution at the instruction-level granularity turns out to be very helpful when debugging these sorts of software components.
During debugging of programs written either in a high-level or in a low-level programming language, programmers typically use a single-stepping facility to locate bugs. It is not uncommon for programmers to miss a bug location by executing just one more step over the next statement or instruction in the program. In such a case, instruction-level reverse execution provides an extremely fast backup capability.
However, due to the trade-off between memory/time overheads and the closest previous state that can be restored, providing instruction-level state recovery by state-saving can translate into very high memory and time overheads during program execution. Therefore, our goal is to achieve reverse execution at the native instruction level with low memory and time overheads, which will open the way for addition of a missing feature, instruction-level reverse execution, to state-of-the-art debuggers.
RELATED WORK
The problem of how to acquire previously destroyed program states has been researched in several contexts. Previous work in this area can be divided into two different categories. The first category is the application of pure state-saving approaches to restore earlier states in program execution. The second category is reverse execution by a combination of state-saving and state-regeneration techniques. State regeneration introduced in the second category reproduces previously destroyed values and thus achieves state recovery without state-saving. In this section, we provide a summary of previous work in each category mentioned above.
Previous Work in Restoring an Earlier State
Zelkowitz [1971] provides a state restoration capability by inserting trace statements into the programming language. Each trace statement includes an option that indicates either a condition or a label. Program state is captured starting from a trace statement until the condition indicated by the trace statement is satisfied or until the label indicated by the trace statement is reached. However, the programmer has to anticipate which parts of the program he or she might have to reexecute and then has to insert trace statements in those program parts beforehand.
Agrawal et al. [1991] provide a statement-level state restoration capability for programs written in a high-level programming language. Agrawal et al. [1991] statically associate each assignment statement with a set of variables, called a change-set, which is modified by that statement. Then, during execution, the associated variables in the change-set are recorded (saved to memory) for rollback. However, although this approach provides a statement-level state restoration capability, it might cause large memory and time overheads during program execution, especially with programs that modify the state frequently.
State restoration is also applied in so-called replay techniques for efficient debugging of programs by making use of either hardware [Bacon and Goldstein 1991; Sosic 1994] or software [Feldman and Brown 1988; Maruyama and Terada 2000; Miller and Choi 1988; Netzer and Weaver 1994; Pan and Linton 1988]. In the replay technique, the state of a program is first saved infrequently during execution of the program, and then the program state that is not saved is reconstructed by replaying the program in the forward direction. In hardware approaches, state-saving is handled by hardware with little or no performance overhead, but with inflexibility and high cost. On the other hand, in software approaches, state-saving is handled by software with flexibility and low cost, but with high performance overhead. A typical drawback of these replay techniques is that since the recorded trace keeps only partial information about program state, execution can be restarted only at the beginning of a trace window where state is saved, and not at an arbitrary program point. Trace windows may be fixed [Feldman and Brown 1988; Netzer and Weaver 1994] or variable [Miller and Choi 1988; Netzer and Weaver 1994] in length, and their sizes may either be determined by the programmer [Netzer and Weaver 1994] or by a program analysis [Miller and Choi 1988; Netzer and Weaver 1994]. For instance, in Miller and Choi's [1988] approach, trace windows called emulation blocks are usually chosen to be program regions having well-defined entry points. One such region is a subroutine. Generally, larger trace windows result in smaller traces but longer replay times.
Two other application areas of state restoration are optimistic or speculative computation [Fujimoto 1989; Gomes 1996; Jefferson 1985] and fault tolerance [Chandy and Ramamoorthy 1972; Lee and Shin 1984]. A computation is optimistic if incorrect computation is allowed during execution. Tasks executing in parallel usually have to block due to synchronization requirements on shared data. In optimistic parallel execution, tasks do not block on shared data and thus are allowed to execute independently, which potentially improves the execution performance but at the same time allows incorrect computation. Then, errors caused by possible incorrect computations are recovered by rolling back the computation of erroneous tasks to a point in time where state is known to be correct. Similarly, state restoration for fault tolerance is performed by rolling back in case software errors occur, which is usually seen in places such as database transaction systems [Bhargava 1999; Gray and Reuter 1993].
Rolling back computations or transactions is usually achieved by periodic or incremental state-saving. In periodic state-saving [Fleischmann and Wilsey 1995], the whole processor state is recorded periodically at certain checkpoints during simulation. Then, a previous state at a checkpoint can be recovered by restoring that state from the record. However, in this method, a previous state at an arbitrary point that is not a checkpoint cannot be immediately recovered. If the checkpointing interval is reduced, memory and time overheads of state-saving are increased. Moreover, recording the whole processor state at each checkpoint causes redundancy because some portion of the processor state may be kept unchanged throughout several checkpoints. In incremental state-saving (ISS) [West and Panesar 1996], instead of recording the whole processor state, only the modified parts of a state are recorded at each checkpoint. However, in programs where the modified state space is large, memory and time overheads of incremental state-saving might again exceed affordable limits.
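As an illustration of incremental state-saving, the following Python sketch logs only the overwritten value of each modified location. The class `IncrementalLog` and its methods are hypothetical names chosen for this example and do not reproduce any of the cited implementations.

```python
_ABSENT = object()  # marks a location that did not exist before the write


class IncrementalLog:
    """Memory model that saves only the values each write destroys."""

    def __init__(self, memory):
        self.memory = memory   # dict: location -> value
        self.log = []          # stack of (location, old value) pairs

    def write(self, loc, value):
        # Save the value about to be destroyed, then overwrite it.
        self.log.append((loc, self.memory.get(loc, _ABSENT)))
        self.memory[loc] = value

    def undo(self, n=1):
        # Restore the n most recent writes, newest first.
        for _ in range(n):
            loc, old = self.log.pop()
            if old is _ABSENT:
                del self.memory[loc]
            else:
                self.memory[loc] = old
```

The log grows by one entry per write, which is the overhead referred to above for programs that modify a large portion of their state.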
State restoration is also used in computer science education where students can easily navigate back and forth through well-known algorithms to understand the behavior of such algorithms. For this purpose, the common technique applied is program animation [Birch et al. 1995; Crescenzi et al. 2000]. Program animation constructs a virtual machine with a reversible set of instructions. Since these instructions are reversible, the program can be run backwards. However, since reversible instructions are usually constructed as stack operations, a significant amount of stack memory may be required in program animation.
Cook [2002] presents a state restoration method for Java programs. A Java virtual machine is another example of a stack-based virtual machine. Most Java bytecode instructions pop the operands from the operand stack, operate on them, and then push the resulting operands back into the operand stack. In his method, Cook keeps two main circular buffers for keeping the program state. The first buffer, the program counter circular buffer, keeps the program counter values. The second buffer, called the logging circular buffer, keeps the values in the operand stack that are destroyed as a result of a bytecode instruction. Then, Cook associates a reverse operation with each bytecode instruction. Basically, during a reverse operation, the program counter is restored from the program counter circular buffer while the values in the operand stack are restored from the logging circular buffer. By using circular buffers, Cook bounds the memory requirement which otherwise would be huge due to logging of each modified operand. However, this also bounds the total number of bytecode instructions that can be reverse executed at a time and requires a run-time validation of whether there is enough data within a circular buffer for accomplishing a reverse execution request.
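The effect of bounding the log with circular buffers can be sketched as follows. This is a simplified Python illustration under our own assumptions (a toy stack machine with only `push` and `add`, and hypothetical names such as `TinyStackVM`); it is not Cook's implementation.

```python
from collections import deque


class TinyStackVM:
    """Toy stack machine whose reverse history lives in two bounded buffers."""

    def __init__(self, capacity):
        self.pc = 0
        self.stack = []
        self.pc_log = deque(maxlen=capacity)      # program counter buffer
        self.value_log = deque(maxlen=capacity)   # destroyed operand values

    def _record(self, destroyed):
        self.pc_log.append(self.pc)
        self.value_log.append(destroyed)
        self.pc += 1

    def push(self, value):
        self._record(destroyed=())
        self.stack.append(value)

    def add(self):
        b = self.stack.pop()
        a = self.stack.pop()
        self._record(destroyed=(a, b))
        self.stack.append(a + b)

    def can_reverse(self, n=1):
        # Run-time validation: is enough history left in the buffers?
        return len(self.pc_log) >= n

    def reverse(self):
        if not self.can_reverse():
            raise RuntimeError("reverse history exhausted")
        self.stack.pop()                          # each op pushed one result
        self.stack.extend(self.value_log.pop())   # restore destroyed operands
        self.pc = self.pc_log.pop()
```

With a capacity of 2, three forward operations leave only the two most recent ones reversible; the oldest entry has been overwritten, so a third reverse request must be rejected.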
Previous Work in State Regeneration at the Source Code Level
Floyd [1967] uses state regeneration in the area of nondeterministic algorithms. A nondeterministic algorithm is an algorithm that may come up with a different solution to a problem at each run of the algorithm. However, the solution is not reached by a random process, but by incrementally constructing a path that leads to a success. In Floyd's approach, whenever a nondeterministic algorithm enters a path leading to a dead end, the algorithm state at the most recent point where a decision is made is restored and alternative solutions are sought from that point on.
Floyd achieves reverse execution by defining a reverse operation for each operation in a nondeterministic algorithm. However, a reverse operation without state-saving can only be generated for reversible constructive operations. A constructive operation is an operation where the variable being modified (the target operand) is the same as one of the source operands. The operation "x = x + 1" is an example of a constructive operation. On the other hand, a constructive operation is reversible only if there exists an operation that can fully recover the prior value of the target operand of the constructive operation. For instance, the prior value of the target operand of "x = x ⊕ 1" can be fully recovered by executing the same operation once more. Therefore, "x = x ⊕ 1" is reversible. On the other hand, although "x = x / 2" is constructive, this operation might not be reversible for the case where the target operand x cannot always be fully recovered due to a possible precision loss in the computed result.
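The distinction between reversible and irreversible constructive operations can be checked mechanically on a small domain. The Python sketch below is only an illustration; the helper `is_reversible` is a hypothetical name, and "x = x / 2" is modeled as integer division, which is our assumption about the source of precision loss.

```python
def is_reversible(forward, inverse, domain):
    """True if `inverse` recovers every value in `domain` after `forward`."""
    return all(inverse(forward(x)) == x for x in domain)


DOMAIN = range(-8, 8)


def increment(x):   # x = x + 1
    return x + 1


def decrement(x):   # candidate reverse of increment
    return x - 1


def toggle(x):      # x = x XOR 1, which is its own reverse
    return x ^ 1


def halve(x):       # x = x / 2, modeled as integer division
    return x // 2


def double(x):      # candidate reverse of halve
    return x * 2
```

Here `is_reversible(increment, decrement, DOMAIN)` and `is_reversible(toggle, toggle, DOMAIN)` hold, whereas `is_reversible(halve, double, DOMAIN)` fails for every odd value because the discarded low-order bit cannot be regenerated.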
State regeneration finds its application in a limited sense in the area of debugging optimized code as well [Adl-Tabatabai and Gross 1993; Hennessy 1982; Wismuller 1994]. Hennessy [1982] introduces the term "currency" of a variable. A variable is current at a program point if the value of the variable at that program point is the same as the variable's expected value according to the source code. Since code optimizations, such as code motion and dead variable elimination, may move or remove assignments to variables in the object code, the value of a variable at a certain point in the optimized code may not be equal to the value of the variable at the corresponding point in the unoptimized code, which causes the variable to be "noncurrent" at that program point. In such a case, the current value of the variable has to be recovered to provide the user with a consistent view of the program being debugged. This recovery operation is where reverse execution comes into play.
A typical recovery technique in this field is to reevaluate noncurrent variables using appropriate definitions of those variables in the program. However, since the main focus in this area has been on the determination of whether a variable at a program point is current or not rather than on the recovery of a noncurrent variable, the recovery techniques applied in this area are generally very restrictive and ineffective. For instance, Wismuller [1994] reports that only 2-5% of all encountered noncurrent variables can be recovered in his benchmarks.
Carothers et al. [1999] introduce another approach for optimistic parallel simulations. This approach is source transformation. In source transformation, the source code (e.g., in C) is transformed to a reversible source code version. Then, from the reversible version a reverse source code is obtained. The reverse source code reverses the effect of the original source code. However, similar to Floyd's [1967] approach, Carothers et al. [1999] also apply state-saving for all operations except reversible constructive ones. Consequently, the execution time and memory requirements of the reverse code are increased.
Biswas and Mall [1999] also generate a reverse program from a program given in the C programming language. For constructive operations in C which use operators such as "*=" or "+=", they generate reverse statements that use the inverse operators. Thus, for instance, "*=" becomes "/=". However, similar to other approaches presented in this section, these constructive operations are the only cases where reverse execution is performed without state-saving. Moreover, in cases where the underlying processor may truncate the result (such as an overflow condition), the correctness of the reverse execution is indeterminate. For all other operations, Biswas and Mall [1999] keep a trace file that holds the necessary state to reverse execute a C program.
In their article, Biswas and Mall [1999] state (referring to nonconstructive assignment statements in the form of a = b op c), "Thus, it is not possible to define an inverse for this [sic] type [sic] of assignment statements and the old value of the variable must be maintained in a trace file." In this article, we show that even destructive assignment statements are indeed reversible without state-saving.
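The truncation problem noted above for inverse operators can be illustrated with a short Python sketch. The 8-bit register width and the helper names `mul_assign` and `div_assign` are assumptions made for this example only.

```python
MASK = 0xFF  # assumed 8-bit register width


def mul_assign(x, k):
    """Model of 'x *= k' on a register that truncates to 8 bits."""
    return (x * k) & MASK


def div_assign(x, k):
    """Model of the inverse statement 'x /= k' on the same register."""
    return x // k
```

Starting from 100, `mul_assign(100, 2)` yields 200 and `div_assign(200, 2)` recovers 100. Starting from 200, the product 400 is truncated to 144, and the inverse operator returns 72 instead of the original value, so the inverse statement alone does not reverse the operation once an overflow has occurred.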
Previous Work in State Regeneration at the Assembly Instruction Level
We have made an exhaustive literature survey searching for a software approach that achieves reverse execution at the assembly instruction level through the generation of a reverse program. However, we could not find any such work. Therefore, this article might be the first work to achieve reverse execution at the assembly instruction level by almost always regenerating state via a reverse program.
THE PROPOSED APPROACH AND PRELIMINARY ASSUMPTIONS
Our approach is mainly based on regenerating a previously destroyed state rather than restoring that state from a record. When state regeneration is not possible, however, we recover a destroyed state by state-saving. In this section, we briefly describe how we achieve state regeneration as well as state restoration based on our assumptions. We achieve both state regeneration and state restoration with the help of a reverse program. Given a program T, either written directly in assembly or compiled into assembly from a high-level programming language, we generate a reverse program RT from T by a static analysis at the assembly instruction level. We call the algorithm that we use to generate a reverse program the reverse code generation (RCG) algorithm.
Suppose that a program T attains a series of states, S = (S_0, S_1, S_2, ...), during the execution of T on a processor P where the distance between two consecutive states is one instruction. Now, assume that we can generate another program RT, the instruction-level reverse program of T, such that when a specific portion of RT is executed in place of T when the current state of T is S_i, the state of T can be brought to a previous state, S_j (0 ≤ j < i). In other words, RT recovers a previously destroyed state. If RT contains an executable portion for changing the state of P from any state, S_i ∈ S, to any other previous state, S_j ∈ S (j < i), for any possible state sequence S during execution of T, then the execution of T can be reversed by executing RT in place of T.
To build a complete reverse program, the RCG algorithm uses pure static information. However, in order to reverse execute a program, dynamic information may be required as well. This dynamic information mainly consists of the values that cannot be recovered by state regeneration and thus that are saved dynamically. The statically generated instructions inside the reverse program use this dynamic information to undo the original instructions. Therefore, in case of state restoration, RT simply issues memory load instructions that restore the values already saved by state-saving instructions inserted into T. In the case of state regeneration, RT recovers a state without any state-saving code in T.
However, in practice, it might be hard to implement such a program RT which recovers 100% of the program state either by state regeneration or by state restoration due to the following reasons.
(1) Typically, processors include auxiliary hardware usually not accessible by the instructions directly. The processors usually manipulate this kind of hardware implicitly. Therefore, it is typically hard to recover indirectly modified state in this kind of hardware. As an example, consider the overflow register of a processor. The overflow register is written indirectly by an operation such as "c = a + b", if an overflow occurs during such an operation. However, many processors do not specify an instruction to directly write to the overflow register, which makes it hard to recover the overflow register.
(2) Generally, writing a value to the program counter either by a branch instruction or by direct modification causes an immediate jump to the address location designated by the value written to the program counter. Therefore, as soon as RT were to recover the program counter, the execution of RT would immediately be broken. This suggests that the program counter should be recovered only at the end of the execution of a specific portion of RT, and just before the user switches back to forward execution. However, since it is not known a priori what program part the user will reverse execute (i.e., which portion of RT the user will run), it is impractical to recover the program counter inside RT.
(3) If an instruction modifies a memory location, the instruction encoding only tells us the modified address but not the physical location actually being modified in the memory hierarchy (i.e., L1 cache, L2 cache or main memory). Without the knowledge of the physical location actually being modified, it is typically hard to recover the exact physical memory state (including exact cache state).
Therefore, regarding item (1) above, we define a program state S = (M, R) that includes only directly modified memory (M) and directly modified register (R) values (i.e., M and R only include the memory locations and registers that appear as operands of the instructions of T). Assuming that we can generate an instruction-level reverse program RT for a program T, we can recover all memory and register values that are directly modified by T. On the other hand, we handle the recovery of indirectly modified memory/register values that have an effect on T's state with the help of the debugger tool.
For more information about recovering indirectly modified memory/register values, please refer to Akgul and Mooney [2002a] .
Regarding item (2) above, since the program counter value carries important debugging information, we must provide a means for restoring the program counter value. We solve this problem by leaving the recovery of the program counter value to the debugger tool. We associate the address of each instruction in T with the beginning address of the corresponding portion in RT that reverses the effect of that instruction. In this way, when a part of T is reverse executed by executing the corresponding portion in RT, the debugger tool restores the value of the program counter by using the connection between the addresses in T and RT.
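The association between addresses in T and RT can be pictured as a lookup table consulted by the debugger. The Python sketch below is a simplified illustration under strong assumptions: all addresses are made up, the function `reverse_step` is a hypothetical name, and the code is assumed to be straight-line so that the previously executed instruction is the one at the next lower address.

```python
# Hypothetical table: address of an instruction in T -> start address of the
# portion of RT that reverses the effect of that instruction.
T_TO_RT = {0x100: 0x908, 0x104: 0x904, 0x108: 0x900}


def reverse_step(pc, t_to_rt):
    """Given the current program counter in T, return the RT entry point to
    execute and the program counter value the debugger restores afterwards.

    Assumes straight-line code: the instruction to undo is the one with the
    highest address below `pc`.
    """
    earlier = [addr for addr in t_to_rt if addr < pc]
    if not earlier:
        raise ValueError("no earlier instruction to undo")
    undone = max(earlier)
    return t_to_rt[undone], undone
```

In this sketch the reverse program itself never writes the program counter; the debugger sets it to the returned address only after the selected portion of RT has finished executing.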
Finally, regarding item (3) above, we treat memory as a unified abstract entity which keeps the values of high-level program variables. In other words, as long as the destroyed values of a variable can be retrieved, we do not distinguish between whether the variable is actually kept in processor cache or main memory. Consequently, undoing a memory store operation on a program variable only comprises recovering the previous value of the program variable but not the exact previous state of the processor cache or main memory. For instance, a variable may originally reside in main memory but not in the L1 cache. However, after the variable is destroyed and subsequently recovered, the variable might be brought into the L1 cache. Therefore, this process recovers the value of the variable, but does not restore the original state of the L1 cache.
In addition to the three items above, there are also other assumptions we make. First, we address single-threaded programs that run on a single processor only. Second, the RCG algorithm is not applicable to programs that are self-modifying or that change base addresses of program sections (such as the global data section or the stack) dynamically. Third, we assume that in case of an exception, the exception handler saves the program context such as the program counter and the register state just before the exception. Therefore, exceptions can be reversed by recovering the saved program context. Finally, we assume that the external inputs/outputs to/from a program (such as file I/O) are recovered by state-saving.
Having discussed some of our initial assumptions and goals, we next illustrate the highlights of the RCG algorithm.
OVERVIEW OF THE RCG ALGORITHM
The RCG algorithm can be divided into three main steps. These steps are program partitioning, reversing the assembly instructions, and combining the reverse instructions. In the following three subsections, we outline these three steps with the help of an example program. Then, in Sections 6 to 8, we will explain each step in detail.
Program Partitioning
Many programming-language-related algorithms require a certain analysis which is used to extract information from the subject program. The same is true of the RCG algorithm. When analyzing a program, it is essential to first select the program level at which to perform the analysis. Is the analysis to be carried out on the whole program at once or on individual functions separately, or on even smaller regions, such as basic blocks, one at a time?
It is usually difficult to perform control-flow analysis across indirect calls because an indirect call may be made to a statically unknown address. Therefore, in order to simplify the RCG algorithm, we prefer to restrict the control-flow analysis to inside certain regions of code in which control flow can be statically determined. We name these regions program partitions (PPs). Across PPs, however, control-flow information is dynamically recorded via a state-saving technique which will be explained in Section 8.3.2.
For most instruction sets (e.g., PowerPC, x86 and ARM), PPs are delimited by indirect branches or "function call" instructions that may exist in the code. An indirect branch delimits a PP because an indirect branch instruction may direct the control to a statically unknown address. A function call instruction, on the other hand, delimits a PP because the control returns to the address following the function call instruction from the called function, whose address may be statically unknown. In the PowerPC instruction set, the "bl" (branch with link register update) and "blr" (branch to link register) instructions are examples of function call and indirect branch instructions, respectively.
In Figure 3(b), the portion of main that starts from the first instruction of main and extends until the function call instruction in main is a PP inside main. Similarly, the remaining portion of main is another PP. On the other hand, since foo does not contain any function call/indirect branch instructions in the middle, the whole body of foo is a PP by itself.
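The delimiting rule can be sketched in a few lines of Python. This is an illustration under our own assumptions rather than the partitioning procedure of Section 6: instructions are plain strings, the delimiter set contains only the two PowerPC mnemonics named above, a delimiting instruction is taken to close the partition in which it appears, and the sample instruction sequences are invented rather than taken from Figure 3.

```python
# Mnemonics assumed to end a program partition (function call, indirect branch).
DELIMITERS = {"bl", "blr"}


def partition(instructions):
    """Split one function's instruction list into program partitions (PPs)."""
    partitions, current = [], []
    for ins in instructions:
        current.append(ins)
        mnemonic = ins.split()[0]
        if mnemonic in DELIMITERS:
            partitions.append(current)   # the delimiter closes this PP
            current = []
    if current:
        partitions.append(current)       # trailing instructions form a PP
    return partitions


# Invented sample code in the spirit of the example: main calls foo once.
MAIN = ["li r3,1", "bl foo", "addi r3,r3,2", "blr"]
FOO = ["add r3,r3,r3", "blr"]
```

For these samples, `partition(MAIN)` yields two PPs, split at the function call, and `partition(FOO)` yields a single PP covering the whole function body.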
Reversing an Assembly Instruction
In Section 4, we stated that the reverse program recovers only directly modified program state. Therefore, under this assumption, reversing an assembly instruction is equivalent to recovering the register and/or the memory value(s) being directly overwritten by that instruction. In this section, we outline how the RCG algorithm recovers a directly modified value.
After the PPs are determined, the RCG algorithm goes over every instruction within all PPs in program order (lexical order) and checks whether each instruction directly modifies a register or a memory location. If an instruction directly modifies a register or a memory location, the RCG algorithm generates a group of one or more instructions which recovers the overwritten value in that register or memory location. We call such a group of instructions a reverse instruction group (RIG).
A value destroyed by an instruction can be recovered in three ways: (i) the value can be recalculated during instruction-level reverse execution, which we call the redefine technique; (ii) the value can be extracted from a previous use during instruction-level reverse execution, which we call the extract-from-use technique; and (iii) the value can be saved during forward execution and then restored during instruction-level reverse execution, which we call the state-saving technique. Figure 4 shows T (from Figure 3) and T 's reverse RT. An instruction i x in T and the generated RIG, RIG x , for that instruction are marked with the same index x in Figures 4(a) and 4(b), respectively. Note that the instructions that are shown in bold are extra instructions that are inserted into the original program for state-saving; thus, these instructions do not have associated RIGs. In addition, as will be explained in the next section, control flow is reversed by control-flow predicates inserted into the reverse code. Therefore, the control-flow instructions do not have associated RIGs in the generated reverse program. Consequently, in Figure 4, we did not assign indices to the control-flow instructions in the original program or to the state-saving instructions inserted into the original program to enable reverse execution. Now, let us have a closer look at some generated RIGs.
Example 5.1. Consider in Figure 4 the instruction i 8 which subtracts r 3 from r 12 and writes the result into r 11 . Since the only value that is being directly changed by i 8 is r 11 , the generated RIG for i 8 , RIG 8 , should recover r 11 . First of all, the RCG algorithm finds the value of r 11 that needs to be recovered. For this purpose, the RCG algorithm performs reaching-definitions analysis within the PP under consideration. The definition that reaches the point just above i 8 is r 11 = 3, which comes from i 3 . Therefore, the value of r 11 to be recovered is 3. This value is used within the division operation at i 5 before being destroyed. However, the division operation does not allow the extraction of the destroyed value due to a possible loss of precision. Thus, the extract-from-use technique is inapplicable in this case. However, the redefine technique is applicable. Therefore, the RCG algorithm simply places the found reaching definition into RIG 8 which redefines the destroyed value and thus recovers it.
Note that the regeneration of a value may require the regeneration of other values. For example, this would happen if the value of a register were to be regenerated using the value of another register that is also overwritten. Therefore, the redefine and extract-from-use techniques are usually applied recursively (see Section 7).
On the other hand, if there are multiple definitions reaching a point along different control-flow paths, then the definition to be recovered depends on the dynamic control flow of the program. In this case, the RCG algorithm generates instructions that recover each statically reaching definition. Then, the RCG algorithm gates these instructions using conditional branches which determine (according to the dynamic control flow of the program) which definition should actually be recovered in a particular instruction-level reverse execution instance. The following example illustrates this case.
Example 5.2. Consider RIG 9 in Figure 4 . In RIG 9 , the conditional branch instruction chooses the correct definition of r 12 to be recovered based on the specific control-flow path taken in forward execution of the program (see Section 7 for a detailed explanation).
Finally, if a value can be recovered neither by the redefine technique nor by the extract-from-use technique, the RCG algorithm applies the state-saving technique. Consider the following example.
Example 5.3. In Figure 4 , the value of r 3 that is being destroyed by i 1 comes from outside of the PP in which i 1 resides. Thus, this value cannot be redefined within the RIG to be generated. Moreover, this value is not used within the PP before being destroyed. Therefore, in this case, the RCG algorithm recovers r 3 by state-saving which is performed by the inserted save instruction before i 1 . Then, the generated RIG, RIG 1 , simply restores r 3 's value from the saved record.
From what we outlined above, we can state that the RCG algorithm tries to apply state regeneration by using the redefine and the extract-from-use techniques as much as possible. The RCG algorithm applies the redefine and extract-from-use techniques in combination to produce the smallest RIG. In case neither of these techniques is applicable, a RIG is generated by employing state restoration. Generally, state-saving is applied if (i) the reaching-definitions analysis cannot accurately find the value that is being destroyed, as in the case of memory aliasing (see Section 9.2), or (ii) the value to be recovered comes from outside of the PP under consideration and the extraction of the value from a previous use within the PP is not possible.
We next explain the third and final main step of the RCG algorithm.
Combining the Reverse Instruction Groups
In order to keep the state consistent, the instructions that are executed in a certain order during forward execution are supposed to be reversed in the opposite order during instruction-level reverse execution. This is similar to rewinding a movie, where the frames are shown in the order opposite to the one in which they are shown during forward play. In this section, we outline how we combine the RIGs in order to establish a control flow which is exactly opposite to the control flow of the original program.
From the control-flow analysis point of view, we represent each PP by a control-flow graph which we call a partitioned control-flow graph (PCFG). A PCFG is no different than an ordinary control-flow graph except that a PCFG is not necessarily constructed for a complete function or a whole program. Figure 5 (a) shows the PCFGs for the three PPs of T , defined previously in Section 5.1. As seen in the figure, each PCFG is further divided into basic blocks (BBs). Therefore, we represent a program in a hierarchical structure. PPs form the highest level of hierarchy, then come the BBs, and, finally, come the individual instructions. The RIGs are combined in accordance with our hierarchical representation of a program. From the lowest level of program hierarchy to the highest level, this combination process can be outlined as follows:
(1) Combine the RIGs generated for a BB to construct the reverse of that BB which we designate as RBB for short.
(2) Combine the RBBs generated for a PP to construct the reverse of that PP which we designate as RPP for short.
(3) Combine the RPPs to construct the reverse program.
Within basic blocks of a PP, control follows a linear path. Since we want the instructions to be reversed in the opposite order they are executed in the forward direction, linear execution within basic blocks can simply be reversed by placing the RIGs in the reverse order the original instructions are placed. We call this placement order the bottom-up placement order. For instance, as seen in Figure 5 , the RIG generated for the first instruction of foo in BB 1 is placed at the end of the corresponding RBB, RBB 1 ; the RIG generated for the second instruction is placed on top of the previously generated RIG, and so on.
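The bottom-up placement order can be checked on a toy basic block with a short Python sketch. The three instructions and their RIGs below are hypothetical and were chosen only because each one can be undone from its own operands; the helper names build_rbb and run are likewise assumptions, not part of the RCG algorithm's interface.

```python
def build_rbb(rigs):
    """Place the RIGs bottom-up: the RIG of the last instruction comes first."""
    return list(reversed(rigs))


def run(block, state):
    """Execute a list of state-transforming steps in order."""
    for step in block:
        step(state)
    return state


# Toy basic block (hypothetical): i1, i2, i3 in lexical order.
bb = [
    lambda s: s.update(r1=s["r1"] + 4),        # i1: r1 = r1 + 4
    lambda s: s.update(r2=s["r2"] ^ s["r1"]),  # i2: r2 = r2 ^ r1
    lambda s: s.update(r1=s["r1"] - s["r2"]),  # i3: r1 = r1 - r2
]
# One RIG per instruction, listed in the same lexical order.
rigs = [
    lambda s: s.update(r1=s["r1"] - 4),        # RIG 1 undoes i1
    lambda s: s.update(r2=s["r2"] ^ s["r1"]),  # RIG 2 undoes i2
    lambda s: s.update(r1=s["r1"] + s["r2"]),  # RIG 3 undoes i3
]

rbb = build_rbb(rigs)
after_forward = run(bb, {"r1": 3, "r2": 10})
restored = run(rbb, dict(after_forward))
```

Running the RIGs in bottom-up order returns the registers to their initial values, whereas running the same RIGs in lexical order does not.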
Then, we consider the second level of hierarchy above where RBBs are to be combined into RPPs. In this case, we insert conditional branch instructions into the reverse code to provide a link between nonconsecutive RBBs. Much like a railroad track changer, the conditional branch instructions inserted after an RBB select the correct path to be taken following that RBB. This selection is automatically performed by the predicates of the inserted conditional branch instructions. For instance, after RBB 4 , a conditional branch is inserted which determines which RBB (RBB 2 or RBB 3 ) to execute next. Finally, at the highest level of hierarchy, we combine the RPPs by using branch instructions, which is similar to the combination of the RBBs. However, unlike the branch instructions used to combine the RBBs, the target addresses of the branch instructions used to combine the RPPs might be statically determined (in case a PP is immediately reachable from a unique source address) or might also be dynamically determined (in case a PP is immediately reachable from multiple source addresses). For instance, for T , we statically know that foo can immediately be reached from the first PP of main only. Similarly, the second PP of main can immediately be reached from foo only. Therefore, the reverse of foo is linked to the reverse of the first PP of main by using a branch instruction whose target address is hardcoded. Similarly, the reverse of the second PP of main is linked to the reverse of foo by a branch with a hardcoded target address (see Figure 4(b) ). On the other hand, the technique we use for dynamically determining the branch targets is fairly detailed; therefore, we leave the rest of this discussion to Section 8.
So far, we introduced the basics of the three RCG steps. In the following three sections, we will present each RCG step in more detail.
RCG STEP 1: PROGRAM PARTITIONING
In the previous section, we noted that the RCG algorithm divides the program under consideration into smaller program partitions. In this way, we can carry out control-flow analysis easily. In this section, we explain the details about how we generate PPs given a program at the assembly instruction level.
The RCG algorithm constructs a partitioned control-flow graph, PCFG = (N , E, start, exit), to extract a PP from a program. A PCFG is built for every PP in a program. N is the set of nodes, E is the set of edges representing the flow of control between the nodes, and start and exit are the unique entry and exit nodes of the PCFG, respectively. Each node in a PCFG represents a basic block (BB). Since most modern processors support only two-way branches, we assume that a BB in a PCFG may have, at most, two outgoing edges, one for the target path and the other for the fall-through path of a conditional branch instruction ending that BB. (That is, a multiway branch in a high-level programming construct, such as a C "switch" statement, is expressed by a combination of two-way branches at the assembly level.)
Listing 1 shows the pseudocode for program partitioning. Construct PCFG() builds a PCFG for each PP in the program under consideration by reading the instructions of the program in a loop (lines 5 to 10 of Listing 1). Construct PCFG() starts the construction of a PCFG by inserting a start block at the beginning of that PCFG (line 4). Then, in the loop, Construct PCFG() adds BBs to the PCFG until a function call instruction (e.g., the PowerPC "bl" instruction, branch with link register update) or an indirect branch instruction (e.g., the PowerPC "blr" instruction, branch to link register) is encountered in the program being analyzed. When Construct PCFG() encounters a function call or an indirect branch instruction, it ends the construction of the PCFG.
Note that an interpartitional control-flow edge coming to a BB j in a PCFG i is represented as an edge from the entry block of PCFG i to BB j . Similarly, an interpartitional edge going out from a BB j in a PCFG i is represented as an edge from BB j to the exit block of PCFG i . Therefore, a PCFG does not contain any edges coming from or going to any other PCFG, which completely isolates PCFGs from each other. The interpartitional control flow between the PPs, on the other hand, is represented by a separate call graph which will be explained in Section 8.3.1.
RCG STEP 2: RIG GENERATION
A RIG reverses the effect of an instruction that directly modifies a register or a memory location. This section presents a detailed description of RIG generation.
Suppose that a definition δ destroy destroys the value D of a variable V (a directly modified register or memory location) at a program point as shown in Figure 6 . Let us name the program point just before δ destroy as P.
Each statically reaching definition δ i of V at point P might correspond to the instance where D is actually assigned to V (Figure 6 ). The definition that corresponds to the actual assignment instance is the definition that dynamically reaches point P. Therefore, recovering D means recovering the definition of V that dynamically reaches point P.
The definition of V that dynamically reaches point P depends on the dynamically taken path to P. However, the path that will actually be taken is typically not known prior to program execution. Therefore, we use the following technique to recover D: we generate sets, each of one or more instructions, where each set recovers one or more definitions of V statically reaching P along at least one path. For instance, referring to Figure 6 , we can generate a set that recovers δ 1 . This set indeed recovers D if path 1 is dynamically taken. Similarly, we can generate another set that recovers δ 2 . This second set indeed recovers D if either path 2 or path 3 is dynamically taken. We generate as many sets as necessary to cover all possible paths to δ destroy from the definitions of V reaching P. If more than one set is generated, we tie the sets together via conditional branch instructions. The predicates of the conditional branch instructions carry the dynamic control-flow information of the program. Therefore, the correct set to be executed during reverse execution is automatically selected by these predicates. If a predicate is also destroyed before δ destroy , then, in the same way, we generate the sets that recover that predicate. The sets that recover the reaching definitions of V , the conditional branch instructions (if any) that are used to gate these sets and the instructions (if any) that are generated to recover the predicates all together constitute a RIG for δ destroy .
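The structure of a RIG described above, recovery sets gated by predicates, can be sketched in Python as follows. The forward code is a hypothetical two-path fragment, not one of the paper's figures, and make_rig is an illustrative helper rather than the RCG algorithm itself. The sketch also assumes that the predicate operand x is still available at reverse time, so no predicate-recovery instructions are needed.

```python
def make_rig(gated_sets, last_set):
    """Build a RIG from recovery sets.

    gated_sets: list of (predicate, recovery_set) pairs, tried in order;
                each pair plays the role of a conditional branch.
    last_set:   recovery set reached by fall-through when no predicate holds.
    """
    def rig(state):
        for predicate, recover in gated_sets:
            if predicate(state):
                recover(state)
                return
        last_set(state)
    return rig


# Hypothetical forward code:
#   if x > 100: v = x + 1    (path 1, definition d1)
#   else:       v = y | 15   (path 2, definition d2)
#   v = 0                    (the destroying definition)
def forward(state):
    if state["x"] > 100:
        state["v"] = state["x"] + 1
    else:
        state["v"] = state["y"] | 15
    state["v"] = 0


# One set per statically reaching definition, gated by the branch predicate.
rig_v = make_rig(
    gated_sets=[(lambda s: s["x"] > 100, lambda s: s.update(v=s["x"] + 1))],
    last_set=lambda s: s.update(v=s["y"] | 15),
)

s1 = {"x": 150, "y": 3, "v": None}
forward(s1)
rig_v(s1)   # recovers d1 because path 1 was taken

s2 = {"x": 7, "y": 3, "v": None}
forward(s2)
rig_v(s2)   # recovers d2 because path 2 was taken
```

The same predicate that chose the forward path chooses the recovery set, so the value restored is the one that dynamically reached the destruction point.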
Let us now describe how a set of instructions which we will denote by ζ can be generated to recover at least one definition of V reaching P. As briefly mentioned in Section 5.2, there are three techniques that are followed to generate a ζ : the redefine technique, the extract-from-use technique and the state-saving technique.
The Redefine Technique
The redefine technique places into ζ the instruction α i that computes D i at the definition δ i statically reaching point P (Figure 6 ). If any one of the variables, say x i , that is used for computing D i is also destroyed, then the instruction that recovers x i must be inserted before α i in ζ ; this must be applied recursively for all other modified variables in the dependency chain. For instance, assume that a register r i depends on another register r j , r j depends on r k , and r k does not depend on any other variable in the program. Therefore, there is a dependency chain from r i to r k . Also, assume that both r i and r j are destroyed, while r k is available. Then, using the redefine technique recursively, the RCG algorithm first recovers r j using r k , and then recovers r i using r j .
The redefine technique can potentially recover only one definition δ i of V reaching P: namely, the definition δ i which is redefined. Therefore, if a variable has more than one statically reaching definition (e.g., δ i and δ j ) at a point in a program, the redefine technique has to be applied, if applicable, to each statically reaching definition (e.g., both δ i and δ j ) separately. Note that, however, the external value of an input variable (e.g., a global variable or an input argument) of a PP is certainly not defined within the PP, but comes from outside of the PP. Therefore, the external values of variables of a PP cannot be recovered by the redefine technique.
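A minimal Python sketch of the recursive step follows, using the r i, r j, r k dependency chain from the text. The dictionary encoding of definitions and the function name redefine_order are assumptions for illustration. The sketch only computes the order in which values must be recomputed and does not generate instructions.

```python
def redefine_order(var, defs, available, plan=None):
    """Return the variables to recompute, operands first, so that `var` is recovered.

    defs:      maps a variable to the variables its defining instruction reads.
    available: variables whose values are still intact.
    Raises ValueError when a needed value has no definition inside the partition
    (an external value), where the redefine technique cannot be applied.
    """
    if plan is None:
        plan = []
    if var in available or var in plan:
        return plan
    if var not in defs:
        raise ValueError(f"{var} is not defined inside the partition")
    for operand in defs[var]:
        redefine_order(operand, defs, available, plan)
    plan.append(var)
    return plan


# ri depends on rj, rj depends on rk; ri and rj are destroyed, rk is available.
defs = {"ri": ["rj"], "rj": ["rk"]}
plan = redefine_order("ri", defs, available={"rk"})
```

Here plan lists rj before ri, matching the order in the text: rj is first recovered from rk, and ri is then recovered from rj.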
The following example illustrates how the redefine technique works.
Example 7.1 (The Redefine Technique). Consider the instruction that overwrites the value of register r 12 in BB 4 in Figure 7 (we need the overwritten value of r 12 because the overwritten value is used both in BB 2 and BB 3 ). Let us name this instruction α (i.e., α = "r 12 = r 11 * r 10 ") and the analysis points just before and just after α as P and P′, respectively. There are two different definitions of r 12 reaching P on two different paths: "r 12 = r 10 + 1" and "r 12 = r 3 | 15". Therefore, the value of r 12 at point P is either "r 10 + 1" or "r 3 | 15". Moreover, neither r 10 nor r 3 is modified after being used to define r 12 and before point P. Therefore, r 12 can be recovered on one path by executing the set ζ 1 = {(r 12 = r 10 + 1)}, and r 12 can be recovered on the other path by executing the set ζ 2 = {(r 12 = r 3 | 15)}.
The Extract-from-Use Technique
The extract-from-use technique places into ζ an instruction β that extracts the destroyed value of V out of a use µ (including a possible use of V by δ destroy itself) on the path(s) between δ destroy and any definition of V reaching P (Figure 6 ). However, again, if any other variable x i in β that is used for extracting V is also destroyed, then an instruction that recovers x i must be inserted before β in ζ ; this must be applied recursively for all other modified variables in the dependency chain.
As opposed to the redefine technique, the extract-from-use technique can recover multiple definitions (e.g., both δ 2 and δ 3 in Figure 6 ) of V reaching P. Moreover, since the external value of an input variable of a PP may be used within the PP, the input values to a PP may be recoverable by use of the extract-from-use technique. However, the extract-from-use technique is less likely to be applicable than the redefine technique because there might not always be a use µ on a path to δ destroy , and, even if a use is available, µ's operation might not always allow such an extraction of the value of V . For example, the instruction "r 3 = r 1 / r 2 " might prevent the extraction of r 1 or r 2 since the result of the division operation might be truncated due to the limited precision r 3 can represent.
In general, operations such as "integer add", "integer subtract", "integer multiply", and "shift" allow extraction of values provided that the information that will allow such an extraction is not lost due to an overflow/underflow or a shift-out during these operations. On the other hand, operations such as "integer divide" and floating point calculations do not typically allow extraction of values due to a possible loss of precision on the result. The decision on whether or not to use the extract-from-use technique on the operations that might not be reversible in special situations such as overflow/underflow or shift-out is left to the programmer. For example, the programmer may use compiler warnings, the overflow/underflow detection logic of the processor, or overflow/underflow detection code to ensure that the program is free of overflow/underflow situations.
The following example illustrates how the extract-from-use technique works.
Example 7.2 (The Extract-from-Use Technique). Consider again the instruction named α in Figure 7 . After the two definitions of r 12 reaching P, there are two uses of r 12 on each path: "r 11 = r 12 − r 3 " and "r 11 = r 3 − r 12 ". Moreover, neither r 11 nor r 3 is modified between the points of uses and point P . These subtractions are performed as integer operations and thus they are reversible provided that their results are not truncated. Thus, if the point P is reached passing through the use "r 11 = r 12 − r 3 ", the destroyed value of r 12 can be obtained by executing the set ζ 3 = {(r 12 = r 11 + r 3 )}; if P is reached passing through the use "r 11 = r 3 − r 12 ", then the destroyed value of r 12 can be obtained by executing the set ζ 4 = {(r 12 = r 3 − r 11 )}.
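The two extractions in Example 7.2 amount to solving each subtraction for r 12. The Python sketch below checks them on arbitrary small values, under the same assumption as the text that no result is truncated. The function names and register values are illustrative only. The final lines show why an integer division does not permit the same treatment.

```python
def extract_when_minuend(r11, r3):
    """The use was r11 = r12 - r3, so r12 = r11 + r3 (set zeta_3)."""
    return r11 + r3


def extract_when_subtrahend(r11, r3):
    """The use was r11 = r3 - r12, so r12 = r3 - r11 (set zeta_4)."""
    return r3 - r11


r12, r3 = 37, 5  # arbitrary values chosen for illustration

recovered_a = extract_when_minuend(r12 - r3, r3)     # use on the first path
recovered_b = extract_when_subtrahend(r3 - r12, r3)  # use on the second path

# Integer division loses information: two different dividends give one quotient,
# so the dividend cannot be extracted from the quotient and the divisor.
quotients = {dividend: dividend // 2 for dividend in (6, 7)}
```

Both recovered values equal the original r 12, while 6 and 7 both divide to 3, so the dividend of a division cannot be extracted in general.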
The State-Saving Technique
The RCG algorithm applies the redefine and the extract-from-use techniques in combination to come up with the smallest RIG. However, due to the limitations of these techniques described in the previous subsections, we may not be able to generate all of the sets necessary to cover all paths to δ destroy (Figure 6 ). Even worse, as in the case of memory aliasing which will be described in Section 9.2, we may not be able to find the statically reaching definitions of V at all. In such circumstances, the RCG algorithm resorts to the state-saving technique.
In general, we save a state by inserting a push-like instruction into the original code, just before δ destroy . The inserted instruction saves the state (e.g., r 9 in Figure 8 ) that is being modified by δ destroy into a free-memory location that is pointed to by a memory pointer (usually a register) and moves the memory pointer to the next free location. Then, in the reverse program, a pop-like instruction is generated that moves the memory pointer to the next value to be restored and restores the saved value from memory.
A push-/pop-like instruction refers to an instruction that works in the same way as an ordinary push/pop instruction. However, a push-/pop-like instruction can work on any memory pointer, while a push/pop instruction can work only on the stack pointer. For instance, the PowerPC 860 provides store-update and load-update instructions that can be used as push-like and pop-like instructions, respectively. Ordinary push and pop instructions are not considered for state-saving in order to prevent the possible corruption of the stack. If the target architecture does not support pop-like/push-like instructions that automatically increment/decrement a memory pointer, save and restore operations are handled by using ordinary store and load instructions with explicit increment/decrement of a dedicated memory pointer.
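The push-like and pop-like behavior can be modeled with a small Python sketch. The class SaveArea and its method names are hypothetical. The sketch stands in for the explicit store/load plus pointer update variant described above, using a dedicated pointer that is separate from the program stack.

```python
class SaveArea:
    """A dedicated save area with its own pointer, kept apart from the stack."""

    def __init__(self, size):
        self.mem = [0] * size
        self.ptr = 0  # index of the next free location

    def push_like(self, value):
        """Save `value` at the free location, then advance the pointer."""
        self.mem[self.ptr] = value
        self.ptr += 1

    def pop_like(self):
        """Move the pointer back to the last saved value and return it."""
        self.ptr -= 1
        return self.mem[self.ptr]


save = SaveArea(size=16)
regs = {"r9": 42}

# Forward execution: the inserted save runs just before the destroying definition.
save.push_like(regs["r9"])
regs["r9"] = 7

# Reverse execution: the generated RIG restores the saved value.
regs["r9"] = save.pop_like()
```

Because saves and restores happen in opposite orders, a last-in, first-out area is sufficient: the most recently destroyed value is the first one reverse execution needs.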
An Example of RIG Generation
In the previous three subsections, we explained the three methods (redefine, extract-from-use and state-saving) we use to generate a set that recovers at least one definition of the variable under consideration. We also stated that a RIG is nothing but a combination of those sets which cover all possible paths to the destruction point. In this section, we give an example of a complete RIG generation by using the PCFG shown in Figure 9 .
Example 7.3 (RIG Generation). In Examples 7.1 and 7.2, we gave four different sets ζ 1 , ζ 2 , ζ 3 and ζ 4 each of which recovers the value of r 12 along a particular path to point P in Figure 9 . Let us now pick some of these sets to cover all the paths to point P and combine the selected sets to generate a complete RIG for recovering the value of r 12 . Let us pick ζ 1 = {(r 12 = r 10 + 1)} to recover r 12 along the left path to point P and ζ 4 = {(r 12 = r 3 − r 11 )} to recover r 12 along the right path to point P in Figure 9 (note that the RCG algorithm always chooses minimum-sized sets to produce a smallest possible RIG). Since all paths to point P are covered, ζ 1 and ζ 4 are enough to generate a RIG.
We should now combine ζ 1 and ζ 4 by using a conditional branch instruction that determines along which path P is reached. The predicate of this conditional branch instruction is r 10 > 100. However, we cannot use this predicate directly because the value of r 10 is destroyed in BB 2 . Therefore, we should first recover r 10 . We can recover r 10 by two successive applications of the redefine technique: we first redefine "r 11 = 3" and then redefine "r 10 = r 3 /r 11 " (r 11 is redefined because it is destroyed as well). Note, however, that since our aim is to recover r 12 only, we should use a temporary register r t instead of r 11 and r 10 in order not to destroy the values of r 11 and r 10 at point P. Therefore, a RIG for recovering r 12 in PowerPC assembly can be generated as follows:
        li    rt, 3
        divw  rt, r3, rt
        cmpwi rt, 100
        bgt   L1
        sub   r12, r3, r11
        b     L2
    L1: addi  r12, r10, 1
    L2: . . .
Listing 2 shows the pseudocode for the generation of a RIG (with minimum size cost, C) for an instruction α. To find the minimum-sized RIG, we apply the extract-from-use technique (line 13) and the redefine technique (line 16) for different paths, starting from reaching definitions of t (if a reaching definition cannot be statically found, we save state; see line 27). For this purpose, we process all different uses (with reversible operators) and/or definitions on different paths, where each use/definition covers a set of one or more paths. If the cost of the final RIG is infinity, which means neither the extract-from-use technique 
RCG STEP 3: COMBINING THE RIGS
The last step to build a reverse program is to combine the RIGs together. As mentioned in Section 5.3, RIG combination is a hierarchical process. In the lowest level, RIGs are combined into RBBs. Then, in the next higher level, RBBs are combined into RPPs. Finally, at the highest level, RPPs are merged to form the reverse program. The following three subsections present each level of hierarchy in detail.
Constructing the RBBs
The first step to combine the RIGs is to build the reverse of each basic block in the original program. The instructions within a BB complete in lexical order. Thus, to keep the state consistent during instruction-level reverse execution, the RIGs should execute in an order exactly opposite to the order of execution of the instructions in the original program's BB. Therefore, placing the RIGs in the order opposite to the lexical order of a BB is sufficient to generate the reverse of that BB. In other words, if a basic block BB i in the PP under consideration has a sequence of instructions I BB i = (α 1 , α 2 , α 3 , . . . , α n ), and if the corresponding RIGs generated for BB i are RIG BB i = {RIG 1 , RIG 2 , RIG 3 , . . . , RIG n }, then the reverse of BB i , designated as RBB i , consists of the sequence I RBB i = (RIG n , RIG n−1 , RIG n−2 , . . . , RIG 1 ). Note that since a generated RIG, RIG k (1 ≤ k ≤ n) in I RBB i , may contain branch instructions (see Example 7.3), an RBB may not necessarily be a single basic block, but instead may be a combination of multiple basic blocks. The following example shows how the RBBs are constructed from the BBs of a PP.
Example 8.1 (Constructing the RBBs). Figure 10(a) shows the PCFG of PP x previously shown in Figure 7 , and Figure 10(b) shows the RBBs generated for the reverse of PP x , RPP x . The RCG algorithm generates the reverse of each BB in PP x by combining the generated RIGs in bottom-up placement order in RPP x . While the reverse of BB 1 , BB 2 and BB 3 (namely, RBB 1 , RBB 2 and RBB 3 ) are constructed each as a single BB, the reverse of BB 4 , RBB 4 , consists of three separate BBs. RBB 4 is separated into three BBs because the reverse of the instruction "r 12 = r 11 * r 10 " in BB 4 consists of multiple instructions two of which are branches (as given in the assembly listing in Example 7.3).

Constructing the RPPs
To generate the reverse of a PP, the RBBs generated for that PP should be combined in an appropriate way. Once again, this combination should satisfy our argument that the RIGs should execute in the order opposite to the execution order of the instructions in the original program. Since edges in a PCFG designate the control flow between the BBs, we reverse the control flow simply by inverting the edges. Inverting the edges implies two facts. First, the RBBs are placed in an order opposite to the order of BBs in a program. This is the same as the bottom-up placement order of RIGs within an RBB. Second, a confluence point of incoming edges in a PCFG typically becomes a fork point of outgoing edges in the reverse version of that PCFG, and vice-versa. Consequently, one or more conditional branch instructions are needed at the generated fork points in the reverse program.
Suppose that a confluence point C o in a PCFG becomes a fork point F r in the reverse PCFG as seen in Figure 11 . Depending on the number of incoming edges to C o (or outgoing edges from F r ), the RCG algorithm inserts at F r one or more conditional branch instructions that decide on which edge to take at F r during instruction-level reverse execution (Figure 11 ).
Due to the linear orientation of code in memory, typically the target of one of the n outgoing edges from F r immediately follows F r in address space. Let us name this outgoing edge as e (Figure 11 ). Since it is inefficient to generate a conditional branch whose target address is the next address in memory, the RCG algorithm omits the generation of a conditional branch instruction for e. A conditional branch instruction is generated for all other edges leaving F r . Therefore, due to our prior assumption that the target processor architecture supports only two-way branches (see Section 6), the number of conditional branches to be inserted at a fork point with n targets is n − 1.
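The n − 1 count can be illustrated with a short Python sketch that emits one pseudo-branch per outgoing edge except the fall-through edge e. The labels, predicate strings, and the function name branches_at_fork are hypothetical placeholders and are not taken from the paper's figures.

```python
def branches_at_fork(edges, fall_through):
    """Emit conditional branches for every edge leaving a fork point except
    the edge whose target immediately follows the fork point in memory.

    edges: list of (predicate, target_label) pairs, one per outgoing edge.
    """
    targets = [target for _, target in edges]
    if fall_through not in targets:
        raise ValueError("the fall-through target must be one of the edges")
    return [
        f"if ({predicate}) goto {target}"
        for predicate, target in edges
        if target != fall_through
    ]


# A fork point with n = 3 targets; RBB_c is placed right after the fork point.
edges = [("p1", "RBB_a"), ("p2", "RBB_b"), ("otherwise", "RBB_c")]
code = branches_at_fork(edges, fall_through="RBB_c")
```

For three outgoing edges the sketch emits two conditional branches. Control reaches RBB_c by falling through when neither predicate holds.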
The decision on which edge to take at F r is automatically performed during instruction-level reverse execution by the predicates of the conditional branch instructions inserted at F r . This approach is similar to the use of predicates for choosing the correct set to execute inside a RIG (see Section 7). Since F r corresponds to C o , the predicates to be chosen at F r are essentially the same as the predicates that determine along which edge C o is dynamically reached during forward execution. An interested reader can refer to Mooney [2002a, 2002b] for more information about how we find the correct predicate to use for each conditional branch instruction inserted at a fork point in an RPP.
On the other hand, suppose that a fork point F o in a PCFG becomes a confluence point C r in the reverse PCFG (recall from Section 6 that a fork point in a PCFG can have at most two outgoing edges at the assembly level). This scenario is depicted in Figure 12 . In this case, it is necessary to establish a link between C r and each RBB that is the source of one of the joining edges at C r (RBB 1 and RBB 2 in Figure 12 ). Again, due to linear orientation of code in memory, the RBB that corresponds to the BB on the fall-through path of F o (e.g., RBB 1 in Figure 12 ) will always immediately precede C r in the reverse code. Hence, a link between RBB 1 and C r is already established. Therefore, the remaining part is to provide the link between C r and RBB 2 that corresponds to the BB on the target path of F o . This link is established by inserting an unconditional branch at the end of RBB 2 as shown in Figure 12 .
The following example illustrates how the RBBs are combined to generate an RPP.
Example 8.2 (Constructing the RPPs). Figure 13 shows the PCFG of PP x and the PCFG of the corresponding RPP, RPP x . Also seen in the figure are the assembly listings of the instrumented PP x (i.e., instrumented with state-saving instructions) and RPP x . As seen by the assembly listings, the RBBs are placed in bottom-up placement order in memory. Since the RBBs are combined with the inverted versions of the edges in the PCFG of PP x , the confluence point designated as C o in the PCFG of PP x becomes a fork point designated as F r in the PCFG of RPP x , and the fork point designated as F o in the PCFG of PP x becomes a confluence point designated as C r in the PCFG of RPP x . Consequently, a conditional branch instruction is inserted at point F r and an unconditional branch instruction is inserted at the head of one of the joining edges at C r .
The predicate of the conditional branch inserted at point F r in RPP x is r 10 > 100. Since r 10 is not available at point C o which corresponds to point F r in RPP x , this predicate is first recovered just like it is recovered in RBB 4 (see Example 7.3). Also, as seen in Figure 13 , instead of using r 10 > 100, the RCG algorithm uses the complementary predicate r 10 ≤ 100 at point F r . This can be explained as follows: In PP x , BB 2 immediately follows the conditional branch at point F o . However, due to the bottom-up placement order of RBBs, the situation is opposite in the reverse code. Namely, instead of RBB 2 , RBB 3 immediately follows the conditional branch inserted at point F r (see assembly listing of RPP x ). Therefore, the predicate is inverted in the reverse code.
Note that an unconditional branch instruction is placed only at the end of RBB 3 that corresponds to the BB on the target path of the fork in PP x (the other RBB, RBB 2 , simply precedes RBB 1 in address space). 
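The inversion of fork and confluence points can be illustrated with a minimal Python sketch. It is not part of the RCG implementation: the diamond-shaped graph, the basic-block names, and the helper functions `invert_edges` and `classify` are assumptions made for illustration, and fork and confluence points are approximated at the level of whole basic blocks rather than points on edges.

```python
from collections import defaultdict

def invert_edges(edges):
    """Flip every (source, destination) edge of a control-flow graph."""
    return [(dst, src) for src, dst in edges]

def classify(edges):
    """Return (fork nodes, confluence nodes) of a directed edge list."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    forks = {n for n, d in out_deg.items() if d > 1}
    confluences = {n for n, d in in_deg.items() if d > 1}
    return forks, confluences

# Hypothetical PCFG: BB1 forks to BB2 and BB3, which join again at BB4.
pcfg = [("BB1", "BB2"), ("BB1", "BB3"), ("BB2", "BB4"), ("BB3", "BB4")]

forks, confluences = classify(pcfg)
r_forks, r_confluences = classify(invert_edges(pcfg))

print("forward:", forks, confluences)
print("reverse:", r_forks, r_confluences)
```

In this sketch, the fork in the forward graph shows up as the confluence of the inverted graph and vice versa, which is why a conditional branch is needed at F r and an unconditional branch at one of the edges joining at C r .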
Combining the RPPs
After generating an RPP, the RCG algorithm combines that RPP with the other RPPs that have already been generated. In order to achieve this, the RCG algorithm must know the control-flow information between the PPs in the program under consideration.
Determining the Control-Flow Information between PPs.
Since PCFG construction is performed for each PP separately, each PCFG designates the control flow within a particular PP only. In other words, a PCFG does not contain any edges that show the flow of control between the PPs. Therefore, the control-flow information between the PPs is determined by another graph, G = (N , E, s, t), which is a call graph (CG).
Listing 3 Grow CG(): The CG construction algorithm
Input: A program partition PP j for which a PCFG has been generated
Output: A node n j in the CG with a set of edges connected to n j
begin
1 Add a node n j to CG for PP j
2 for all PP k immediately reached from PP j do
3   if (n k = node of(PP k )) ≠ NULL then
4     Add to the CG an edge e jk from n j to n k
5     Annotate e jk
6   else
7     Set e jk as pending
8   end if
9 end for
10 if n j has a pending incoming edge then
11   for all e ij from n i to n j do
12     Add to the CG an edge e ij from n i to n j
13     Annotate e ij
14   end for
15 end if
end
In G = (N , E, s, t), the set N is the set of nodes designating the PPs in a program, and the set E is the set of edges designating the flow of control between those PPs. The notations s and t designate the unique entry and exit nodes of a CG. Note that an indirect branch whose target PP is statically unknown may potentially invoke any PP. Therefore, if a PP, PP i , makes an indirect call whose target PP is statically unknown, the RCG algorithm inserts an edge from PP i to every other PP.
To learn from which address(es) a PP can be immediately reached and to be able to move the control backwards to a source address, the RCG algorithm annotates an edge e ij ∈ E from a PP, PP i , to another PP, PP j , with the address of the instruction in PP i that transfers control from PP i to PP j .
Listing 3 shows the pseudocode for call graph construction. Grow CG() adds a new node to the CG for a PP when a PCFG is built for that PP. After a new node n j is generated for a PP, PP j , Grow CG() checks the PPs that are immediately reachable from PP j . For every PP, PP k , that is immediately reachable from PP j and for which a PCFG (and thus a node in the CG) has already been generated, Grow CG() adds an edge e jk from the node of PP j to the node of PP k and annotates e jk with the address of the instruction transferring control from PP j to PP k (lines 2 to 5 of Listing 3). For every other PP that is immediately reachable from PP j but for which a node has not yet been generated, Grow CG() sets a pending edge (lines 6 and 7). Then, Grow CG() checks whether PP j has pending incoming edges set for it. If PP j has pending incoming edges, Grow CG() adds to the CG all the pending incoming edges that are set for PP j and annotates those edges appropriately (lines 10 to 15). 
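The pending-edge bookkeeping of Grow CG() can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `CallGraph`, its method `grow`, the PP names, and the addresses are all hypothetical, and each PP is assumed to transfer control to a given target from at most one address.

```python
class CallGraph:
    """Sketch of a call graph that is grown one program partition at a time."""

    def __init__(self):
        self.nodes = set()    # PPs for which a node already exists
        self.edges = []       # (source PP, target PP, annotated address)
        self.pending = {}     # target PP -> [(source PP, address)] awaiting a node

    def grow(self, pp, successors):
        """Add a node for `pp`.

        `successors` maps each PP immediately reached from `pp` to the
        address of the instruction in `pp` that transfers control to it.
        """
        self.nodes.add(pp)
        for target, addr in successors.items():
            if target in self.nodes:
                # Target node exists: add and annotate the edge now.
                self.edges.append((pp, target, addr))
            else:
                # Target node not generated yet: remember a pending edge.
                self.pending.setdefault(target, []).append((pp, addr))
        # Resolve edges that were waiting for this node.
        for source, addr in self.pending.pop(pp, []):
            self.edges.append((source, pp, addr))

# Hypothetical program: PP1 calls PP3; PP3 calls itself and PP5.
cg = CallGraph()
cg.grow("PP1", {"PP3": 0x100})
cg.grow("PP3", {"PP3": 0x200, "PP5": 0x200})
cg.grow("PP5", {})
print(sorted(cg.edges))
```

Because the node for a PP is added before its successors are examined, a recursive transfer (PP3 to PP3 in this sketch) is added immediately rather than set as pending.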
Using the CG to Combine the RPPs.
Equipped with the control-flow information between PPs, the final step in the RCG algorithm is to invert that control flow to combine the RPPs into a complete reverse program. In this section, we describe how we use a CG to achieve this task.
If a program partition PP i is immediately reachable from a single static location whose address is A in the program under consideration (i.e., there is a single edge coming to the node of PP i in the CG and that edge has an annotation A), then in the reverse code, the address RA corresponding to A is the unique address to which the control has to be directed after the reverse of PP i , RPP i , is executed. This is easily handled by inserting at the end of RPP i an unconditional branch instruction whose target address is RA. However, if a program partition PP i is immediately reachable from multiple static locations (i.e., there are multiple edges coming into the node of PP i in the CG), then the location from which PP i is immediately reached during a specific execution of the program, and thus the corresponding location in the reverse code to which the control should be directed after executing RPP i , cannot be obtained statically. Therefore, in such a case, the RCG algorithm applies a dynamic technique, called the stack-tracing technique, to find the location to which the control should be directed after executing RPP i . We next describe the stack-tracing technique.
The stack-tracing technique can simply be described as saving the statically unknown return addresses of RPPs into a stack at run-time. During reverse execution, the saved addresses are popped back from the stack to provide return from an RPP. There are similar well-known forms of the stack-tracing technique in the literature that are usually applied for obtaining call traces of procedures. In the context of reverse code generation, we apply the stack-tracing technique to learn in which order the PPs are visited in a specific forward execution and thus to be able to reverse that order during reverse execution.
Let us assume that a subset PP of the PPs in the program under consideration contains all PPs that are immediately reachable from multiple static locations. We will designate the set of the reverses of these PPs as RPP . Thus, the remaining PPs in the program (those not in PP ) are immediately reachable from a single static location each. Also, assume that there are a total of n locations from which control reaches the PPs in PP . We will designate the addresses of these n locations as A = {A 1 , A 2 , . . . , A n }. We will also designate the corresponding n addresses in the reverse code as RA = {RA 1 , RA 2 , . . . , RA n }. Therefore, after executing the reverse of a program partition PP i ∈ PP during instruction-level reverse execution, control should be directed to an address RA i if and only if the control has reached PP i from the corresponding address A i .
The addresses to which control should be transferred from an RPP in RPP during a specific reverse execution of the program under consideration can be found by saving the addresses in RA into a run-time stack during forward execution. In other words, whenever a transfer from an address A i in A to a program partition PP j in PP occurs during forward execution, we save the corresponding reverse address RA i in RA to a run-time stack in order to provide a return from the reverse of PP j , RPP j , to address RA i during reverse execution.
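The save-and-pop behavior described above can be simulated with a short Python sketch. It is an illustration under several assumptions, not the paper's listing: the PP names and visit order are hypothetical (they are not the graph of Figure 15), every PP is entered at its beginning and left at its end, the entry PP has no incoming edges, and each reverse address is represented simply by the name of the PP whose last instruction made the transfer.

```python
def forward_execute(trace, multi_entry):
    """Build the run-time stack while PPs are visited in `trace` order."""
    stack = []
    for src, dst in zip(trace, trace[1:]):
        if dst in multi_entry:
            # Stand-in for saving the reverse address that corresponds to
            # the transfer instruction at the end of `src`.
            stack.append(src)
    return stack

def reverse_execute(last_pp, entry_pp, multi_entry, static_pred, stack):
    """Return the order in which the reverse partitions are visited."""
    order = [last_pp]
    current = last_pp
    while current != entry_pp:
        if current in multi_entry:
            current = stack.pop()          # indirect branch to popped address
        else:
            current = static_pred[current]  # hard-coded unconditional branch
        order.append(current)
    return order

# Hypothetical forward visit order.
trace = ["PP_a", "PP_b", "PP_b", "PP_c", "PP_d", "PP_b", "PP_d"]
multi_entry = {"PP_b", "PP_d"}      # reachable from multiple static locations
static_pred = {"PP_c": "PP_b"}      # reachable from a single static location

stack = forward_execute(trace, multi_entry)
rpp_order = reverse_execute(trace[-1], trace[0], multi_entry, static_pred, stack)
print(rpp_order)
```

In this sketch the reverse partitions are visited in exactly the opposite order of the forward trace, and the stack is empty once the reverse of the entry partition is reached.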
Listing 4 shows pseudocode for the stack-tracing technique. The stack-tracing technique inserts instructions both into the original and the reverse code to invert the control flow between the PPs that are immediately reachable from multiple static locations (lines 5 and 7). The instructions inserted into the original code handle the return address bookkeeping task by saving into a run-time stack the addresses in RA that correspond to the addresses on the dynamically taken edges in the CG. The instructions inserted into the reverse code, on the other hand, dynamically retrieve the saved addresses from the stack and transfer the control to the retrieved addresses.
Example 8.4 (Combining the Reverse Program Partitions). The example CG from the program in Figure 14 is shown in Figure 15(a). Figure 15(b) shows a sample PP visit order for a specific execution instance of the program as well as the RPP visit order for the corresponding reverse execution. According to Figure 15(b), the program under consideration is forward-executed starting from the beginning of PP 1 , traversing several PPs (some twice), until ending after PP 2 . Then, this execution is reversed by executing the corresponding reverse program starting from the beginning of RPP 2 (the reverse of PP 2 ), traversing several RPPs, until finishing reverse execution at the end of RPP 1 (the reverse of PP 1 ). The RPP sequence is marked with timestamps (encircled numbers in Figures 15(b) and 15(d)) which indicate the instances when control is transferred between the RPPs during reverse execution. Note that for simplicity in this example, we assume the PPs are entered from the beginning and exited from the end (i.e., control does not leave or enter a PP from a midpoint of the PP). Thus, the addresses on the edges of the call graph correspond to the end addresses of the PPs.
Since PP 2 , PP 3 , and PP 4 are immediately reachable from multiple static locations within the program as seen in Figure 15(a), the RCG algorithm, as seen in Figure 15(d), inserts indirect branch instructions ("blr", branch to link register) at the end of the corresponding reverse program partitions RPP 2 , RPP 3 , and RPP 4 , where the target addresses of these indirect branch instructions are determined dynamically during reverse execution. On the other hand, since PP 5 is immediately reachable from a single static location (from PP 3 ), an unconditional branch instruction with the hardcoded target address RA 3 (which is the reverse of the unique call location A 3 , the end of PP 3 ) is inserted at the end of the corresponding RPP, RPP 5 , as seen in Figure 15(d). Figure 15(d) also shows a table which indicates the dynamically determined target addresses of the "blr" instructions inserted at the end of RPP 2 , RPP 3 , and RPP 4 and at what timestamp instances these addresses are determined. The second column of the table corresponds to RPP 2 , the third column corresponds to RPP 3 , and the fourth column corresponds to RPP 4 .
The final state of the run-time stack M at the end of forward execution of the program is shown in Figure 15 (c). When a call is to be made to PP 3 from PP 1 , the address RA 1 , which corresponds to call location A 1 indicated by the address annotation on the edge from PP 1 to PP 3 in Figure 15 (a), is entered into M . Then, a recursive call is made to PP 3 from PP 3 itself, and address RA 3 (which corresponds to call location A 3 ) is entered over the previous entry in M . When address A 3 at the end of PP 3 is reached again, PP 3 this time makes a call to PP 5 and the corresponding address RA 3 is entered into M again. Similar steps are followed to enter the rest of the addresses into M .
During instruction-level reverse execution, the stack-tracing technique determines the target addresses of the "blr" instructions that are placed at the end of RPPs by popping the entries from M starting from the top. For instance, at the end of RPP 2 , the top entry in M of Figure 15 (c) is popped. The popped address is RA 4 ; therefore, the target address of the indirect branch at the end of RPP 2 is dynamically set to RA 4 and a jump is made to RA 4 , which is the beginning address of RPP 4 . When the end of RPP 4 is reached during reverse execution, the current top entry in M is popped once more. The popped address is RA 4 ; therefore, the control is directed to RPP 4 again. Similar steps are followed during the rest of reverse execution, which results in the correct ordering of the visits to the RPPs as seen in Figure 15 (b).
SPECIAL RCG DETAILS
In the previous sections, we covered the major steps of the RCG algorithm and described how a reverse program is built by analyzing and reversing the effects of individual instructions of a program. However, reverse code generation for loops as well as pointer-and array-based memory operations deserve additional explanation. The following two sections focus on these issues.
Handling Loops
In this section, we explain how the RCG algorithm generates reverse code for the instructions within a loop. 
Listing 5 Loop Gen(): Reverse code generation within loops
return RIG α
9 end if
end
A variable modified by an instruction α within a loop L may show a transient behavior at the early iterations of L until the values that come from outside of L are propagated into the loop body. Thus, the code that reverses the effect of α may be different for different instances of α (i.e., for the instances due to different loop iterations). Consider the following example.
Example 9.1. Figure 16 shows a loop with four instructions. The values obtained by the target operands of the instructions at successive iterations of the loop are also shown in the figure. As seen from the figure, it requires three loop iterations until a pattern is observed in the values obtained by the target operand of the first instruction. The values obtained by this operand are affected by the values input to the loop and are totally unrelated at the early instances of the first instruction in the loop. Therefore, it is necessary to reverse the effect of each such instance of the first instruction separately.
Listing 5 shows the algorithm snippet for reverse code generation within loops. In order to capture the transient behavior explained, Loop Gen() calls Gen RIG() at each traversal of L for a single instance of α (line 1 of Listing 5). In other words, at the first traversal, Loop Gen() generates a set of instructions, ζ 1 , which reverses the effect of the first instance of α; at the second traversal, it generates another set of instructions, ζ 2 , which reverses the effect of the second instance of α; and so on. Since the instructions within L repeat exactly, if a set of instructions generated to reverse the effect of an instance of α makes use of the instructions within L only, then that set can be used to reverse the effect of all the later instances of α as well. In this way, Loop Gen() can decide when to stop the traversals over L.
Ideally, the traversals over L should be repeated until Loop Gen() can construct a set that makes use of the instructions within L only. However, we limit the maximum number of traversals over a loop body to three, not only to limit the time cost of the RCG algorithm but also to limit the length of the reverse code generated for α. This number is arbitrarily chosen and can be increased at the expense of having a larger reverse program. If a set cannot be constructed within three traversals over L, we apply state-saving to generate a RIG that reverses the effect of all the instances of α (line 7). If state-saving can be avoided, on the other hand, the sets of instructions generated at each traversal (i.e., the sets from ζ 1 up to ζ 3 ) are combined to produce a RIG for α (line 2). The set of instructions to be executed within the RIG during a specific instruction-level reverse execution is determined with the help of a loop counter which distinguishes among different loop iterations.
The following example illustrates reverse code generation for loops.
Example 9.2. Figure 17 shows a symbolic version of the generated RIG for the first instruction α in the loop of Figure 16 . Figure 17 also shows the loop unrolled three times where each unrolled iteration corresponds to one of the traversals of the RCG algorithm over the loop body.
At the first traversal of the loop, Gen RIG() finds the reaching definition of the destroyed register r 1 as "r 1 = 0" at point P1. Then, Gen RIG() generates the set ζ 1 as "r 1 = 0" which reverses the effect of the first instance of α (i.e., δ 4 ) by using the redefine technique.
At the second traversal of the loop, the definition of r 1 to be recovered is the definition that reaches P2. This definition is "δ 4 : r 1 = r 2 + 2" which comes from within the loop this time. In order to recover r 1 from δ 4 , the RCG algorithm needs the value of r 2 . However, r 2 is destroyed by δ 5 between δ 4 and P2. The destroyed definition of r 2 is "δ 2 : r 2 = 3" which comes from outside of the loop. Therefore, Gen RIG() first puts into ζ 2 the instruction "r t = 3" which recovers the value of r 2 into a temporary register r t using the redefine technique (r t is used instead of r 2 to preserve the current value of r 2 ). Then, the Gen RIG() puts into ζ 2 the instruction "r 1 = r t + 2" which recovers the destroyed value of r 1 .
At the third traversal of the loop, we are at point P3. The reaching definition of r 1 at P3 is "δ 7 : r 1 = r 2 +2". The definition of r 2 used in δ 7 is "δ 5 : r 2 = r 3 × 3" and is destroyed by δ 8 before reaching P3. Therefore, we have to recover r 2 before recovering r 1 . However, r 3 as used in δ 5 does not reach point P3, either. Moreover, r 3 has been overwritten twice after being used in δ 5 : once by δ 6 and once by δ 9 . Thus, Gen RIG() first puts into ζ 3 the instructions "r t = r 3 − 1" and "r t = r t − 1" which recover the value of r 3 into a temporary r t by using the extract-from-use technique twice (once on δ 9 and once on δ 6 ). Then, Gen RIG() puts into ζ 3 the instruction "r t = r t × 3" which recovers the value of r 2 into r t by using the redefine technique. Finally, Gen RIG() puts into ζ 3 the instruction "r 1 = r t + 2" which recovers the value of r 1 . Since ζ 3 is constructed using instructions only from within the loop, ζ 3 indeed reverses the effect of the later instances of α as well. Therefore, for this example, it is sufficient to traverse the loop three times to generate a RIG for α, without state-saving. As seen in Figure 17 , the set (ζ 1 or ζ 2 or ζ 3 ) to be executed within the generated RIG during instruction-level reverse execution is determined by a loop counter (r LC ) which is inserted into the original loop.
Note that the generated reverse code in this example is unoptimized. However, the reverse code can be easily optimized by a separate pass using standard optimization techniques such as constant propagation or common subexpression elimination.
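The three sets ζ 1 , ζ 2 , and ζ 3 of Example 9.2 can be checked with a small Python simulation. The sketch rests on assumptions that go beyond the text: only the three loop instructions named in the example are modeled (the fourth instruction of Figure 16 is omitted), the form "r 3 = r 3 + 1" of the third instruction is inferred from the two extract-from-use steps, the initial value of r 3 is arbitrary, and the function names `run_loop` and `rig_alpha` as well as the way the loop counter is incremented are hypothetical.

```python
def run_loop(r3_init, iterations):
    """Forward-execute the loop and log what each instance of alpha destroys.

    Each log entry is (value of r1 before alpha, loop counter, r3 at alpha).
    """
    r1, r2, r3 = 0, 3, r3_init   # definitions reaching the loop from outside
    r_lc = 0                     # loop counter inserted into the original loop
    log = []
    for _ in range(iterations):
        r_lc += 1
        log.append((r1, r_lc, r3))
        r1 = r2 + 2              # alpha: destroys the previous value of r1
        r2 = r3 * 3
        r3 = r3 + 1              # inferred from the extract-from-use steps
    return log

def rig_alpha(r_lc, r3):
    """Recover the value of r1 destroyed by instance number r_lc of alpha."""
    if r_lc == 1:
        return 0                 # zeta_1: redefine "r1 = 0"
    if r_lc == 2:
        r_t = 3                  # zeta_2: redefine "r2 = 3" into a temporary
        return r_t + 2
    r_t = r3 - 1                 # zeta_3: extract-from-use, applied twice
    r_t = r_t - 1
    r_t = r_t * 3                # redefine r2 into the temporary
    return r_t + 2

log = run_loop(r3_init=7, iterations=6)
recovered = [rig_alpha(lc, r3) for _, lc, r3 in log]
print(recovered)
```

Under these assumptions, ζ 3 recovers the destroyed value of r 1 for the third and every later instance of α, which matches the observation that a set built only from instructions within the loop can be reused for all later iterations.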
The technique described in this section is applied in a straightforward way to the nested loop structures wherein the passes over the nested loops are completed starting from the innermost loop going to the outermost loop.
Distinguishing among Memory Accesses
One other important issue we need to mention is how the RCG algorithm reverses the effect of memory accesses performed using pointers and arrays.
In order to apply either the redefine technique or the extract-from-use technique to recover the values in memory locations, one should distinguish among different definitions made into memory locations. However, distinguishing among the definitions of memory locations is not as easy as distinguishing among the definitions of registers. This is because a memory location being written by an instruction is not always apparent within the instruction encoding, which is the case for indirect addressing. Consequently, it might be hard to determine whether two memory stores made by two different instructions are to the same location or not.
In case of indirect addressing, two specific values often help determine the location where an indirect store is made. These are the base and the offset. The target address of each indirect store instruction can be expressed as the summation of the base value with the offset value. If the base and offset values of a store instruction can be determined statically, then the store operation is unambiguous (e.g., a store operation for an ordinary variable, a pointer with a statically known target and an array with a statically known index).
On the other hand, if the base and/or the offset value of an indirect store instruction cannot be determined statically, then the store operation is ambiguous (e.g., a store operation for a pointer aliased to a statically unknown variable, an array with a statically unknown index or a store to heap memory).
In case of direct addressing, the target address being written by the store is apparent in the instruction encoding. Therefore, a direct store operation is unambiguous.
In the following paragraphs, we first explain how unambiguous memory stores can be distinguished, and then we explain how ambiguous memory stores are treated. Figure 18 shows a memory organization made by a typical compiler. In a typical compiler, all unambiguous indirect local stores within a PP use the frame pointer (or the stack pointer if the frame pointer is not available as a dedicated register) as the base and a statically known value (usually a constant) as the offset. All unambiguous indirect global stores in a program use the beginning address of the global data section as the base and a statically known value (again, usually a constant) as the offset [Aho et al. 1986] .
Since an indirect branch instruction such as a "return from a function" delimits both a high-level program function and a PP, and since a PP is also delimited by a function call instruction that may reside inside a high-level program function, a PP is always equivalent to or a subset of a high-level program function. The important point here is that for a typical compiler, the beginning address of the global data section is fixed throughout the execution of a program, and the value of the frame pointer is fixed during the execution of a high-level program function. Therefore, both the frame pointer and the beginning address of the global data section are fixed throughout the execution of a PP as well. Thus, within a PP, an unambiguous indirect store can be distinguished from other unambiguous indirect memory stores by finding the fixed base address and the statically known offset being used by the store.
On the other hand, all unambiguous direct memory stores can trivially be distinguished from all other unambiguous memory stores by checking the target address in the encoding of the direct store instruction.
Finally, ambiguous memory stores are treated conservatively. Namely, we assume that an ambiguous memory store is capable of changing the value of any memory location. Therefore, we apply state-saving both for generating the reverse of an ambiguous memory store instruction and for recovering the values of memory locations that are reachable by that ambiguous memory store instruction.
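The distinction between unambiguous and ambiguous stores can be summarized in a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the `Store` record, the base names "fp" and "global", the constant `GLOBAL_BASE`, and the helper functions. The sketch further assumes that the beginning address of the global data section is known as a constant and that frame-pointer-relative locations never coincide with absolute addresses.

```python
from dataclasses import dataclass
from typing import Optional

GLOBAL_BASE = 0x2000  # assumed fixed start of the global data section

@dataclass(frozen=True)
class Store:
    base: Optional[str]    # "fp", "global", another register, or None (direct)
    offset: Optional[int]  # constant offset or absolute address, None if unknown

def location_key(store):
    """Return a key naming the stored-to location, or None if ambiguous."""
    if store.offset is None:
        return None                                   # unknown offset or index
    if store.base is None:
        return ("abs", store.offset)                  # direct addressing
    if store.base == "global":
        return ("abs", GLOBAL_BASE + store.offset)    # fixed global base
    if store.base == "fp":
        return ("fp", store.offset)                   # fixed within a PP
    return None                                       # statically unknown base

def may_write_same_location(a, b):
    """Conservatively decide whether two stores can target one location."""
    key_a, key_b = location_key(a), location_key(b)
    if key_a is None or key_b is None:
        return True   # an ambiguous store may change any memory location
    return key_a == key_b

local_x = Store("fp", 8)
local_y = Store("fp", 12)
through_pointer = Store("r9", 0)
direct = Store(None, 0x2004)
global_var = Store("global", 4)

print(may_write_same_location(local_x, local_y))          # distinct locals
print(may_write_same_location(local_x, through_pointer))  # ambiguous store
print(may_write_same_location(direct, global_var))        # same address
```

Returning True for any pair that involves an ambiguous store mirrors the conservative treatment described above, in which such stores fall back to state-saving.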
SUMMARY OF THE OVERALL RCG ALGORITHM
The overall RCG algorithm is summarized by the flowchart shown in Figure 19 . Given a program at the assembly instruction level, the RCG algorithm first divides the program into PPs (by building PCFGs) and constructs a CG of the program (Box 1 in Figure 19 -note that the algorithmic details implementing Box 1 have already been described previously in Listing 1 and Listing 3). Then, the RCG algorithm enters a main loop where the instructions of each PP are read, one after another, and the RPPs are built.
After an instruction is read, the RCG algorithm checks whether the instruction directly modifies a register or a memory value. If yes, the RCG algorithm generates a RIG for the read instruction. If the instruction is outside a loop, the RCG algorithm generates the RIG by calling Gen RIG() directly (Box 2); otherwise, the RCG algorithm calls Loop Gen() to generate a RIG for the instruction (Group 2).
After the RCG algorithm generates a RIG for an analyzed instruction in a BB, the RCG algorithm places that RIG into the corresponding RBB in bottom-up placement order (Box 3). As described in Section 9.1, some instructions within a loop require more than one pass over the loop body (excluding the initial pass over the whole program to generate the PCFGs and the CG) before reverse code can be generated for those instructions without state-saving. Therefore, if an analyzed instruction is inside a loop and the generation of a RIG without state-saving requires another pass over the loop body, the RCG algorithm traverses the loop body once more, provided that the total number of passes over the loop body will not exceed three.
When the RCG algorithm completes the construction of the current RPP, the RCG algorithm connects the constructed RPP to the rest of the reverse program (Box 4 in Figure 19 ) and moves on to the construction of the next RPP. This process is repeated until the end of the program is reached.
PERFORMANCE EVALUATION
We tested the RCG algorithm on an MBX860 evaluation board with a PowerPC (MPC860) processor, and 4MB of DRAM. The reason we specifically chose the MBX860 board is that we wanted to explore the advantages of the RCG algorithm in a real memory-restricted platform (i.e., an "embedded" platform). However, our methodology explained in this article is also applicable to other platforms such as general purpose computers.
In order to test instruction-level reverse execution on a debugging session, we implemented a low-level debugger tool with a graphical user interface (GUI) which provides debugging capabilities such as breakpoint insertion, single stepping, and register and memory display (Figure 20). The debugger runs on a PC with Windows 2000. The PC is connected to the MBX860 board via a Background Debug Mode (BDM) interface [Motorola Inc. 1998]. We did not install any operating system on the MBX860, and we ran the benchmarks directly on the processor. Therefore, the measurement results below do not include any operating system related overheads.
The benchmark programs we used for our experimentation are selection sort (SSort), matrix multiply (MMult), an Adaptive Differential Pulse Code Modulation (ADPCM) encoder, and a Lempel Ziv Welch (LZW) code compressor. SSort sorts integer numbers which are input to SSort in an array. MMult multiplies two integer matrices which are input to MMult as arrays and writes the resulting matrix into another array. Finally, ADPCM encoder and LZW read their inputs starting from a location in memory and write the processed outputs back to another location in memory. The input data is written to memory prior to execution of ADPCM encoder and LZW. SSort, MMult, and ADPCM encoder are implemented as a single function each, while LZW is composed of three functions. The main function of LZW calls the other two functions in a loop. One of the functions that is called from main includes a recursive call to itself inside a loop. ADPCM encoder also reads and processes its input in a loop. On the other hand, SSort contains a two-level nested loop, and MMult contains a three-level nested loop. All of the benchmarks are written in the C programming language. In order to compile the benchmarks for the PowerPC 860, we used a compiler from Tasking Inc. [2001] . Note that since we do not enforce any structural constraints on the assembly code that is input to the RCG algorithm, the input assembly code can be generated in any way, even by an optimizing compiler. In our experiments, we compiled each benchmark using level-3 optimizations that include global common subexpression elimination, constant propagation, constant folding, dead code elimination, strength reduction, tail merging, spill-code reduction, loop memory-reference elimination, and global register allocation.
We could not use much larger benchmarks (such as those in the SPEC suite) on the PowerPC platform, mainly due to the memory limitations of the MBX860 board. If the six person-months spent implementing RCG on the PowerPC board were to be spent implementing RCG on a platform with larger memory (e.g., a regular PC), larger benchmarks could be chosen to measure the performance of RCG. Nevertheless, there are no scalability limitations imposed on the RCG algorithm other than the limitations stated in Section 4. This is because as long as a program satisfies the conditions stated in Section 4, the RCG algorithm generates a reverse program by partitioning the input program into smaller portions as explained in Section 6. Thus, rather than the number of program partitions in a program, it is the characteristics of the program partitions (e.g., the percentage of the instructions that require state-saving for being reversed) that play an important role in determining the efficiency of the RCG algorithm. In short, while the instruction code sizes are under 1000 bytes (excluding data) for all benchmarks we used (see Table I), we expect the improvements shown by our approach to apply equally well to larger benchmarks.
In order to compare the performance of the RCG algorithm against the previous techniques, we had to expand the previous techniques to support instruction-level reverse execution. We adapted two of the best previous techniques that are expandable to support instruction-level reverse execution without any forward execution. The first, incremental state-saving (ISS), saves the modified processor state before each instruction. The second, incremental state-saving for destructive instructions (ISSDI), saves the modified processor state only before each destructive instruction (i.e., an instruction whose target operand is different from the source operands). We used level-3 optimized assembly code for applying ISS and ISSDI as well.
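The difference between the two baselines can be made concrete with a short Python sketch that counts state-saving operations for a tiny instruction sequence. The sequence, the tuple encoding, and the helper `is_destructive` are assumptions made for illustration; the sketch applies the definition of a destructive instruction exactly as stated above and does not check whether a nondestructive instruction is actually invertible.

```python
def is_destructive(target, sources):
    """Destructive as defined in the text: target differs from all sources."""
    return target not in sources

# Hypothetical instruction sequence as (target operand, source operands).
program = [
    ("r1", ("r2",)),   # r1 = r2 + 2  -> destructive
    ("r2", ("r3",)),   # r2 = r3 * 3  -> destructive
    ("r3", ("r3",)),   # r3 = r3 + 1  -> nondestructive, undone by r3 = r3 - 1
]

# ISS saves state before every instruction.
iss_saves = len(program)
# ISSDI saves state only before destructive instructions.
issdi_saves = sum(1 for target, sources in program
                  if is_destructive(target, sources))

print(iss_saves, issdi_saves)
```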
Table I shows the sizes of the compiled benchmarks and the reverses of the benchmarks for ISS, ISSDI, and RCG. Note that since ISS and ISSDI do not actually generate a reverse program, the term "reverse code" as used in ISS and ISSDI refers to the instructions that recover the saved state in ISS and ISSDI.
The reverse code sizes obtained with RCG are approximately 1.18X to 2.19X larger than those that are obtained with ISS and ISSDI. This is because while ISS and ISSDI usually use a simple load instruction to restore a value in a register or a memory location, RCG uses a RIG that may be composed of multiple instructions. Another interesting result is that for SSort, MMult, and LZW it turns out that each reverse code obtained by ISS has the same size as the equivalent reverse code obtained by ISSDI. This is because ISSDI simply removes some of the state-restoring load instructions from the reverse code obtained by ISS and replaces them by an equal number of arithmetic instructions that undo nondestructive instructions. The rest of the instructions, on the other hand, are kept intact.
We first measured the run-time memory usage (for state-saving) with ISS, ISSDI, and RCG. The tests were performed with various input data sizes to explore how run-time memory requirements grow with the problem size. For SSort, the number of elements to be sorted were taken as 100, 1000, and 10000. For MMult, we used 4 × 4, 40 × 40, and 400 × 400 matrices. For ADPCM encoder, we experimented with 32 KB, 64 KB and 128 KB input data sizes. Finally, we ran LZW over 1 KB, 4 KB, and 16 KB of input data.
In this experiment, we ran each benchmark instrumented with state-saving instructions according to ISS, ISSDI, and RCG from the beginning forward, until the end. Then, we measured the total run-time memory used for state-saving. State-saving was applied using circular buffers which never overflow so that we could measure total memory usage even though the memory requirements with ISS and ISSDI in many cases were above the available amount on the MBX860. The results are shown in Table II. As seen in the table, when data sizes are increased, the memory requirements with ISS and ISSDI quickly exceed the 4 MB of available memory on the MBX860, while RCG still provides feasible memory usage except for the case when the input size of SSort is 10000. However, even in this case, the run-time memory requirement with RCG is 82X and 55X smaller than the run-time memory requirements with ISS and ISSDI, respectively. In general, RCG achieves from 2.5X to 2206X and from 2X to 1404X reduction in run-time memory usage as compared to ISS and ISSDI, respectively.
We also measured the forward and reverse execution times of the benchmarks for ISS, ISSDI, and RCG. For the execution time measurements, we used the decrementer counter of the PowerPC 860 processor (the PowerPC 860 provides a decrementer counter which can be programmed to decrement at 2.5 MHz on the MBX860; therefore, one tick of the decrementer corresponds to 0.4 microseconds). We measured the forward execution times both for the original benchmarks and for the benchmarks that are instrumented with state-saving instructions. In this way, we could also measure the forward execution time overheads that are caused by ISS, ISSDI, and RCG. To measure the forward execution times, each benchmark was run from the beginning forward until the end, and to measure the reverse execution times, each benchmark was run from the end backward until the beginning.
Tables III and IV depict the execution time results of the benchmarks. The dashes in Table IV correspond to the measurements in which we ran out of memory on the MBX860. As shown in Table IV, in most cases, ISS and ISSDI do not let us reverse execute the whole program because only a small fraction of the required state could be saved to limited memory.
As seen in Tables III and IV, the execution times increase linearly with the input size for the ADPCM encoder, while they increase superlinearly for SSort, MMult, and LZW. This is in accordance with the loop structures in the benchmarks. Since instructions within the innermost loops are the most frequently executed instructions, the execution times of the benchmarks increase almost linearly with the number of times the innermost loops are executed. Since SSort, MMult, and LZW include nested loops, the number of innermost-loop executions, and hence the measured execution time, grows much faster than the input data size.
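The relation between loop nesting and growth in execution time can be made concrete by counting innermost-loop iterations. The sketch below assumes a textbook selection sort (two nested loops) and a textbook triple-loop matrix multiply; the benchmarks' actual code is not reproduced here, so the counts are illustrative rather than measured.

```python
def selection_sort_inner_iterations(n: int) -> int:
    """Comparisons made by a textbook selection sort on n elements: n(n-1)/2."""
    return n * (n - 1) // 2


def matrix_multiply_inner_iterations(n: int) -> int:
    """Multiply-add steps in a textbook triple-loop n x n matrix multiply: n^3."""
    return n ** 3


# Input sizes used in the experiments for SSort and MMult.
for n in (100, 1000, 10000):
    print("SSort", n, selection_sort_inner_iterations(n))
for n in (4, 40, 400):
    print("MMult", n, matrix_multiply_inner_iterations(n))
```

Under these assumptions, a tenfold increase in input size multiplies the innermost-loop count by roughly 100 for the sort and by 1000 for the matrix multiply, whereas a single-loop kernel would grow only tenfold.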
The slowdown in reverse execution with the RCG algorithm as compared to ISS and ISSDI is between 1.16X and 1.89X. This slowdown is a direct consequence of the larger reverse programs that are generated by the RCG algorithm as compared to ISS and ISSDI. This is the only penalty we pay for the large savings in run-time memory illustrated in Table II. However, since reverse executions are usually followed by forward executions in cyclic debugging, the time lost during reverse execution may be compensated by the reduced forward execution times of the programs with the RCG algorithm. Since ISS and ISSDI instrument the programs with state-saving instructions much more heavily than the RCG algorithm does, the RCG algorithm achieves faster forward execution than ISS and ISSDI. The speedup in forward execution with RCG over ISS and ISSDI ranges between 1.2X and 2.32X. This result is also reflected in the forward execution time overheads shown in Figure 21. The forward execution time overhead is the increase in forward execution time (due to code instrumentation) expressed as a percentage of the execution time of the original (uninstrumented) code. RCG achieves from 1.5X to 403X reduction in forward execution time overhead as compared to ISS and ISSDI.
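The overhead metric defined above can be written out as a short calculation. The sketch below uses made-up timings chosen only to illustrate the arithmetic; they are not values from Tables III and IV, and the function names are hypothetical.

```python
def forward_overhead_percent(t_original: float, t_instrumented: float) -> float:
    """Increase in forward execution time as a percentage of the original time."""
    return 100.0 * (t_instrumented - t_original) / t_original


def overhead_reduction(overhead_baseline: float, overhead_rcg: float) -> float:
    """How many times smaller the RCG overhead is than the baseline overhead."""
    return overhead_baseline / overhead_rcg


# Hypothetical timings in seconds (illustrative only, not measured data).
t_orig, t_iss, t_rcg = 10.0, 25.0, 11.5

ovh_iss = forward_overhead_percent(t_orig, t_iss)  # 150 percent
ovh_rcg = forward_overhead_percent(t_orig, t_rcg)  # 15 percent
print(ovh_iss, ovh_rcg, overhead_reduction(ovh_iss, ovh_rcg))
```

Note that the reduction in overhead (10X in this made-up case) is a different quantity from the speedup in forward execution time (25.0 / 11.5, about 2.2X, for the same numbers), which is why the two ranges reported in the text differ so widely.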
The last RCG measurement compares debugging via reverse execution with debugging via program reexecution. In this measurement, we executed the 400 × 400 matrix multiply from the beginning (end) to various intermediate program points in the forward (backward) direction and measured the elapsed times. The intermediate program points correspond to the beginning of the outermost loop of matrix multiply at different iterations of this loop. For 400 × 400 input matrices, the outermost loop executes 400 times in total. The results are shown in Figure 22. Suppose that just as the 400th iteration of the outermost loop (the right-hand side of Figure 22) ends, a bug is noticed. Suppose further that the bug source is suspected to be in the 300th loop iteration. Figure 22 shows that while it would take 107 seconds to reverse execute back to iteration 300 using RCG-compiled code, it would take 137 seconds to forward execute the uninstrumented (original) code from loop iteration zero to loop iteration 300. In fact, once at the 400th iteration of the outermost loop, for any target loop iteration greater than 280, it is faster to reverse execute to the target iteration than to forward execute from loop iteration zero to the target iteration. In short, Figure 22 empirically demonstrates that for a fairly large set of potential bugs localized close to the current iteration point (in the case of Figure 22, the 400th iteration of the outermost loop), it is faster to use RCG to go to the bug than to reexecute the original code from the beginning.
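The break-even point quoted above can be approximated from the two timings given in the text. The sketch below assumes that both forward and reverse execution take a constant amount of time per outermost-loop iteration; that linearity is an assumption of this example, not a claim about the measured curves in Figure 22, and the function name is hypothetical.

```python
def break_even_iteration(total_iters: int,
                         fwd_secs_per_iter: float,
                         rev_secs_per_iter: float) -> float:
    """Target iteration k at which both approaches take the same time.

    Forward reexecution from iteration zero costs fwd_secs_per_iter * k.
    Reverse execution from the end costs rev_secs_per_iter * (total_iters - k).
    Setting the two equal and solving for k gives the expression below.
    """
    return total_iters * rev_secs_per_iter / (fwd_secs_per_iter + rev_secs_per_iter)


# Per-iteration rates derived from the two data points in the text.
fwd_rate = 137.0 / 300          # 137 s to run forward from iteration 0 to 300
rev_rate = 107.0 / (400 - 300)  # 107 s to run backward from iteration 400 to 300

k = break_even_iteration(400, fwd_rate, rev_rate)
print(f"break-even near iteration {k:.1f}")
```

Under the linear assumption the break-even falls just above iteration 280, consistent with the threshold read from the figure: for targets beyond that point, reverse execution from iteration 400 is the faster route.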
One option to speed up the program reexecute approach might be to take periodic checkpoints of the whole program state so that the programmer can restart the program from the nearest checkpoint instead of from the beginning of the program. Then, he/she can forward execute from that point on to reach the target point. However, checkpointing of the whole program state requires more memory than incremental state-saving if it is performed frequently enough to provide fast state restoration. For example, assume that we take periodic checkpoints of the 400 × 400 matrix multiply every 100 ms to provide a worst-case reverse execution time of 100 ms. To take an absolute checkpoint, we typically need to save at least the value of every element in the result matrix. Since each element is a 4-byte integer, each checkpoint requires at least 625 KB of memory. With a total run-time of 183 seconds, the 400 × 400 matrix multiply requires 1830 checkpoints and thus approximately 1.1 GB of run-time memory. However, RCG requires only 1.2 MB as shown in Table II. Therefore, a periodic checkpointing approach cannot achieve the speed of our technique in reaching the target point without sacrificing a large amount of memory.
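The checkpoint memory estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures given in the text and, like the text, counts just the result matrix per checkpoint; the function name is hypothetical, and KB and GB are interpreted as powers of 1024, which is the interpretation under which the quoted 625 KB matches.

```python
def checkpoint_memory(rows: int, cols: int, bytes_per_element: int,
                      run_time_ms: int, interval_ms: int):
    """Return (bytes per checkpoint, number of checkpoints, total bytes)."""
    per_checkpoint = rows * cols * bytes_per_element
    num_checkpoints = run_time_ms // interval_ms
    return per_checkpoint, num_checkpoints, per_checkpoint * num_checkpoints


per_ckpt, count, total = checkpoint_memory(
    rows=400, cols=400, bytes_per_element=4,  # 4-byte integer elements
    run_time_ms=183_000,                      # 183 s total run-time
    interval_ms=100,                          # one checkpoint every 100 ms
)

print(per_ckpt / 1024, "KB per checkpoint")
print(count, "checkpoints")
print(round(total / 1024 ** 3, 2), "GB in total")
```

Because the total scales inversely with the checkpoint interval, lengthening the interval tenfold would cut the memory tenfold but raise the worst-case restoration time to one second.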
On the other hand, in order to speed up the program reexecute approach with less memory cost when compared to the memory cost of the periodic checkpointing approach explained in the previous paragraph, one can also perform ISS sparsely with large checkpointing intervals (i.e., instead of applying ISS before every instruction). However, ISS usually requires run-time checks to determine which registers/memory locations are modified between two checkpoints and thus may result in extra run-time overhead. To limit the run-time overhead at a level close to RCG, the best ISS approach to apply would be a compiler-based ISS such as the one used in the replay technique by Miller and Choi [1988]. In compiler-based ISS, a data-flow analysis is carried out to determine the state to be saved between two ISS checkpoints.
To compare RCG against the program reexecute approach using a compiler-based ISS, we also applied compiler-based ISS to the 400 × 400 matrix multiply at the basic block level (i.e., state is saved at the beginning of each basic block). The use of compiler-based ISS results in 980 MB of run-time memory usage to provide reverse execution over the whole program, while RCG requires only 1.2 MB as shown in Table II. On the other hand, since compiler-based ISS is applied sparsely, the time elapsed to reverse execute one instruction using compiler-based ISS turns out to be 7X larger on average than the time elapsed to reverse execute one instruction using RCG. Therefore, even compiler-based ISS cannot achieve the speed of RCG without sacrificing a large amount of memory.
CONCLUSION
Executing a program repeatedly is an easy and effective debugging method applied by most programmers. However, every time a program is restarted, parts of the program that have already executed without error have to be reexecuted unnecessarily. The unnecessary reexecution of these program parts constitutes a significant portion of the debugging time. Even worse, restarting programs that run for very long time periods is simply impractical.
Reverse execution cuts down the time spent for repetitive debugging by localizing program reexecutions around the bugs in a program. When a bug location is missed by executing a program too far, the program state at a point before the bug location can be recovered by reverse execution, and the program can be reexecuted from that point on, without having to restart the whole program.
Conventional techniques rely heavily on saving processor state before the state is destroyed. However, state-saving causes significant memory and time overheads during execution of a program. In an effort to reduce memory and time overheads caused by state-saving, the state-saving frequency can be reduced. However, reducing the state-saving frequency increases the distance between the point where the program is stopped and the closest point from where the program can be restarted, which effectively reduces the benefit of reverse execution.
In this article, a new reverse execution methodology for programs is introduced. To realize reverse execution, the methodology generates a reverse program from an input program by a static analysis at the assembly level. The methodology is new because state-saving can be largely avoided even with programs including many destructive instructions. This cuts down memory and time overheads introduced by state-saving during forward execution of programs. Moreover, as a new feature, the methodology provides instruction by instruction reverse execution at the assembly instruction level without ever requiring any forward execution of the program under consideration. In this way, a program can be run backwards to a state as close as one assembly instruction before the current state, which is very useful for debugging programs written in assembly.
Since generation of a reverse program is performed from the assembly instructions of the program under consideration, the methodology introduced in this article provides instruction-level reverse execution for programs without source code. Also, since both the forward code and the reverse code are executed in native machine instructions, these executions can be performed at the full speed of the underlying hardware.
The methodology is currently limited to single-threaded and non-self-modifying programs that run on a single processor. Furthermore, in cases where the state-regeneration techniques presented in this article are inapplicable (such as exceptions or reversing of file I/O operations), state is recovered by conventional state-saving methods.
The methodology has been implemented for the PowerPC 860 target processor and has been verified using a custom-made debugger. The measurements with the selected benchmarks show from 2X to 2206X reduction in run-time memory usage, from 1.5X to 403X reduction in forward execution time overhead, and from 1.2X to 2.32X reduction in forward execution time. Furthermore, due to the reduction in memory usage, the methodology presented in this article can provide reverse execution in many cases where other methods run out of available memory. However, for cases where there is enough memory available for other methods, our method results in a 1.16X to 1.89X slowdown in reverse execution.
In conclusion, we have presented a set of algorithms and techniques to enable generation of a reverse program able to reverse execute a program one assembly instruction at a time. Other than a few (hopefully rare) cases (e.g., floating point divide or reading external user input) where state-saving must be applied, our methodology uses a generated reverse program to undo the effects of forward execution of the program's assembly instructions. This is the first time known to the authors that this has been accomplished.
