Instruction-level Reverse Execution for Debugging
Tankut Akgul and Vincent J. Mooney III
Technical Report GIT-CC-02-49
Georgia Institute of Technology
Abstract
Reverse execution provides a programmer with the ability to return a program to
a previous state in its execution history. The ability to execute a program in reverse
is advantageous for shortening software development time. Conventional techniques
for reverse execution rely on saving a state into a record before the state is destroyed.
State saving introduces both memory and time overheads during forward execution.
This report introduces a reverse execution methodology at the assembly
instruction level with low memory and time overheads. The methodology generates
from a program a reverse program by which a destroyed state is almost always re-
generated rather than being restored from a record. This significantly reduces state
saving.
The methodology has been implemented on a PowerPC processor with a custom-
made debugger. As compared to previous work using state saving techniques, the
experimental results show 2.5X to 400X memory overhead reduction for the tested
benchmarks. Furthermore, the results with the same benchmarks show an average of
4.1X to 5.7X reduction in execution time overhead.
1 Introduction
As human beings are quite prone to making mistakes, it is very difficult for a programmer to
write an error-free program without testing it. For this reason, debugging is an important
and inevitable part of software development.
Locating bugs by just looking at source code is quite difficult. Consequently, a run-time
interaction with the program is very useful for debugging. Unfortunately, many of the bugs
in programs do not cause errors immediately, but instead the bugs show their effects much
later in program execution. For this reason, even the most careful programmer equipped
with a state-of-the-art debugger might well miss the first occurrence of a bug and thus might
have to restart the program. Furthermore, for difficult-to-find bugs, this process might
have to be repeated multiple times. Even worse, for intermittent bugs due to rare timing
behaviors, the bug might not reappear right away when the program is restarted.
A typical debugging cycle that many programmers go through is shown in Figure 1. After
an error in the program is detected, the location of the responsible bug or bugs can usually
be determined either immediately or after going through an inner cycle between program
restart and re-execution. However, every time a restart occurs, parts of the program that
already executed without errors have to be re-executed unnecessarily. These unnecessary
re-executions prolong the overall debugging time.
Figure 1: A typical debugging cycle.
Reverse execution provides the programmer with the ability to return to a particular
previous state in program execution. By reverse execution, program re-executions can be
localized around a bug in a program. When the programmer misses a bug location by over-
executing a program, he/she can roll back to a point where the processor state is known to
be correct and then re-execute from that point on without having to restart the program.
This eliminates the requirement to re-execute unnecessary parts of the program every time
a bug location is missed, thus potentially reducing the overall debugging time significantly.
In this report, a novel reverse execution methodology in software is described. The de-
scribed methodology is unique in the sense that it provides reverse execution at the assembly
instruction-level granularity and yet still has reasonable memory and time overheads when
the program is being executed. Note that in the rest of this report, the word “instruction”
refers to an assembly instruction.
In Section 2, the main challenges of reverse execution are explained and the motivation
behind this report is stated. In Section 3, related work is presented. In Sections 4 and 5, the
approach of this report is introduced. The experimental results are presented in Section 6.
Finally, Section 7 concludes the report.
2 Background and Motivation
An execution of a program T on a processor P can be represented by a transition among a
series of states S = (S0, S1, ..., Sn) where a state Si can be written as a combination of pro-
gram counter (PCi), memory (Mi) and register (Ri) values of P . From this representation,
reverse execution of a program can be defined as follows:
Definition 2.1 Reverse Execution: Reverse execution of a program T running on a processor P can
be defined as taking P from its current state Si = (PCi,Mi, Ri) to a previous state Sj = (PCj ,Mj , Rj)
(0 ≤ j < i ≤ n). The closest achievable distance between Si and Sj determines the granularity of
the reverse execution. If state Sj is allowed to be as early as one instruction before state Si, then the
reverse execution is said to be at the instruction-level granularity. □
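For concreteness, a state in Definition 2.1 can be pictured as a full snapshot of the processor. The following C sketch (with hypothetical sizes; the actual layout depends on P) is one way to represent such a snapshot:

    /* A snapshot S_i = (PC_i, M_i, R_i) as in Definition 2.1. The register
       count and memory size are hypothetical and depend on the processor P. */
    typedef struct State {
        unsigned int  pc;        /* program counter value PC_i */
        unsigned int  regs[32];  /* register values R_i */
        unsigned char *mem;      /* memory image M_i */
    } State;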
The simplest approach for obtaining a previously attained state is saving that state be-
fore it is destroyed. However, saving a state during execution of a program introduces two
overheads: memory and time. A solution to reduce memory and time overheads would be
to decrease the frequency of state saving during program execution. However, this prevents
immediate return to an arbitrary point in execution history where state is not saved. There-
fore, in applying state saving, there usually exists a tradeoff between the granularity of reverse
execution and memory/time overheads due to state saving. The less trace data is collected
during execution of a program, the coarser the granularity of reverse execution will be, and
vice versa.
Performance and memory constraints or a lack of compiler support usually forces the
programming of some software components, such as small-scale embedded applications,
firmware for consumer electronics, DSP libraries and operating system modules like
schedulers, high-performance I/O routines or device drivers, to be carried out in assembly
language. For instance, the majority of the boot code for the computer system of the
Pathfinder spacecraft was written in assembly language because it was critical for the
computer to boot up very quickly in case of a failure [26]. Therefore, during debugging of
such software components, program-
mers have to be involved in instruction-level program execution. This is why most of the de-
bugger tools for embedded software contain instruction-level execution views. Thus, reverse
execution at the instruction-level granularity turns out to be very helpful when debugging
these sorts of software components.
During debugging of programs written in either a high-level or a low-level programming
language, programmers typically use a single-stepping facility to locate bugs. It is
not uncommon for programmers to miss a bug location just by executing one
more step over the next statement or instruction in the program. In such a case, instruction-
level reverse execution provides an extremely fast backup capability.
However, due to the tradeoff between the granularity of reverse execution and mem-
ory/time overheads, providing instruction-level reverse execution by state saving translates
into very high memory and time overheads during execution of a program. Therefore, our
goal is to achieve reverse execution at the native instruction level with low memory and time
overheads, which will open the way for addition of a missing feature, instruction-level reverse
execution capability, to state-of-the-art debuggers.
3 Related Work
Reverse execution has been researched in several contexts. In this section, we will mention
previous work according to different application areas of reverse execution and also according
to different techniques applied.
Zelkowitz provides a reverse execution capability by inserting trace statements into the
programming language [30]. Each trace statement includes an option which indicates either
a condition or a label. Program state is captured starting from a trace statement until the
condition indicated by the trace statement is satisfied or until the label indicated by the trace
statement is reached. However, the programmer has to anticipate which parts of the program
he or she might have to re-execute and thus has to insert trace statements beforehand.
Agrawal et al. provide a statement-level reverse execution capability of a program written
in a high-level programming language [2]. They statically associate with each assignment
statement a set of variables, called a change-set, which is modified by that statement. Then,
during the execution, the associated variables in the change-set are recorded for reverse
execution. However, obviously, although this approach provides a statement-level reverse
execution capability, it might cause large memory and time overheads during program exe-
cution, especially with programs which modify the state frequently.
Reverse execution is also applied in so-called replay techniques for efficient debugging
of nondeterministic sequential or parallel programs using either hardware [4, 25] or soft-
ware [11, 20, 23, 24]. In a replay technique, first, the state of a program is saved at a coarser
granularity during execution of the program and then the program state at a finer granular-
ity is reconstructed by replaying the program using previously saved runtime information.
In hardware approaches, state saving is handled by hardware with inflexibility and high cost
but with little or no performance overhead. On the other hand, in software approaches,
state saving is handled by software with flexibility and low cost but with high performance
overhead. A typical drawback of these replay techniques is that since the recorded trace
only keeps partial information about program state, execution can be restarted only at the
beginning of a time interval in execution history but not at an arbitrary program point.
Reverse execution finds its application in a limited sense in the area of debugging opti-
mized code as well [1, 17, 29]. Hennessy introduces the term “currency” of a variable. A
variable is current at a program point if the value of the variable at that program point
is the same as the variable’s expected value which is deduced from the source code. Since
code optimizations such as code motion and dead variable elimination may move or remove
assignments to variables in the object code, the value of a variable at a certain point in the
optimized code may not be equal to the value of the variable at the corresponding point in
the unoptimized code, which causes the variable to be “noncurrent” at that program point.
In such a case, the current value of the variable has to be recovered to provide the user with
a consistent view of the program being debugged. This recovery operation is where reverse
execution comes into play. A typical recovery technique in this field is to reevaluate noncur-
rent variables using appropriate definitions of those variables in the program. However, since
the main focus in this area has been on the determination of whether a variable is current
or not at a program point rather than on the recovery of a noncurrent variable, the recovery
techniques applied in this area are generally very restrictive and ineffective. For instance,
Wismuller reports that only 2-5% of all encountered noncurrent variables can be recovered
in his benchmarks [29].
Floyd makes use of a reverse execution, or backtracking, approach in the area of nondeter-
ministic algorithms [13]. A nondeterministic algorithm is an algorithm which may come up
with different solutions to a problem at each run of the algorithm. However, the solution is
not reached by a random process but by intelligently and incrementally constructing a
path which leads to success. In Floyd’s approach, whenever a nondeterministic algorithm
enters a path leading to a dead end, the algorithm state at the most recent point where a
decision is made is restored and alternative solutions are sought from that point on. In this
way, all possible solutions out of a nondeterministic algorithm can be obtained, which essen-
tially converts a nondeterministic algorithm into a deterministic one. This technique turns
out to be very useful for theorem proving in artificial intelligence as well. Floyd achieves
state restoration by defining a reverse operation for each operation in a nondeterministic
algorithm. However, except for constructive operations such as “x = x + 1”, reverse operations
are realized by applying state saving.
Reverse execution is also used in computer science education where students can easily
navigate back and forth through well-known algorithms to understand the behavior of such
algorithms. For this purpose, the common technique applied is program animation [6, 9].
Program animation constructs a virtual machine with a reversible set of instructions. Since
these instructions are reversible, the program can be run backwards. However, in program
animation, a program can only be interpreted, which slows down the animation considerably
and makes it impossible to execute the program using native machine instructions, even in
the forward direction. Moreover, since reversible instructions are usually constructed as stack
operations, a significant amount of stack memory may be required in program animation.
Two other application areas of reverse execution are optimistic or speculative computa-
tion [14, 15, 18] and fault tolerance [8, 19]. A computation is optimistic if incorrect compu-
tation is allowed during execution. In parallel executions, tasks usually have to block due
to synchronization requirements on shared data. In optimistic parallel executions, blocking
of the tasks on shared data is prevented and the tasks are allowed to execute independently,
which potentially improves the execution performance but at the same time allows incor-
rect computation. Then, errors caused by possible incorrect computations are recovered by
rolling back the computation of erroneous tasks to a point in time where state is known
to be correct. Similarly, reverse execution for fault tolerance is performed by rolling back
in case software errors occur, which is usually seen in places such as database transaction
systems [5, 16].
Rolling back computations or transactions is usually achieved by periodic or incremental
state saving. In periodic state saving [12], the whole processor state is recorded periodically
at certain checkpoints during simulation. Then, a previous state at a checkpoint can be
recovered by restoring that state from the record. However, in this method, a previous
state at an arbitrary point that is not a checkpoint cannot be immediately recovered, which
results in a coarser granularity reverse execution. If the checkpointing interval is reduced in
an effort to provide a finer granularity reverse execution, memory and time overheads of
state saving are increased. Moreover, recording the whole processor state at each checkpoint
causes redundancy because some portion of the processor state may be kept unchanged
throughout several checkpoints. In incremental state saving [28], instead of recording the
whole processor state, only the modified parts of a state are recorded. However, in programs
where the modified state space is large, memory and time overheads of incremental state
saving might again exceed affordable limits.
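As an illustration of incremental state saving, the following C sketch (buffer name and size hypothetical) records only the location and old value destroyed by each write, and restores values in last-in, first-out order:

    #include <stddef.h>

    /* Incremental state saving sketch: before each destructive write, log the
       target location and its old value; undo restores them in LIFO order. */
    typedef struct { int *addr; int old; } SaveRecord;

    static SaveRecord trace[4096];   /* hypothetical trace buffer */
    static size_t     top = 0;

    static void logged_write(int *addr, int new_val) {
        trace[top].addr = addr;      /* where the destruction happens */
        trace[top].old  = *addr;     /* the value about to be destroyed */
        top++;
        *addr = new_val;
    }

    static void undo_last_write(void) {
        top--;
        *trace[top].addr = trace[top].old;
    }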
Carothers et al. introduce another approach for optimistic parallel simulations [7]. This
approach is source transformation. In source transformation, the source code (e.g., in C)
is transformed to a reversible source code version excluding destructive statements such as
direct assignments. For destructive statements, state saving is applied. Consequently, the
execution time and memory requirement of the transformed code are increased. Source
transformation does not provide reverse execution at the instruction-level granularity, but
instead at the source code granularity.
In the next section, we will give an overview of how we achieve an instruction-level
reverse execution of a program under consideration. Then, the details of our approach will
be explained in Section 5.
4 Overview of our Approach
Our approach is mainly based on regenerating a previously destroyed state rather than
restoring the state from a record. When state regeneration is not possible, however, we
recover a destroyed state by state saving. Therefore, our solution is a hybrid solution between
state regeneration and state saving. In this section, we will explain how we achieve state
regeneration. We will also describe our state saving method in Section 4.3.
Suppose that an execution of a program T on a processor P causes P to attain the
series of states S = (S0, S1, ..., Sn) where the distance between two consecutive states
is one instruction. Now, assume that we can generate another program RT , the reverse of
T, such that when a specific portion of RT is executed in place of T while P is at a state
Si = (PCi, Mi, Ri), the state of P can be brought to a previous state Sj = (PCj, Mj, Rj)
(0 ≤ j < i ≤ n). In other words, RT recovers a previously destroyed state. Then, the
execution of T can be reversed by executing RT in place of T. However, in practice, it might
be hard to implement such a program RT. This is due to the following two reasons:
(1) Since both T and RT should run on P and since T and RT are supposed to be placed
at different address spaces in memory, in order to execute RT , the program counter
of P should be moved from the address space of T to the address space of RT . Thus,
executing RT alone to reverse execute T may not restore the program counter value.
(2) Typically in a processor P which runs a program T , only a subset of changing memory
and register states may represent the changing state of T . In other words, an execution
of T on P may cause some indirect changes in some memory and register values and
these changes may not be directly linked with the changes in the state of T . As an
example, an operation such as “c = a + b” in T which runs on processor P may
change either the value of a memory location or a register whichever is used to keep
the variable c. However, this operation may also indirectly change the value of the
overflow register of P if an overflow occurs during the addition operation. This sort of
indirect change in the overflow register usually occurs transparently to the programmer
and thus is typically hard to undo.
Therefore, let us define a processor state S′ = (M′, R′) which excludes the program
counter (PC) value and which includes only directly modified memory (M′) and directly
modified register (R′) values (i.e., M′ and R′ only include the memory locations and
registers that appear as operands of the instructions of T). Moreover, let us define M′ to
interchangeably represent either processor cache or main memory, whichever is used for
keeping a program value at a certain time.
Given our definition of state S ′, we now introduce an instruction-level reverse program
as follows:
Definition 4.1 Instruction-level Reverse Program: Suppose that a processor P attains the series
of states S′ = (S′0, S′1, ..., S′n) during an execution of a program T where, between a state S′i ∈ S′
and the preceding state S′i−1 ∈ S′, there exists only one instruction which directly modifies a memory
or a register value. Now, suppose that another program RT exists such that when a specific portion of
RT is executed with P being at a state S′i = (M′i, R′i), the state of P can be brought to a
previous state S′j = (M′j, R′j) (0 ≤ j < i ≤ n). If RT contains an executable portion for changing the
state of P from any state S′i ∈ S′ to any other previous state S′j ∈ S′ (j < i) for any possible state
sequence S′ during execution of T, then RT is called the instruction-level reverse program of T. □
Assuming that we can generate an instruction-level reverse program RT of T , we can
recover all memory and register values that are directly modified by T for every possible
execution of T . However, since the program counter value carries important debugging
information, we still have to provide a means for restoring the program counter value. We
solve this problem by leaving the recovery of the program counter value to the debugger
tool. In other words, the debugger tool associates the address of each instruction in T
with the beginning address of the corresponding portion in RT which reverses the effect
of that instruction. In this way, when a part of T is reverse executed by executing the
corresponding portion in RT , the debugger tool restores the value of the program counter
by using the connection between the addresses in T and RT . Similarly, we handle the
recovery of indirectly modified memory/register values which have an effect on T ’s state by
the help of the debugger tool. For more information about recovering indirectly modified
memory/register values, please refer to the Appendix.
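One way to picture this association is a debugger-side lookup table; the C sketch below is hypothetical and only illustrates the pairing of addresses:

    /* Hypothetical debugger-side table entry: pairs the address of an
       instruction in T with the start address of the portion of RT that
       reverses it, so the debugger can restore the PC after a reverse step. */
    typedef struct {
        unsigned int fwd_addr;   /* address of an instruction in T */
        unsigned int rig_addr;   /* start address of its reverse portion in RT */
    } AddrPair;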
In order to be able to generate an instruction-level reverse program RT for a program
T running on a processor P, we should first relate the states in a particular sequence S′
attained by P to the instructions in T.
Definition 4.2 The Relation of a State Sequence to an Instruction Sequence: The state sequence
S′ = (S′0, S′1, ..., S′n) attained during an execution of a program T on a processor P can be associated with a
set of instructions in T which complete in a sequence I = (α1, α2, ..., αn) where αi ∈ I (1 ≤ i ≤ n)
changes the state of P from S′i−1 ∈ S′ to S′i ∈ S′. Note that since a state S′i ∈ S′ includes neither
the program counter value nor indirectly modified memory and register values, the sequence I does not
contain any branch instructions (which modify the program counter value) or the instructions that only
indirectly modify memory or register values. □
Now, we will define another term reverse instruction group as follows:
Definition 4.3 Reverse Instruction Group (RIG): Suppose that one could generate a group of one
or more instructions denoted by RIGi for an instruction αi ∈ I such that if RIGi is executed with P
being at state S′i ∈ S′, the state of P can be brought back to state S′i−1 ∈ S′. In other words, RIGi
can undo the effect of αi on P’s state. We call such a group RIGi a reverse instruction group (RIG).
We state that RIGi is a group consisting of one or more instructions because multiple instructions may
be needed to reverse the effect of αi. □
Then, the effect of the complete sequence I in Definition 4.2 can be reversed by executing
the corresponding RIGs in an order opposite to the completion order of I, that is, by gener-
ating a sequence such as IRIG = (RIGn, RIGn−1, ..., RIG1) where a reverse instruction
group RIGi ∈ IRIG reverses the effect of αi ∈ I (1 ≤ i ≤ n).
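As a small illustration (in C, with arbitrary values), two constructive updates α1 and α2 can be undone by executing their RIGs in the opposite completion order:

    int main(void) {
        int a = 5, b = 7;
        a = a + b;            /* α1 */
        b = b ^ a;            /* α2 */
        /* reverse execution: I_RIG = (RIG2, RIG1) */
        b = b ^ a;            /* RIG2: xor by the same unchanged value undoes α2 */
        a = a - b;            /* RIG1: subtraction undoes α1 */
        return (a == 5 && b == 7) ? 0 : 1;   /* original state regenerated */
    }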
Therefore, in this report, we introduce a static algorithm, the reverse code generation
(RCG) algorithm, which generates a RIG for each instruction (excluding the branch in-
structions and the instructions that only indirectly modify memory and register values) in
a program T and combines the generated RIGs to make these RIGs complete in an order
opposite to the completion order of the instructions in T . Since the instruction sequence
I that may result from an execution of T may vary according to dynamic control flow of
T , the RCG algorithm combines the RIGs by binding the RIG sequence to be executed
during reverse execution to the dynamic control flow information of T . However, since inter-
procedural control flow information is hard to capture statically (e.g., due to indirect function
calls), the RCG algorithm is mainly intra-procedural. That is, the RCG algorithm combines
the RIGs to generate the reverse versions of the procedures/functions in a program rather
than to generate the instruction-level reverse program directly. Then, the RCG algorithm
combines the reverse versions of the procedures/functions by a glue code which may employ
state saving (see Section 4.5).
Note, however, that a perfect separation of a procedure/function F from the other proce-
dures/functions within a program T may not always be possible because there may be calls
to other procedures/functions within the body of F . Therefore, in such a case, the RCG
algorithm first divides F into sub-procedures/sub-functions at the assembly level which are
separated from each other according to calls to other procedures/functions within F . Then,
each sub-procedure/sub-function is treated as if it were a standalone procedure/function.
Thus, unless otherwise stated, we will use “procedure/function” in the rest of the report to
refer to such a sub-procedure/sub-function produced by the RCG algorithm.
Listing 1 shows the pseudo code illustrating the main function of the RCG algorithm.
The RCG algorithm first calls a function, Init RCG, which prepares the data structures that
are used by the RCG algorithm (line 1 of Listing 1). Then, the RCG algorithm enters a main
loop (line 2 of Listing 1) where it analyzes each procedure/function, assembly instruction by
assembly instruction, in the order the instructions are placed by the compiler (lexical order).
Listing 1 The main function of the RCG algorithm
Input: A program T
Output: An instruction-level reverse program RT for T
begin
1  Init RCG()
2  for all procedure/function Fi ∈ T do
3    cur instr = address of the first instruction of Fi
4    while there are unread instructions in Fi do
5      α = Read instruction(cur instr) /*read the instruction pointed to by cur instr*/
6      Find CF()
7      if α directly modifies a memory location or a register then
8        RIGα = Gen RIG(α)
9        if RIGα is complete then
10         Combine RIGs(RIGα)
11       end if
12     end if
13     if (α is in a loop L) ∧ (end of L is reached) then
14       if (L requires another traversal) then
15         cur instr = address of the first instruction of L
16       else
17         cur instr = address of the next instruction in Fi
18       end if
19     else
20       cur instr = address of the next instruction in Fi
21     end if
22   end while
23   Combine Procs()
24 end for
end
After an instruction is read, the RCG algorithm executes a function, Find CF(), which
gradually obtains the intra-procedural control flow information while the program is being
scanned (line 6 of Listing 1). Then, if the instruction that has been read directly modifies a
memory or a register value, the RCG algorithm calls a function named Gen RIG() (line 8
of Listing 1). Gen RIG() is responsible for the generation of a RIG for the instruction
under consideration. Gen RIG() may immediately generate a RIG such that the generated
RIG employs either state saving or state regeneration or both (see Section 4.3). On the
other hand, if the instruction under consideration is within a loop, Gen RIG() may also
construct a RIG for the instruction partially and may leave the completion of the generated
RIG to the next traversal of the loop provided that the number of traversals over the loop
will not exceed three. This delayed RIG generation is performed in order to avoid state
saving within the loop (see Section 4.3.1). If the number of traversals will exceed three,
however, Gen RIG() generates a RIG by state saving. Finally, if a complete RIG is generated
by Gen RIG(), another function, Combine RIGs(), combines the generated RIG with the
previously generated RIGs using the control flow information obtained by Find CF() (line 10
of Listing 1). At the end of the main loop, when the reverse version of the procedure/function
that is currently being analyzed is completed, the RCG algorithm connects the reverse
version of the current procedure/function to the reverse versions of the previously analyzed
procedures/functions by calling a function named Combine Procs() (line 23 of Listing 1).
In the following sections, we will describe the functions that are called by the main
function of the RCG algorithm.
4.1 Init RCG(): Building the initial data structures of the RCG
algorithm
Since the RCG algorithm operates on each procedure/function separately, the first thing
to do is to determine the procedures/functions from the instructions of the program under
consideration.
The RCG algorithm determines the procedures/functions in a program by constructing a
partitioned control flow graph, PCFG=(N ,E,start,exit), for each procedure/function in the
program. N is the set of nodes, E is the set of edges representing the flow of control between
the nodes and start and exit are the unique entry and exit nodes of the PCFG, respectively.
Each node in the PCFG represents a basic block (BB). A basic block is a single entry, single
exit block of a maximal number of consecutive instructions. Since the PCFG construction
is performed over assembly instructions, a BB in the PCFG may have at most two outgoing
edges, one for the target path and the other for the fall-through path of a conditional branch
instruction ending that BB (i.e., a multi-way branch in a high-level programming construct,
such as a C “switch” statement, is expressed by a combination of two-way branches at the
assembly level).
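A PCFG node can thus be pictured as follows (a C sketch with hypothetical field names):

    /* A basic block in the PCFG: at most two outgoing edges, one for the
       target path and one for the fall-through path of its closing branch. */
    typedef struct BasicBlock {
        unsigned int first_addr;        /* address of the first instruction */
        unsigned int last_addr;         /* address of the closing instruction */
        struct BasicBlock *taken;       /* target path (NULL if absent) */
        struct BasicBlock *fallthrough; /* fall-through path (NULL if absent) */
    } BasicBlock;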
Listing 2 Init RCG(): Building the initial data structures of the RCG algorithm
Input: A program T
Output: The PCFGs for the procedures/functions in T and the CG for T
begin
1 i = 0
2 repeat
3 PCFGi = ∅ /*initialize PCFGi to be empty*/
4 PCFGi ∪= start block
5 Label Edges(start block)
6 repeat
7 α = Read the next instruction()
8 if end of current BB is reached then
9 Add current BB to PCFGi
10 Label Edges(current BB)
11 end if
12 until (α = “call”) or (α = “return”)
13 PCFGi ∪= exit block
14 Grow CG()
15 i = i+1
16 until end of the program is reached
end
Listing 2 shows the pseudo code for the function Init RCG(). Init RCG() builds a PCFG
for each procedure/function in the program under consideration by reading the instructions
of the program in a loop (lines 6 to 12 of Listing 2). Init RCG() starts the construction of
PCFGi by inserting a start block at the beginning of PCFGi (line 4 of Listing 2) and by
calling a function Label Edges() which labels the outgoing forward edge of the start block
(line 5 of Listing 2). Edge-labeling, which will be explained in Section 5.1, is performed to
assist in the determination of the intra-procedural control flow information and the generation
of the RIGs. Then, in the loop, Init RCG() adds BBs to PCFGi until a “call” instruction
(i.e., an instruction that is used to call a procedure/function) or a “return” instruction (i.e.,
an instruction that is used to return from a procedure/function call) is encountered in the
program being analyzed. After a new BB is added to PCFGi, Init RCG() calls Label Edges()
to label the outgoing forward edges of the newly added BB (line 10 of Listing 2). When
Init RCG() encounters a “call” or a “return” instruction in the program being analyzed,
Init RCG() ends the construction of PCFGi by adding an exit block to the end of PCFGi
(line 13 of Listing 2). The instruction just after the call or the return instruction, on the other
hand, starts a new procedure/function and thus a new PCFG. After a PCFG is constructed,
Init RCG() calls Grow CG() which gradually constructs another directed graph, a call graph
(CG), for the program under consideration (line 14 of Listing 2). The CG is used for obtaining
the inter-procedural control flow information and will be explained in Section 4.5.
4.2 Find CF(): Finding the intra-procedural control flow information
In this section, we will give an overview of the function Find CF() (called from line 6 of List-
ing 1), namely, we will outline how the RCG algorithm obtains the control flow information
of a procedure/function under consideration.
As explained in Section 4.1, each node in the PCFG of a procedure/function designates
a BB. The important property of a BB is that the instructions within a BB always complete
in lexical order. In other words, the completion order of the instructions within the BBs is
not dependent on any condition. This lack of dependence automatically fixes the ordering of
the corresponding RIGs in the reverse code. Therefore, the PCFG construction reduces the
needed intra-procedural control flow information to the information of control flow among
the BBs of a procedure/function only.
A confluence point of edges encountered in the PCFG of a procedure/function is the
only point where a decision has to be made about the control flow during reverse execution.
Therefore, the only information required about the control flow between the BBs of a proce-
dure/function is the information which reveals under which condition a confluence point (or
a join point) in the PCFG of a procedure/function is dynamically reached along a particular
incoming edge to that confluence point.
The incoming edge along which a confluence point is dynamically reached in a proce-
dure/function is decided by the control flow predicates that are associated with each incom-
ing edge to that confluence point. Let us illustrate this with the following example.
Example 4.1 Consider the function foo shown in Figure 2(a) that is written in the C programming
language. The assembly listing (for the PowerPC 860) and the PCFG of foo are shown in Figure 2(b)
and Figure 2(c), respectively. When the RCG analysis arrives at point P shown in the figure, it is
necessary to know along which incoming edge BB4 will dynamically be reached in order for the RCG
algorithm to generate the appropriate branch instructions which will reverse execute foo backwards
from P. The edge to be taken to reach BB4 is decided by the conditional branch instruction at the end
of BB1 which causes the flow of control to be divided into two separate paths before reaching P.
 
int foo(x) {
    int a, b, c = 3;
    b = x | 15;
    a = x / c;
    if (a > 100) {
      b = a + 1;
      c = b - x;
    } else
      c = x - b;
    b = c * a;
    return (b);
}

      li    r11, 3
      ori   r12, r3, 15
      divw  r10, r3, r11
      cmpi  r10, 100
      bg    L1
      sub   r11, r3, r12
      b     L2
L1:   addi  r12, r10, 1
      sub   r11, r12, r3
L2:   mullw r12, r11, r10
      blr

[PCFG: BB1 contains r11 = 3; r12 = r3 | 15; r10 = r3 / r11 and ends with the branch predicate r10 > 100. The fall-through path BB2 contains r11 = r3 − r12; the target path BB3 contains r12 = r10 + 1 and r11 = r12 − r3. Both paths join at point P, the entry of BB4, which contains r12 = r11 × r10.]

Figure 2: (a) A simple program in C. (b) Corresponding assembly instructions. (c) Corresponding PCFG.
The predicate expression of this conditional branch instruction is shown as r10 > 100 in Figure 2(c). If the
value of the predicate expression r10 > 100 is true, then P is reached along one incoming edge (from
BB3); however, if the value of the complementary predicate expression r10 ≤ 100 is true, then P is
reached along the other incoming edge (from BB2). □
Therefore, to determine the dynamically taken incoming edge to a confluence point P ,
one needs to find the following two items:
(1) The predicate expression Υi associated with each incoming edge ei to P such that when
the value of Υi for an edge ei becomes true, that edge is taken to reach the confluence
point P . Here, the index i varies from one to n where n is the number of incoming
edges to point P .
(2) The predicate expression Υtrue, among the predicate expressions found in (1), which
becomes true during a specific iteration or execution arriving at the confluence point P .
To find (1), Find CF() uses special labels assigned to the edges of the PCFG describing
the procedure/function under consideration. As will be briefly mentioned in Section 4.3
and then described in more detail in Section 5.3.2, edge labels also assist
in finding reaching definitions which are essential for RIG generation. Note that we could
also have used a standard control dependency graph (CDG) [22] analysis to determine the
predicate expressions; however, due to the desire to detect the predicate expressions and
reaching definitions together in an efficient way, edge-labeling is preferred over a CDG anal-
ysis. We introduce the edge-labeling algorithm and then describe the predicate expression
determination in Sections 5.1 and 5.2, respectively.
To find (2), we follow two possible methods. The first method is to save the predicate
values during forward execution of a procedure/function. In this first method, it is sufficient
to save the values of n−1 of the n predicate expressions found in (1): if none of the
n−1 saved predicate expressions happens to be true, then the remaining nth predicate expression
is guaranteed to be true. For instance, in Example 4.1, since there are only two predicate
expressions (namely, r10 > 100 and r10 ≤ 100) that are associated with the two incoming
edges to point P, saving the value of only one of the predicate expressions is sufficient to
determine which edge is taken to reach P. The drawback of this method is that, obviously,
state saving of predicate values causes some memory and time overheads during forward
execution of a procedure/function.
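In C terms, this first method amounts to recording a single bit for the two-edge confluence point of Example 4.1; the trace variable and functions below are hypothetical:

    static int pred_trace;             /* saved (r10 > 100) bit */

    static void at_forward_branch(int r10) {
        pred_trace = (r10 > 100);      /* save n-1 = 1 of the 2 predicates */
    }

    static int incoming_edge_to_P(void) {
        return pred_trace ? 3 : 2;     /* edge from BB3 if true, else from BB2 */
    }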
The second method to find (2) is to reevaluate the predicate expressions during reverse
execution. In this second method, it is again sufficient to reevaluate the values of n−1 of the
n predicate expressions found in (1) due to the same reason given in the explanation of the
first method. As an example for the second method, if the predicate expression r10 > 100
– see Figure 2(c) – is chosen to be reevaluated, the value of r10 > 100 can be found once
again at point P by executing the compare instruction “cmpi r10, 100” during reverse exe-
cution. In this second method, there is neither time nor memory overhead encountered during
forward execution of the procedure/function under consideration. However, if the value of
any variable used in a predicate expression to be reevaluated (e.g., the value of r10 in the
expression r10 > 100) has already been destroyed before reaching the reevaluation point,
then that destroyed value must be recovered during reverse execution before the predicate
expression can be reevaluated. This requirement may cause a slower reverse code to be gen-
erated as compared to the code generated by using the first method. The amount of possible
performance degradation of the reverse code depends on how many destroyed variables need
to be recovered in order to reevaluate the predicate expression under consideration.
Both methods explained in the previous two paragraphs are equally applicable. Since
our primary concern is to reduce memory and time overheads of forward execution, the
second method is preferable in most cases to the first method. Therefore,
we use the second method as a default method. However, we still provide the programmer
with an option that minimizes a cost function whose main parameters are memory and time
overheads of forward execution and the speed of reverse execution.
4.3 Gen RIG(): Generating a reverse instruction group
The Gen RIG() function (called from line 8 of Listing 1), which will be outlined in this
section, involves the generation of a RIG for every instruction which directly modifies a
register or a memory location in a procedure/function.
Suppose that a definition δdestroy destroys the value D of a variable (a directly modified
register or memory location) V at a procedure/function point as shown in Figure 3. Let us
name the program point just before δdestroy as P. To recover D, one needs to know at what
point in the procedure/function D might be assigned to V . This is exactly the same problem
as finding the definitions of V reaching point P.
Gen RIG() follows a more efficient technique than the common technique of using bit-
vectors to determine reaching definitions at a procedure/function point [22]. The main
reason for the increased efficiency is that Gen RIG() does not require an iterative solution
of data-flow equations. First, Gen RIG() employs a method called value renaming, which
assigns a distinct name to each newly defined value of a register or memory location.
[Figure 3 depicts a variable V with three reaching definitions, δ1: V = D1, δ2: V = D2 and δ3: V = D3, a later use µ (... = V ...), and the destruction point δdestroy (V = ...), reached along paths 1–4; the destroyed value is D ∈ {D1, D2, D3}. Re-executing δ1 recovers D for path 1; re-executing δ2 recovers D for paths 2 and 3; re-executing δ3 recovers D for path 4; extracting V out of µ recovers D for paths 3 and 4.]

Figure 3: Recovering a destroyed variable.
Value renaming is the same as the renaming operation in the well-known static single
assignment (SSA) form generation [10]. By value renaming, different definitions of a variable
can easily be distinguished from one another. Then, Gen RIG() uses the labels on the edges
of the PCFG to efficiently find reaching definitions at each procedure/function point. The
details of how value renaming is performed and how reaching definitions are determined will
be described in Sections 5.3.1 and 5.3.2, respectively.
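As a rough illustration of value renaming (a C sketch only; the RCG algorithm renames register and memory values, not C variables), the three definitions of r12 in Figure 2(b) would receive distinct names:

    void rename_sketch(void) {
        int r3 = 50, r10 = 200, r11 = 96;      /* arbitrary values */
        int r12_1 = r3 | 15;                   /* was: r12 = r3 | 15   (BB1) */
        int r12_2 = r10 + 1;                   /* was: r12 = r10 + 1   (BB3) */
        int r12_3 = r11 * r10;                 /* was: r12 = r11 * r10 (BB4) */
        (void)r12_1; (void)r12_2; (void)r12_3; /* now distinguishable definitions */
    }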
Each statically reaching definition δi of V at point P might correspond to the instance
where D is actually assigned to V (Figure 3). The definition that corresponds to the actual
assignment instance is the definition that dynamically reaches point P.
The definition that dynamically reaches point P depends on the dynamically taken path
to P. However, the path that will actually be taken is typically not known prior to program
execution. Even if the dynamic control flow information of the program is obtained by a
prior path profiling technique, the selection of the instructions to be generated to recover
D should not be a static selection. Otherwise, every time the control flow of the program
changes (e.g., due to different inputs fed into the program), the selection may have to be
changed. Therefore, Gen RIG() uses the following technique to recover D at the proce-
dure/function point where D is destroyed: Gen RIG() generates sets, each of one or more
instructions, where each set of instructions recovers D if a path within a group of one or
more paths is dynamically taken to reach δdestroy. Each group of paths may start either from
a unique definition of V reaching P or from multiple definitions of V reaching P. We will
describe in the next paragraphs how these sets are constructed and for which group of paths
each set can recover D. Gen RIG() generates as many sets as necessary to cover all possible
paths from definitions of V reaching P to the destruction point δdestroy. If only one set is
enough, then we say that the generated set “completely” recovers D. However, if more than
one set needs to be generated, then we say that each generated set “partially” recovers D.
If more than one set is generated, the final decision of the correct set to be executed during
reverse execution is bound to the control flow information of the procedure/function under
consideration by generating conditional branch instructions in the reverse code. The pred-
icates of the conditional branch instructions carry the dynamic control flow information of
the procedure/function as explained in Section 4.2. Consequently, the conditional branch
instructions perform a gating task for the sets wherein the correct set to be executed is
automatically selected according to the dynamic control flow of the procedure/function.
Let us now describe how a set of instructions which we will denote by ζ can be generated
to recover D for a particular group of one or more paths to δdestroy and explain which paths
are included in such a group of paths. There are two techniques that are followed to generate
such a set of instructions ζ.
The first technique is to put into ζ the instruction αi of definition δi of V that statically
reaches point P. The “instruction αi of definition δi of V ” refers to the instruction which
computes a value Di for its target operand V at a definition site δi (Figure 3). We will call
this first technique the redefine technique. If δi dynamically reaches point P in a specific
execution of the procedure/function under consideration, then Di is indeed equal to the
destroyed value D in that specific execution. However, if any one of the variables that is used
for computing Di is also destroyed, then the instruction which recovers that variable must be
inserted before αi in ζ and this must be applied recursively for all other modified variables
in the dependency chain. The redefine technique can potentially recover D for every path
from δi to δdestroy. Note that the external value of an input variable (e.g., a global variable
or an input argument) of a procedure/function is certainly not defined (or generated) within
the procedure/function but comes from outside of the procedure/function. Therefore, the
external values input to a procedure/function cannot be recovered by the redefine technique.
As an alternative technique to recover D for a particular group of one or more paths
to δdestroy, Gen RIG() can utilize the possible uses of V on those paths as well (a use of V
indicates that V is a source operand of an instruction). In other words, Gen RIG() can put in
ζ an instruction β which extracts the destroyed value of V out of a use µ (including a possible
use of V by δdestroy itself) on the path between δdestroy and any definition of V that reaches
point P (Figure 3). For example, the value of r1 as is used by an instruction “r3 = r1 − r2”
can be recovered by the instruction “r1 = r3 + r2” (assuming that r3 and r2 have not
been modified and that “r3 = r1 − r2” has been performed either as an integer operation
or as a floating point operation without overflow/underflow). This second technique we call
the extract-from-use technique. However, again, if any other variable in β which is used
for extracting V is also destroyed, then an instruction which recovers that variable must be
inserted before β in ζ and this must be applied recursively for all other modified variables
in the dependency chain. The extract-from-use technique recovers D on every path passing
through the use µ. Consequently, as opposed to the redefine technique, the extract-from-use
technique can recover D on a group of paths which starts from multiple definitions of V
reaching P. Note that the external value of an input variable of a procedure/function may
be used within the procedure/function. Therefore, the input values to a procedure/function
might still be recovered by using the extract-from-use technique. However, the extract-from-
use technique is less likely to be applicable than the redefine technique because there might
not always be a use µ on a path to δdestroy and even if a use is available, µ’s operation
might not always allow such an extraction of the value of V . For example, the instruction
“r3 = r1 / r2” might prevent the extraction of r1 or r2 as the result of the division operation
might be truncated due to the limited precision r3 can represent.
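The truncation problem can be seen concretely in C (arbitrary values):

    int extract_sketch(void) {
        int r1 = 7, r2 = 2;
        int sub  = r1 - r2;      /* use of r1: extraction is exact */
        int got1 = sub + r2;     /* 7: r1 recovered from the subtraction */

        int div  = r1 / r2;      /* use of r1: remainder truncated (div == 3) */
        int got2 = div * r2;     /* 6 != 7: r1 cannot be extracted from the division */
        return (got1 == r1) && (got2 != r1);   /* 1: shows both outcomes */
    }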
Note that Gen RIG() finds the dependencies between the variables by using a directed
acyclic graph (DAG). The details about DAG construction and use are explained in Sec-
tion 5.3.3.
The following example illustrates how a destroyed variable can be recovered at a proce-
dure/function point.
Example 4.2 Recovering a variable: Consider the instruction which overwrites the value of register
r12 in BB4 in Figure 2(c) (we need the overwritten value of r12 because the overwritten value is used
both in BB2 and BB3). Let us name this instruction as α and the analysis points just before α and
just after α as P and P′, respectively. There are two different definitions of r12 reaching P on two
different paths: “r12 = r10 + 1” and “r12 = r3 | 15”. Therefore, the value of r12 at point P is
either “r10 + 1” or “r3 | 15”. Moreover, neither r10 nor r3 is modified after being used to define r12
and before point P′. Therefore, the destroyed value of r12 can be partially recovered on one path by
executing the set of single instruction “r12 = r10 + 1”, and it can be partially recovered on the other
path by executing the set of single instruction “r12 = r3 | 15”. Since there are two different possible
sets, each of only one instruction, the set to be executed for reversing the effect of α should be decided
by the help of a conditional branch as follows:
cmpi r10, 100
bg L1
ori r12, r3, 15
b L2
L1: addi r12, r10, 1
L2: ...
The list of instructions just given constitutes an example for recovering a destroyed value by using
the redefine technique. In this example, the extract-from-use technique is also applicable on both
paths ending at P′. In Figure 2(c), after the two definitions of r12 reaching P, there are two uses
of r12 on each path: “r11 = r12 − r3” and “r11 = r3 − r12”. Moreover, neither r11 nor r3
is modified between the points of uses and point P′. Note that these subtractions are performed as
integer operations and thus they are perfectly reversible; even if an overflow/underflow occurs during an
integer subtraction operation, the operation might still be reversible if the hardware does not apply any
kind of truncation to the result (note that in this example, we use the instruction set of PowerPC 860
processor which does not apply any truncation to the result of an integer addition/subtraction operation
with overflow/underflow [21]; therefore, an integer addition/subtraction operation on the PowerPC 860
is reversible regardless of overflow/underflow). Thus, if the point P′ is reached passing through the use
“r11 = r12 − r3”, the destroyed value of r12 can be obtained by executing the set of single instruction
“r12 = r11 + r3”; if P′ is reached passing through the use “r11 = r3 − r12”, then the destroyed
value of r12 can be obtained by executing the set of single instruction “r12 = r3 − r11”. In other
words, the destroyed value of r12 may also be recovered on a path by extracting the destroyed value out
of a use on that path. Since there are two possible sets, each of one instruction, a conditional branch
instruction dynamically decides on the correct set to be executed in order to recover r12. Therefore, a
RIG for α can also be generated as follows:
cmpi r10, 100
bg L1
sub r12, r3, r11
b L2
L1: add r12, r11, r3
L2: ...
□
For a path or a group of paths, Gen RIG() chooses among the two recovery techniques
(redefine and extract-from-use) the one that will result in the smallest RIG. Note that in cir-
cumstances where the extract-from-use technique is not applicable at all for a path or a group
of paths (e.g., as may be the case with a floating point division resulting in underflow/overflow
or a division resulting in truncation), the redefine technique (in which the destroyed value
is reevaluated) may potentially be used with as many recursions as needed provided that
the recursions do not end up requiring an external input of the procedure/function under
consideration. However, in order to limit the time cost of reverse execution, the programmer
might prefer not to use a recovery technique requiring the recovery of many other additional
variables used to define the original variable whose redefinition is sought. Therefore, if ad-
ditional recoveries are necessary for a recovery and (i) if the number of additional recoveries
exceeds a predetermined value which is set by the programmer or (ii) if the additional re-
coveries end up requiring an external input of the procedure/function under consideration
(i.e., the recovery of a variable within the procedure/function requires the knowledge of an
input value), Gen RIG() resorts to state saving as illustrated in Figure 4. State is saved by
inserting a push-like instruction into the original code just before the state-changing instruc-
tion for which the state has to be recovered. The inserted instruction saves the state (e.g.,
r9 in Figure 4) that is being modified by the state-changing instruction into a free memory
location that is pointed to by a memory pointer (usually a register) and moves the memory
pointer to the next free location. Then, in the reverse program, a pop-like instruction is
generated which moves the memory pointer to the next value to be restored and restores the
saved value from memory.
 
[Figure 4 depicts a push-like instruction inserted into the forward code just before the state-changing instruction “add r9, r3, r4”; the inserted instruction saves the old value of r9 into the location indicated by a memory pointer and advances the pointer, while a matching pop-like instruction in the reverse program restores the saved value.]
Figure 4: A diagram which illustrates the state saving method of the RCG algorithm.
A push-/pop-like instruction refers to an instruction which works in the same way as an
ordinary push/pop instruction; however, a push-/pop-like instruction can work on any mem-
ory pointer, while a push/pop instruction can work only on the stack pointer. For instance,
PowerPC 860 provides store-update and load-update instructions which can be used as push-
like and pop-like instructions, respectively. Ordinary push and pop instructions are not con-
sidered for state saving in order to not possibly corrupt the stack. If the target architecture
does not support pop-like/push-like instructions which automatically increment/decrement
a memory pointer, save and restore operations are handled by using ordinary store and load
instructions and by incrementing/decrementing a dedicated memory pointer explicitly.
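In that fallback case, the save and restore operations can be sketched in C as follows (the buffer and pointer are hypothetical):

    /* Explicit emulation of push-like/pop-like state saving with a
       dedicated memory pointer (kept separate from the stack pointer). */
    static unsigned int save_area[1024];        /* hypothetical save region */
    static unsigned int *save_ptr = save_area;

    static void push_like(unsigned int v) {     /* inserted before the        */
        *save_ptr++ = v;                        /* state-changing instruction */
    }

    static unsigned int pop_like(void) {        /* emitted in the reverse     */
        return *--save_ptr;                     /* program to restore state   */
    }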
4.3.1 Handling loops
As mentioned before, reverse code generation for a loop requires additional passes over the
loop body for the recovery of some instructions within the loop. The reason is as follows.
A variable modified by an instruction α within a loop L may show a transient behavior
at the early iterations of L until the values that come from outside of L are propagated into
the loop body. Thus, the code that reverses the effect of α may be different for different
instances of α (i.e., for the instances due to different loop iterations). Consider the following
example.
Example 4.3 Figure 5 shows a loop with four instructions. The values obtained by the target
operands of the instructions at successive iterations of the loop are also shown in the figure. As seen
from the figure, it requires three loop iterations until a pattern is observed in the values obtained by the
target operand of the first instruction. The values obtained by this operand are affected by the values
input to the loop and are totally unrelated at the early instances of the first instruction in the loop.
Therefore, it is necessary to reverse the effect of each such instance of the first instruction separately.
□
 
      li    r1, 0      // r1 = 0
      li    r2, 3      // r2 = 3
      li    r3, 1      // r3 = 1
L1:   addi  r1, r2, 2  // r1 = r2 + 2
      mulli r2, r3, 3  // r2 = r3 × 3
      addi  r3, r3, 1  // r3 = r3 + 1
      b     L1         // goto L1

[The target operands settle into regular patterns after a transient over the early iterations: r1(n) = r1(n−1) + 3 and r2(n) = r2(n−1) + 3, each after a transient, while r3(n) = r3(n−1) + 1 from the start.]

Figure 5: A simple loop.
In order to capture this transient behavior, the RCG algorithm traverses the loop L
multiple times wherein at each traversal, Gen RIG() generates a distinct set of instructions
which reverses the effect of a single instance of α. In other words, at the first traversal,
Gen RIG() generates a set of instructions, ζ1, which reverses the effect of the first instance
of α; at the second traversal, it generates another set of instructions, ζ2, which reverses the
effect of the second instance of α; and so on.
Since the instructions within L repeat exactly, if a set of instructions generated to reverse
the effect of an instance of α makes use of the instructions within L only, then that set can be
used to reverse the effect of all the later instances of α as well. In this way, Gen RIG() can
decide on when to stop the traversals over L. Therefore, at each traversal over L, Gen RIG()
tries to generate a set of instructions that makes use of the instructions within L only.
Ideally, the traversals over L should be repeated until such a set (i.e., a set which makes
use of the instructions within L only) can be constructed. However, we limit the maximum
number of traversals over a loop body to three, not only to limit the time cost of the RCG
algorithm but also to limit the length of the reverse code generated for α. If such a set cannot
be constructed within three traversals over L, we apply state saving to generate a RIG which
reverses the effect of all the instances of α. In case state saving can be avoided, on the other
hand, the generated sets of instructions at each traversal (i.e., the sets from ζ1 up to ζ3) are
combined together to produce a RIG for α. The set of instructions to be executed within the
RIG during a specific reverse execution is determined by the help of a loop counter which
distinguishes among different loop iterations.
The following example illustrates reverse code generation for loops.
Example 4.4 Figure 6 shows a symbolic version of the generated RIG for the first instruction α in
the loop of Figure 5. Figure 6 also shows the loop unrolled three times where each unrolled iteration
corresponds to one of the traversals of the RCG algorithm over the loop body.
 
The original loop instrumented with a loop counter rLC (α is the first loop instruction):

      li    r1, 0        // r1 = 0
      li    r2, 3        // r2 = 3
      li    r3, 1        // r3 = 1
      li    rLC, 0       // rLC = 0
L1:   addi  r1, r2, 2    // α: r1 = r2 + 2
      mulli r2, r3, 3    // r2 = r3 × 3
      addi  r3, r3, 1    // r3 = r3 + 1
      addi  rLC, rLC, 1  // rLC = rLC + 1
      b     L1           // goto L1

The loop unrolled three times, one unrolled iteration per RCG traversal (P1, P2 and P3 denote the analysis points just before the first, second and third instances of α):

      li    r1, 0        // δ1: r1 = 0
      li    r2, 3        // δ2: r2 = 3
      li    r3, 1        // δ3: r3 = 1
      li    rLC, 0       // rLC = 0
      addi  r1, r2, 2    // δ4: r1 = r2 + 2 (P1 just before δ4)
      mulli r2, r3, 3    // δ5: r2 = r3 × 3
      addi  r3, r3, 1    // δ6: r3 = r3 + 1
      addi  rLC, rLC, 1  // rLC = rLC + 1
      addi  r1, r2, 2    // δ7: r1 = r2 + 2 (P2 just before δ7)
      mulli r2, r3, 3    // δ8: r2 = r3 × 3
      addi  r3, r3, 1    // δ9: r3 = r3 + 1
      addi  rLC, rLC, 1  // rLC = rLC + 1
      addi  r1, r2, 2    // δ10: r1 = r2 + 2 (P3 just before δ10)
      mulli r2, r3, 3    // δ11: r2 = r3 × 3
      addi  r3, r3, 1    // δ12: r3 = r3 + 1
      addi  rLC, rLC, 1  // rLC = rLC + 1
      ...

RIG for α:

if (rLC == 0)
  r1 = 0
else if (rLC == 1) {
  rt = 3
  r1 = rt + 2
} else {
  rt = r3 − 1
  rt = rt − 1
  rt = rt × 3
  r1 = rt + 2
}

Figure 6: A diagram which illustrates reverse code generation for loops.
At the first traversal of the loop, Gen RIG() finds the reaching definition of the destroyed register
r1 as “r1 = 0” at point P1. Then, Gen RIG() generates the set ζ1 as “r1 = 0” which reverses the
effect of the first instance of α (i.e., δ4) by using the redefine technique.
At the second traversal of the loop, the definition of r1 to be recovered is the definition that reaches
P2. This definition is “δ4 : r1 = r2 + 2” which comes from within the loop this time. In order to
recover r1 from δ4, Gen RIG() needs the value of r2. However, r2 is destroyed by δ5 between δ4 and
P2. The destroyed definition of r2 is “δ2 : r2 = 3” which comes from outside of the loop. Therefore,
Gen RIG() first puts into ζ2 the instruction “rt = 3” which restores the value of r2 into a temporary
register rt using the redefine technique (rt is used instead of r2 to preserve the current value of r2).
Then, Gen RIG() puts into ζ2 the instruction “r1 = rt + 2” which recovers the destroyed value of r1.
At the third traversal of the loop, we are at point P3. The reaching definition of r1 at P3 is
“δ7 : r1 = r2 + 2”. The definition of r2 used in δ7 is “δ5 : r2 = r3 × 3” and is destroyed by δ8
before reaching P3. Therefore, Gen RIG() has to recover r2 before recovering r1. However, r3 as used
in δ5 does not reach point P3, either. Moreover, r3 has been overwritten twice after being used in δ5:
once by δ6 and once by δ9. Thus, Gen RIG() first puts into ζ3 the instructions “rt = r3 − 1” and
“rt = rt − 1” which restore the value of r3 into a temporary rt by using the extract-from-use technique
twice (once on δ9 and once on δ6). Then, Gen RIG() puts into ζ3 the instruction “rt = rt × 3” which
restores the value of r2 into rt by using the redefine technique. Finally, Gen RIG() puts into ζ3 the
instruction “r1 = rt + 2” which recovers the value of r1. Since ζ3 is constructed using instructions
only from within the loop, ζ3 indeed reverses the effect of the later instances of α as well. Therefore,
for this example, it is sufficient to traverse the loop three times to generate a RIG for α without state
saving. As seen in Figure 6, the set (ζ1 or ζ2 or ζ3) to be executed within the generated RIG during
reverse execution is determined by a loop counter (rLC) which is inserted into the original loop. □
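Example 4.4 can be checked mechanically. The following Python sketch (ours, not part of the RCG implementation; the helper names are assumptions) runs the instrumented loop of Figure 6 forward for k iterations and then asserts that the RIG, dispatching on the loop counter, recovers the value of r1 destroyed by the last instance of α.

def forward(k):
    """Run the instrumented loop of Figure 6 for k iterations and return
    the final r3 and rLC together with the value r1 held just before the
    k-th instance of alpha (the value the RIG must recover)."""
    r1, r2, r3, rLC = 0, 3, 1, 0
    for _ in range(k):
        r1_before = r1            # value destroyed by this instance of alpha
        r1 = r2 + 2               # alpha: addi  r1, r2, 2
        r2 = r3 * 3               #        mulli r2, r3, 3
        r3 = r3 + 1               #        addi  r3, r3, 1
        rLC += 1                  #        addi  rLC, rLC, 1
    return r3, rLC, r1_before

def rig_alpha(r3, rLC):
    """The RIG of Figure 6, dispatching on the loop counter."""
    if rLC == 0:
        return 0                  # zeta1: redefine from "r1 = 0"
    elif rLC == 1:
        rt = 3                    # zeta2: redefine r2 from "r2 = 3"
        return rt + 2
    else:                         # zeta3: extract-from-use on delta9 and
        rt = r3 - 1               #        delta6, then redefine r2 and r1
        rt = rt - 1
        rt = rt * 3
        return rt + 2

for k in range(1, 20):
    r3, rLC, expected = forward(k)
    # Reverse execution first undoes the instructions that follow alpha in
    # the iteration, restoring r3 and rLC to their values at alpha:
    assert rig_alpha(r3 - 1, rLC - 1) == expected

The assertion holds for every k, confirming that ζ3 indeed reverses the effect of all instances of α from the third one on.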
The technique described in this section is applied in a straightforward way to the nested
loop structures as well wherein the passes over the nested loops are completed starting from
the innermost loop going to the outermost loop.
4.4 Combine RIGs(): Combining the reverse instruction groups
The function Combine RIGs() (called from line 10 of Listing 1) combines a RIG with all
the previously generated RIGs so as to produce a reverse version of the procedure/function
under consideration that is up to date with respect to the RIGs generated so far. The pseudo code for the
Combine RIGs() function is given in Listing 3.
Listing 3 Combine RIGs(): Combining the reverse instruction groups
Input: A RIG, RIGα, generated for an instruction α
Output: A linked list of RIGs
begin
1 if α is beginning of a basic block BBk then
2 n = |IncomingEdges(BBk)|
3 if n > 1 then
4 for z = 1 to n − 1 do
5 Generate a set C of conditional branch instructions with the predicates determined by Find CF()
6 Link C to the top of the reverse code
7 end for
8 else if BBk is a target of a conditional branch instruction β in the original code then
9 Generate an unconditional branch instruction ub
10 Link ub to the top of the reverse code
11 end if
12 end if
13 Link RIGα to the top of the reverse code
end
As we mentioned in the beginning of Section 4, the generated RIGs should be placed
in a way to make the RIGs execute in an order opposite to the completion order of the
instructions in the original procedure/function. We know that the instructions within a BB
complete in lexical order; therefore, placing the RIGs in the order opposite to the lexical
order of the original BB is sufficient to generate the reverse of that BB. This implies that the
RIGs generated for the BBs are placed in a bottom-up fashion in the reverse code (line 13 of
Listing 3). In other words, if a basic block BBi in the procedure/function under consideration
has a sequence of instructions IBBi = (α1, α2, α3, . . . αn), and if the corresponding RIGs
generated for BBi are RIGBBi = {RIG1, RIG2, RIG3, . . . RIGn}, then the reverse of BBi,
designated as RBBi, consists of the sequence IRBBi = (RIGn, RIGn−1, RIGn−2, . . . RIG1).
Note that since a generated RIG, RIGk (1 ≤ k ≤ n) in IRBBi , may contain branch instructions
(see Example 4.2), an RBB may not necessarily be a single basic block, but instead may
be a combination of multiple basic blocks. A short sketch of this placement rule is given
below, and the example following it shows how the RBBs are constructed from the BBs of a procedure/function.
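In Python (ours, with hypothetical RIG contents), the rule of line 13 of Listing 3 reads as follows: each newly generated RIG is linked to the top of the reverse code, so the RIGs of a BB end up in the order opposite to the lexical order of its instructions.

def reverse_basic_block(rigs):
    """rigs: the RIGs of (alpha1, ..., alphan) in lexical order, each a
    list of reverse instructions; returns the instruction sequence of the
    corresponding RBB, i.e. (RIGn, ..., RIG1)."""
    rbb = []
    for rig in rigs:          # RIGs are generated in lexical order ...
        rbb = rig + rbb       # ... and each is linked to the top
    return rbb

# Hypothetical three-instruction BB; the middle RIG has two instructions.
rigs = [["r1 = 0"], ["rt = 3", "r1 = rt + 2"], ["r3 = r3 - 1"]]
print(reverse_basic_block(rigs))
# ['r3 = r3 - 1', 'rt = 3', 'r1 = rt + 2', 'r1 = 0']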
Example 4.5 Construction of the RBBs: Figures 7(a) and 7(b) show the PCFG of foo and the
RBBs generated for reverse foo, respectively. The RCG algorithm generates the reverse of each BB
in foo by combining the generated RIGs in bottom-up placement order in reverse foo (line 13 of
Listing 3). While the reverse of BB1, BB2 and BB3 (namely, RBB1, RBB2 and RBB3) are constructed
each as a single BB, the reverse of BB4, RBB4, consists of three separate BBs. RBB4 is separated
into three BBs because the reverse of the instruction “r12 = r11 × r10” in BB4 consists of multiple
instructions two of which are branch instructions (as given in either one of the assembly listings in
Example 4.2). Note that since the initial values of r10, r11 and r12 are input to foo (and thus the
redefine technique is not applicable) and since these initial values are not used in foo (and thus the
extract-from-use technique is not applicable either), the initial values of r10, r11 and r12 are recovered
in RBB1 by the state saving method described in Section 4.3. Figure 7(c) shows the PCFG of foo
instrumented with the state saving instructions. □
 
[Figure content: (a) the PCFG of foo: BB1 contains r11 = 3, r12 = r3 | 15 and r10 = r3 / r11 and ends with the condition r10 > 100; BB2 and BB3 contain r11 = r3 − r12 and r12 = r10 + 1 followed by r11 = r12 − r3; BB4 contains r12 = r11 × r10. (b) The RBBs of reverse foo, in which RBB1 contains the instructions restore r10, restore r12 and restore r11. (c) The PCFG of foo instrumented with the state saving instructions (saves of r11, r12 and r10) placed before the corresponding definitions.]
Figure 7: (a) PCFG of foo. (b) RBBs of reverse foo. (c) PCFG of instrumented foo.
To generate the reverse version of a procedure/function, the RBBs generated for that
procedure/function should be combined together in an appropriate way. Once again, this
combination should satisfy our argument that the RIGs should execute in the order oppo-
site to the completion order of the instructions in the original program. Combine RIGs()
achieves this by combining the RBBs via the inverted versions of the edges in the orig-
inal procedure/function. Consequently, a confluence point of incoming edges in a proce-
dure/function typically becomes a fork point of outgoing edges in the reverse version of that
procedure/function, and vice versa.
Suppose that a confluence point Po in the original procedure/function becomes a fork
point Fr in the reverse procedure/function. Depending on the number of incoming edges
to Po (or outgoing edges from Fr), Combine RIGs() inserts at Fr one or more conditional
branch instructions which determine which edge to take at Fr during reverse execution. As
described in Section 4.2, Find CF() associates with each incoming edge of Po a predicate
expression which determines along which edge Po is reached. Consequently, the predicate
expressions associated with the incoming edges of Po are directly used as the predicates of
the conditional branch instructions inserted at Fr. The values of these predicates can either
be saved or reevaluated as explained in Section 4.2.
Recall from Section 4.2 that since the predicate expressions found at Po are mutually
exclusive, it is sufficient to save or reevaluate only n − 1 of the n predicate expressions
associated with n incoming edges of Po. Therefore, at the corresponding fork point Fr in the
reverse procedure/function, Combine RIGs() generates n−1 conditional branch instructions,
each using one of the n − 1 predicate expressions found at Po (lines 3 to 6 of Listing 3).
Due to linear orientation of code in memory, the target of one of the n outgoing edges
from Fr immediately follows Fr in address space. Let us name this outgoing edge as e. Since
it is inefficient to generate a conditional branch whose target address is the next address, it is
not appropriate to generate a conditional branch corresponding to e. Therefore, among the
n predicate expressions found at Po, the predicate expression we leave out is always the one
that is associated with the incoming edge of Po whose inverted version is e (see Example 4.6).
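A compact sketch of this branch insertion follows (ours; the edge names and the textual form of the branches are hypothetical, and the predicates are those found by Find CF()).

def branches_at_fork(predicates, fall_through_edge):
    """predicates: {incoming edge of Po: predicate expression}. Returns
    the n-1 conditional branches to insert at the fork point Fr; the edge
    whose inverted version falls through at Fr gets no branch."""
    return [f"if ({pred}) goto reverse_of({edge})"
            for edge, pred in predicates.items()
            if edge != fall_through_edge]

# The Figure 8 instance: RBB3 follows Fr in address space, so only the
# predicate associated with the edge inverted into RBB2 is used.
print(branches_at_fork({"edge_to_RBB2": "r10 <= 100",
                        "edge_to_RBB3": "r10 > 100"},
                       fall_through_edge="edge_to_RBB3"))
# ['if (r10 <= 100) goto reverse_of(edge_to_RBB2)']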
On the other hand, suppose that a fork point Fo of two edges in the original proce-
dure/function becomes a confluence point Pr of two edges in the reverse procedure/function
(recall from Section 4.1 that a fork point in the original procedure/function can have at most
two outgoing edges at the assembly level). In this case, it is necessary to establish a link
between the source of each joining edge and the confluence point Pr in the reverse proce-
dure/function. Note that as the RCG algorithm analyzes the instructions on the fall-through
path of Fo before the instructions on the target path of Fo (due to lexical order scanning
of the instructions), the reverses of the instructions on the fall-through path are generated
before the reverses of the instructions on the target path. To keep the bottom-up placement
order, Combine RIGs() places the reverses of the instructions on the fall-through path below
the reverses of the instructions on the target path. Thus, the reverses of the instructions
on the fall-through path of Fo always precede Pr in the reverse procedure/function, which
establishes an automatic link between them. Therefore, the remaining part is to provide the
link between Pr and the source of the joining edge that is the inverted version of the edge
on the target path of Fo. This link is established by inserting an unconditional branch at
the head of that joining edge in the reverse procedure/function (lines 8 to 10 of Listing 3).








[Figure content: the PCFGs of reverse_foo and of instrumented foo shown side by side; the confluence point Po of foo appears as the fork point Fr in reverse_foo, and the fork point Fo of foo (with its fall-through path and target path) appears as the confluence point Pr in reverse_foo. The assembly listing of reverse_foo is:

cmpi r10, 100
ble L1
addi r12, r10, 1
b L2
L1: ori r12, r3, 15
L2: cmpi r10, 100
ble L3
li r11, 3
ori r12, r3, 15
b L4
L3: li r11, 3
L4: lwzu r10, -4(r9)
lwzu r12, -4(r9)
lwzu r11, -4(r9)

The assembly listing of instrumented foo is:

li r9, 0x0
stwu r11, 4(r9)
li r11, 3
stwu r12, 4(r9)
ori r12, r3, 15
stwu r10, 4(r9)
divw r10, r3, r11
cmpi r10, 100
bg L1
sub r11, r3, r12
b L2
L1: addi r12, r10, 1
sub r11, r12, r3
L2: mullw r12, r11, r10
blr

The PowerPC 860 instructions “stwu” and “lwzu” are used as push-like and pop-like instructions, respectively.]
Figure 8: A diagram which illustrates the combination of the RBBs.
The following example illustrates how the RBBs are combined to generate a reverse
version of a procedure/function.
Example 4.6 Combining the RBBs: Figure 8 shows the PCFGs of foo and reverse foo together.
Also seen in the figure are the assembly listings of the instrumented foo (i.e., instrumented with state
saving instructions) and reverse foo. Since the RBBs are combined with the inverted versions of the
edges in the PCFG of foo, the confluence point designated as Po in the PCFG of foo becomes a fork
point designated as Fr in the PCFG of reverse foo, and the fork point designated as Fo in the PCFG
of foo becomes a confluence point designated as Pr in the PCFG of reverse foo. Consequently, a
conditional branch instruction is inserted at point Fr (lines 3 to 6 of Listing 3) and an unconditional
branch instruction is inserted at the head of one of the joining edges at Pr (lines 8 to 10 of Listing 3). The
predicate of the conditional branch inserted at Fr can be determined by observing the following facts:
(1) As already shown in Example 4.1, the predicate expression associated with the left incoming edge
of Po is r10 > 100 and with the right incoming edge of Po is r10 ≤ 100. Consequently, control
should be directed from point Fr to RBB3 if r10 > 100 is true and to RBB2 if r10 ≤ 100 is true.
(2) Since the predicate expressions in (1) are mutually exclusive (i.e., they cannot be true at the same
time), using only one of the predicate expressions is sufficient to determine the dynamically taken
edge to Po (and thus the edge to be taken out of Fr).
(3) Note that due to the bottom-up placement order, RBB2 is placed below RBB3; therefore, RBB3
follows point Fr in address space.
(4) We know that a conditional branch instruction directs the control to its target address if the
predicate of the conditional branch is true; otherwise, execution continues with the instruction
after the conditional branch.
Therefore, from (1) to (4), the predicate of the conditional branch at Fr is determined as r10 ≤ 100.
The value of this predicate expression is dynamically determined by executing a compare instruction
“cmpi r10, 100” in reverse foo (i.e., by reevaluating the predicate value during reverse execution).
Note that due to the bottom-up placement order described, an unconditional branch instruction is
placed only at the point that corresponds to the target address of the conditional branch instruction in
foo (the other edge simply falls through, i.e., RBB2 is directly followed in address space by RBB1). □
4.5 Combine Procs(): Combining the reverse procedures/functions
After generating the reverse version of a procedure/function, Combine Procs() (called from
line 23 of Listing 1) combines the reverse version of that procedure/function with the other
reverse procedures/functions that have already been generated. In order to achieve this,
Combine Procs() must know the control flow information between the procedures/functions
in the program under consideration.
4.5.1 Grow CG(): Determining the inter-procedural control flow
Since the PCFG construction is performed for each procedure/function separately, each
PCFG designates the control flow within a particular procedure/function only. In other
words, a PCFG does not contain any edges which show the flow of control between the pro-
cedures/functions. Therefore, the control flow information between the procedures/functions
is determined by a call graph, CG=(N ,E, s, t), which is built by the function Grow CG()
mentioned in Section 4.1. The set N of CG is the set of nodes designating the proce-
dures/functions in the program and the set E of CG is the set of edges designating the
flow of control between those procedures/functions. The notations s and t designate the
unique entry and exit nodes of the CG. Note that since an indirect call (i.e., use of a
function pointer) whose target procedure/function is statically unknown may potentially
invoke any high-level (or unpartitioned) procedure/function in the program under consider-
ation (we assume a function pointer can only point to the beginning address of a high-level
procedure/function but not to an arbitrary address), Grow CG() inserts an edge from a
procedure/function F to every other procedure/function that F may call if F makes an
indirect call whose target procedure/function is statically unknown. To learn from which
address(es) a procedure/function can be immediately reached and thus to be able to move
the control backwards to a source address, Grow CG() annotates an edge eij ∈ E from a
procedure/function Fi to another procedure/function Fj with the address of the instruction
in Fi that transfers control from Fi to Fj.
Listing 4 Grow CG(): The CG construction algorithm
Input: A procedure/function Fj for which a PCFG has been generated
Output: A node nj in the CG with a set of edges connected to nj
begin
1 Add a node nj to CG for Fj
2 for all Fk immediately reached from Fj do
3 if (nk = node of(Fk)) ≠ NULL then
4 Add to the CG an edge ejk from nj to nk
5 Annotate ejk
6 else
7 Set ejk as pending
8 end if
9 end for
10 if nj has a pending incoming edge then
11 for all pending eij from ni to nj do
12 Add eij to the CG
13 Annotate eij
14 end for
15 end if
end
Listing 4 shows the pseudo code of Grow CG(). Grow CG() adds a new node to the
CG for a procedure/function when a PCFG is built for that procedure/function (see line 14
of Listing 2). After a new node nj is generated for a procedure/function Fj, Grow CG()
checks the procedures/functions which are immediately reachable from Fj. For every proce-
dure/function Fk which is immediately reachable from Fj and for which a PCFG (and thus
a node in the CG) is generated, Grow CG() adds an edge ejk from the node of Fj to the
node of Fk and annotates ejk with the address of the instruction transferring control from
Fj to Fk (lines 2 to 5 of Listing 4). For every other procedure/function which is immedi-
ately reachable from Fj but for which a node has not yet been generated, Grow CG() sets a
pending edge (lines 6 and 7 of Listing 4). Then, Grow CG() checks whether Fj has pending
incoming edges set for it. If Fj has pending incoming edges, Grow CG() adds to the CG
all the pending incoming edges that are set for Fj and annotates those edges appropriately
(lines 10 to 15 of Listing 4).
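The pending-edge mechanism can be sketched in a few lines of Python (ours; the class, the function names and the call-site addresses are assumptions chosen to mirror Listing 4).

class CallGraph:
    def __init__(self):
        self.nodes = {}              # name -> set of annotated out-edges
        self.pending = {}            # callee name -> [(caller, call addr)]

    def add_node(self, name, callees):
        """callees: [(callee_name, call_site_address), ...]"""
        self.nodes[name] = set()
        for callee, addr in callees:
            if callee in self.nodes:                 # lines 3 to 5
                self.nodes[name].add((callee, addr))
            else:                                    # lines 6 and 7
                self.pending.setdefault(callee, []).append((name, addr))
        for caller, addr in self.pending.pop(name, []):  # lines 10 to 15
            self.nodes[caller].add((name, addr))

cg = CallGraph()
cg.add_node("m1", [("g1", 0xA0)])   # g1 not built yet: edge is pending
cg.add_node("g1", [("g1", 0xB4)])   # materializes m1->g1, adds g1->g1
print(cg.nodes)   # {'m1': {('g1', 160)}, 'g1': {('g1', 180)}}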
Example 4.7 Call graph construction: An example program and the corresponding CG are shown
in Figure 9. The example program consists of three high-level functions main, g and h where main
calls g and g makes an indirect call to a high-level function which is not statically known. Since main
and g both contain call instructions, Init RCG partitions main into two functions (m1 and m2), and g
into two functions (g1 and g2) by constructing the corresponding PCFGs (see Listing 2). The edges in
the CG show the transfer of control between the functions and are annotated with the addresses of the
instructions transferring control between the functions. Note that the indirect function call in g1 may
potentially call g1 itself, m1 and h but not g2 nor m2 because the beginning addresses of g2 and m2 do
not correspond to the beginning address of any high-level procedure/function. □









[Figure content: the source fragments of main (void main(void) { if (…) … }) and of g (void g(void) { void (*fp)(void); … fp = … … }), their partitioning into m1, m2, g1 and g2, and the CG with nodes for m1, m2, g1, g2 and h whose edges are annotated with the addresses of the call instructions.]
Figure 9: An example call graph (CG).
4.5.2 Using the CG to combine the reverse procedures/functions
Combine Procs() combines the reverse procedures/functions by systematically inverting the
edges in the CG of the original program T when generating the reverse program RT . Conse-
quently, in RT , Combine Procs() inserts branch or jump instructions at those points which
correspond to the destination locations of the edges in the CG of T . Also, Combine Procs()
may resort to state saving to determine the inter-procedural control flow of T ; therefore,
Combine Procs() may insert into T the instructions which save the required state.
Listing 5 shows the pseudo code for the Combine Procs() function. Combine Procs()
consists of five parts (namely, a, b, c, d and e). Parts a to c are related with the proce-
dures/functions that are immediately reached from multiple static locations in T . Parts d
and e, on the other hand, are related with the procedures/functions that are immediately
reached from a single static location each. If a procedure/function F is immediately reached
from a single static location whose address is A in the program under consideration (i.e.,
there is a single edge coming to the node of F in the CG and that edge has an annotation A),
then in the reverse code, the address RA corresponding to A is the unique address to which
the control has to be directed after the reverse of F , RF , is executed. This is easily handled
by inserting at the end of RF an unconditional branch instruction whose target address is
RA (line e.1 of Listing 5). However, if a procedure/function F is immediately reached from
multiple static locations (i.e., there are multiple edges coming into the node of F in the
CG), the location from which F is immediately reached during a specific execution of the
program and thus the corresponding location in the reverse code to which the control should
be directed after executing RF cannot be obtained statically. Therefore, in such a case,
Combine Procs() applies a dynamic technique, called the stack-tracing technique, to find the
location to which the control should be directed after executing RF . We will describe the
stack-tracing technique in the following paragraphs.
Let us assume that a subset ΦF of the procedures/functions in the program under consid-
eration are immediately reached from multiple static locations. We will designate the set of
the reverses of these procedures/functions as ΦRF . Thus, the remaining procedures/functions
Listing 5 Combine Procs(): Combining the reverse procedures/functions
a. At each address Aj ∈ ΦA where a recursive call is made or might be made (we say “might be made” as
Aj might be the address of an indirect call site), insert the instructions which perform the following:
1 check the top entry in M
2 if the flag of the top entry is ‘0’ then
3 if the index of the top entry is j then
4 push a new entry with a flag ‘1’ and a counter ‘2’ over the top entry in M
5 else
6 push a new entry with a flag ‘0’ and an index j over the top entry in M
7 end if
8 else /*the flag of the top entry is ‘1’*/
9 check the entry below the top entry as well
10 if the index of the entry below the top entry is j then
11 increment the counter of the top entry by one
12 else
13 push a new entry with a flag ‘0’ and an index j over the top entry in M
14 end if
15 end if
b. At any other address Aj ∈ ΦA, insert the instructions which perform the following:
1 push over the top entry in M an entry which has a flag ‘0’ and an index j
c. At the end of each procedure/function RF ∈ ΦRF , insert the instructions which perform the following:
1 check the top entry in M
2 if the flag of the top entry is ‘0’ then
3 extract the index id of the top entry and pop the top entry from M
4 else
5 decrement the counter of the top entry by one
6 extract the index id of the entry below the top entry
7 if the counter has reached zero then
8 pop the top two entries from M
9 end if
10 end if
11 find in X the address RAid ∈ ΦRA at the offset id × |A| from B and branch to RAid
d. At the end of each reverse procedure/function RF /∈ ΦRF of which forward procedure/function F is
called indirectly from a unique address Ak ∈ ΦA, insert the instructions which perform the following:
1 check the top entry in M
2 if the flag of the top entry is ‘0’ then
3 if the index of the top entry is k then
4 pop the top entry from M
5 end if
6 else /*the flag of the top entry is ‘1’*/
7 check the entry below the top entry as well
8 if the index of the entry below the top entry is k then
9 decrement the counter of the top entry
10 if the counter has reached zero then
11 pop the top two entries from M
12 end if
13 end if
14 end if
15 execute an unconditional branch to the corresponding address RAk ∈ ΦRA in the reverse code
e. At the end of each reverse procedure/function RF /∈ ΦRF of which forward procedure/function F is
immediately reached from an address A /∈ ΦA, insert the instructions which perform the following:
1 execute an unconditional branch to the corresponding address RA /∈ ΦRA in the reverse code
in the program are immediately reached from a single static location each. Also, assume that
there are a total of n locations from which control reaches the procedures/functions in ΦF .
We will designate the addresses of these n locations as ΦA = {A0, A1, . . . An−1} where a
subscript shows the unique index associated with a particular address. We will also designate
the corresponding n addresses in the reverse code as ΦRA = {RA0, RA1, . . . RAn−1}. There-
fore, after executing the reverse of a procedure/function F ∈ ΦF during reverse execution,
control should be directed to an address RAid if and only if the control has reached F from
the corresponding address Aid during forward execution (Aid ∈ ΦA, RAid ∈ ΦRA).
The addresses from which the control is transferred to the procedures/functions in ΦF
during a specific execution of the program under consideration can be found by saving
those addresses during forward execution. However, in a typical processor with a 32-bit
address bus (e.g., PowerPC 860), each address is 32 bits in length. In other words, a total
of 2^32 different addresses can be accessed through the address bus. On the other hand, in
a typical program, the total number of addresses in ΦA, and thus the maximum index in
ΦA, is typically much less than 2^32. Therefore, instead of saving the addresses themselves,
the stack-tracing technique saves the indices of those addresses and then matches the saved
indices to the addresses in ΦA. In this way, the memory requirement for keeping track of
the addresses may be reduced. To possibly reduce the memory requirement further, the
indices that are consecutively encountered during program execution and that have the
same value (which may happen with recursive procedure/function calls) are compressed into
two memory locations by the stack-tracing technique. The first memory location holds the
repeating index value and the second memory location holds a counter which shows how
many times the index value repeats. Thus, the stack-tracing technique keeps the following
two data structures to keep track of the addresses from which the control is transferred to
the procedures/functions in ΦF .
The first data structure is to keep the associated indices of the addresses from which
control dynamically reaches the procedures/functions in ΦF during a specific execution of
the program. For this purpose, the stack-tracing technique uses a memory space M which
is accessed like a stack. We choose the length of an entry in M to be sixteen bits. Each entry
is a 1-bit flag concatenated with a 15-bit value. Depending on the value of the flag bit, the
15-bit value designates either the index j of an address Aj ∈ ΦA (if the flag bit is ‘0’) or
a counter value (mentioned in the previous paragraph) which tells how many consecutive
index values the next entry in M represents (if the flag bit is ‘1’). When an entry is to be
made into M , the last entry in M is checked for a possible compression opportunity of the
new entry (see line a.1 of Listing 5). In order to prevent an accidental compression of the
first entry into M , which may happen if the irrelevant memory value just before M indicates
a valid index which is same as the first entry into M , the stack-tracing technique initially
inserts into M an entry with a flag ‘0’ and a dummy index which cannot be equal to the
index of any address in ΦA (or ΦRA).
The second data structure is to keep all the addresses in ΦRA to one of which the control
can immediately be directed after executing a reverse procedure/function in ΦRF . Hence, the
indices recorded in M can be matched to the addresses kept in this second data structure.
For this purpose, the stack-tracing technique uses an array X in which all the addresses
RA0 to RAn−1 are consecutively stored starting from a base address, say, B. Therefore, an
address RAj (0 ≤ j ≤ n − 1) is placed at a byte offset j × |A| from B where |A| is the
length (in bytes) of an address on the target processor and j is the index corresponding to
RAj (and thus to Aj). Note that obviously, X is constructed after the reverse program is
generated and the addresses in ΦRA are resolved.
Then, the stack-tracing technique inserts instructions both into the original and the
reverse code to invert the control flow between the procedures/functions that are immediately
reached from multiple static locations. The instructions inserted into the original code
handle the bookkeeping task by saving the dynamically encountered indices into M and
apply the mentioned compression for repeating index values (parts a and b of Listing 5).
The instructions inserted into the reverse code, on the other hand, retrieve the saved indices
from M , match the retrieved indices to the addresses stored in X and transfer the control
to the dynamically found addresses in this way (parts c and d of Listing 5).
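The bookkeeping can be summarized in a few lines of Python (ours; the sketch simplifies Listing 5 by applying the compression check of part a at every call site and by modeling M as a list of pairs).

M = [(0, -1)]                     # dummy entry: guards the first real
                                  # entry against accidental compression

def push_index(j):                # parts a/b: executed at a site A_j
    flag, val = M[-1]
    if flag == 0 and val == j:    # the same index repeats:
        M.append((1, 2))          # compress into a counter entry
    elif flag == 1 and M[-2][1] == j:
        M[-1] = (1, val + 1)      # counter is already counting j
    else:
        M.append((0, j))

def pop_index():                  # part c: executed at the end of an RF
    flag, val = M[-1]
    if flag == 0:
        M.pop()
        return val
    j = M[-2][1]                  # index entry below the counter
    if val == 1:
        del M[-2:]                # counter exhausted: pop both entries
    else:
        M[-1] = (1, val - 1)
    return j

X = {0: "RA0", 2: "RA2"}          # array X: index -> address in PhiRA
for j in (0, 2, 2):               # m1 calls g1; g1's recursive call
    push_index(j)                 # site A2 is encountered twice
print([X[pop_index()] for _ in range(3)])   # ['RA2', 'RA2', 'RA0']

Replaying the pushes of a short call history and then popping yields the reverse-code addresses in exactly the order the indirect branches need them.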
Example 4.8 Combining the reverse versions of the procedures/functions: Figure 10 illustrates
how Combine Procs() combines the reverse versions of the procedures/functions. A sample function
call history and the corresponding reverse function call order of the example program given in Figure 9
are shown in Figure 10. According to the sample function call history, the program under consideration is
forward executed starting from the beginning of m1 until the end of m2. Then, this execution is reversed
by executing the corresponding reverse program starting from the beginning of rm2 (the reverse of m2)
until the end of rm1 (the reverse of m1). The reverse function call sequence is marked with timestamps in Figure 10.













[Figure content: the forward and reverse function call orders of the program of Figure 9; the final state of M, whose entries (flag f, index i or counter c) are f = 0, i = 3; f = 0, i = 4; f = 1, c = 2; f = 0, i = 2; f = 0, i = 0; f = 1, c = 2; and a table giving, for the reverse functions rm1, rm2, rg1, rg2 and rh, the branch targets (among RA0 to RA4) determined at time instances 1 to 6 (RA3, RA3, RA4, RA2, RA2 and RA0, with rh's statically determined target RA2).]
Figure 10: An example of combining the reverse procedures/functions.
Since functions m2, g1 and g2 can be immediately reached from multiple static locations, Com-
bine Procs() inserts indirect branch (branch to link register) instructions to the end of the correspond-
ing reverse functions rm2, rg1, rg2 where the target addresses of these indirect branch instructions are
determined dynamically. On the other hand, since function h is immediately reached from a single static
location, an unconditional branch instruction with the hard-coded target address RA2 of the unique call
location of h (the end of g1) is inserted to the end of the corresponding reverse function rh (line e.1
of Listing 5). Figure 10 also shows a table which indicates the dynamically and statically determined
addresses of the branch instructions and at what timestamp instances those addresses are determined.
The final state of the data structure M at the end of forward execution of the program is shown in
Figure 10. Initially, M contains a dummy index entry to prevent the compression of the first valid entry
into M . At the end of function m1, when a call is made to function g1, the index ‘0’ of the address A0
(which is the address of the call location) is entered into M with a flag of ‘0’ which indicates that the
entry is an index value (line a.6 of Listing 5). Then, at the end of function g1, a recursive call is made
to g1, and the index ‘2’ of the address A2 is entered with a flag ‘0’ over the previous entry in M (line
a.6 of Listing 5). When the call point at the end of g1 is reached again, the index ‘2’ is supposed to be
entered into M again; however, since the index ‘2’ repeats, a counter ‘2’ with a flag ‘1’ is entered into
M instead (line a.4 of Listing 5). Similar steps are followed to enter the rest of the indices into M .
During reverse execution, the stack-tracing technique determines the target addresses of the indirect
branch instructions at the end of the reverse functions by checking the entries in M . At the end of
the reverse function rm2, the top entry in M is checked (line c.1 of Listing 5). Since the top entry
represents a counter value, the counter value ‘2’ is decremented by one (line c.5 of Listing 5) and the
index ‘3’ of the entry below the top entry is extracted (line c.6 of Listing 5). Then, the address RA3
corresponding to the extracted index ‘3’ is found in X and an indirect branch is executed to the found
address RA3 which is the beginning address of rg2 (line c.11 of Listing 5). When the end of rg2 is
reached during reverse execution, the top entry in M is checked (line c.1 of Listing 5). Since the top
entry is again a counter value, the counter value, which is now ‘1’, is decremented by one (line c.5 of
Listing 5) and the index ‘3’ of the entry below the top entry is extracted (line c.6 of Listing 5). However,
since the counter has reached ‘0’, the top two entries in M are popped this time (line c.8 of Listing 5).
Then, the address RA3 corresponding to the extracted index ‘3’ is found in X and an indirect branch is
executed to the found address RA3. Similar steps are followed during the rest of the reverse execution,
which results in the correct ordering of the reverse function calls shown in Figure 10. □
4.6 Summary of the overall RCG algorithm
The overall RCG algorithm is summarized in the flowchart in Figure 11. The RCG algorithm
first constructs a PCFG with labeled edges for every procedure/function in a program and
constructs the CG of the program (Box 1 in Figure 11). Then, the RCG algorithm enters a
main loop where the instructions of each procedure/function are read one after another and
the reverse program is built. At a confluence point of two or more edges in the PCFG of the
procedure/function currently being analyzed, the algorithm finds the predicate expressions
which determine via which incoming edge the confluence point will be reached dynamically
(Group 1 in Figure 11).
After an instruction is read, the RCG algorithm checks whether the instruction directly
modifies a register or a memory value. If yes, the RCG algorithm generates a RIG for the
read instruction (Group 2 in Figure 11).
 
[Flowchart content. Start → Box 1: Init_RCG(): construct the PCFGs of the procedures/functions, build the CG, label the edges of the PCFGs and go to the beginning of the program under consideration. Main loop: read an instruction α; Group 1: Find_CF(): if the current point is a confluence point, determine the predicate expressions. Group 2: Gen_RIG(): if α directly modifies a physical location, rename values and grow the DAG; if state saving is needed, generate a RIG for α with state saving, otherwise grow a partial RIG for α; if the RIG for α is not yet complete and α requires another traversal of a loop L, go back to the beginning instruction of L provided #traversals(L) < 3*, generating a RIG for α with state saving when the limit is exceeded and completing the RIG for α without state saving otherwise. Box 2: Combine_RIGs(): if a complete RIG for α is generated, write the RIG into the reverse procedure/function. At the end of a procedure/function, connect the reverse version of the procedure/function to the rest of the reverse program (Box 3); at the end of the program, End.
*: See Section 5.3]
Figure 11: A high-level flowchart of the RCG algorithm.
As described in Section 4.3.1, some instructions within a loop require more than one pass
over the loop body (excluding the initial pass over the whole program to generate the PCFGs
and the CG) before reverse code can be generated for those instructions without state saving.
Therefore, if an analyzed instruction is inside a loop and the generation of a RIG which does
not use any state saving for reversing the analyzed instruction requires another pass over
the loop body, the RCG algorithm traverses the loop body once more provided that the
total number of passes over the loop body will not exceed three. The maximum number of
traversals is set as three mainly to limit the time cost of the RCG algorithm. If such a RIG
cannot be generated within three passes over the loop body, the algorithm generates a RIG
which uses state saving for reversing the effect of the analyzed instruction.
After a RIG is generated for an analyzed instruction, that RIG is written into the reverse
procedure/function that is currently being constructed (Box 2 in Figure 11). When the
construction of the current reverse procedure/function is completed, the RCG algorithm
connects the constructed reverse procedure/function to the rest of the reverse program (Box 3
in Figure 11).
5 Filling in the Details of the RCG Algorithm
In this section, we will present the detailed descriptions of the PCFG edge-labeling algorithm,
predicate expression determination and the generation of the RIGs which have been omitted
in the overview of the RCG algorithm in Section 4. If a detailed understanding of the RCG
algorithm is not required, the reader may skip this section and move directly to Section 6.
5.1 PCFG edge-labeling algorithm
Edge labeling is performed by the function Label edges() which is called by Init RCG() on
lines 5 and 10 of Listing 2. Label edges() assigns a special label to every forward edge in
the PCFG of a procedure/function. As mentioned before, PCFG labeling is performed for
determining control flow predicates and reaching definitions in an efficient way at a particular
program point. Backward edges are not considered because giving labels to backward edges
helps neither in the determination of the predicate expressions nor reaching definitions.
Recall that since the PCFG construction is performed over assembly instructions, a BB in
the PCFG may have at most two outgoing edges.
Each label assigned to an edge indicates the union of one or more closed intervals on
a bounded nonnegative integer number axis. We name an interval [x,y] as a control flow
interval (CFI) and assign the interval [x,y] to an edge according to the structure of the
program (distinct edges can be assigned the same intervals). As the name CFI implies,
each interval specifies (or encodes) a region of control flow in the PCFG where each region
of control flow consists of all the BBs and forward edges that reside under only one of the
branches (true branch or false branch) out of a conditional branch instruction in the PCFG.
Therefore, each conditional branch instruction (except a conditional branch instruction which
is the source of a backward edge) defines two control flow regions (i.e., true region and false
region) which are separated from one another by that conditional branch instruction. To
better understand the control flow regions, consider the following example.
Example 5.1 Control flow regions: Figure 12 shows an example PCFG in which the control flow
regions are marked. In the figure, the edge from BB2 to the exit block falls into the true region of
the conditional branch instruction cb1 at the end of BB2. On the other hand, BB3, BB4 and the edges
connected to BB3 fall into the false region of cb1. As the definition of a control flow region implies,
control flow regions can be nested. For instance, in Figure 12, the false region of cb2 is nested under
the false region of cb1. □




[Figure content: a PCFG with its control flow regions marked; cb1 is the conditional branch “blt exit” at the end of BB2 and cb2 is the conditional branch “bgt exit” at the end of BB3. Legend: addi: add immediate; lwz: load word; stw: store word; cmpi: compare immediate; blt: branch if less than; bgt: branch if greater than; subi: subtract immediate; b: unconditional branch. The code is:

addi r2, r1, 8
lwz r4, 0(r2)
cmpi r4, 97
blt exit
cmpi r4, 122
bgt exit
subi r4, r4, 32
stw r4, 0(r2)
addi r2, r2, 4
b loop]
Figure 12: An example PCFG that shows the control flow regions.
By separating the PCFG of a procedure/function into a hierarchical structure of control
flow regions, the condition under which a specific edge is dynamically visited can be bound to
the predicates of the conditional branch instructions that separate those control flow regions.
We choose to bound the integer number axis between zero and 2^t − 1 where t is an
integer that should be greater than the maximum number of nested conditional branches in
a procedure/function body. An unsigned 4-byte integer can represent an integer number axis
bounded between zero and 2^32 − 1. Therefore, within an unsigned 4-byte integer, a maximum
of 31 nested conditional branches can be accommodated, a level of nesting which is hardly
ever seen in a procedure/function. Therefore, for all practical purposes, bounding the integer
number axis between zero and 2^32 − 1 will be more than enough for Label edges() to function
correctly. The code for handling greater than 31 nested conditionals is a special case which
will rarely, if ever, be invoked.
Listing 6 shows the operations Label edges() performs on the edges of the BBs in a PCFG.
In Listing 6, the notation L_{i,j}^{in} (L_{i,j}^{out}) designates the label of the j-th incoming (outgoing)
forward edge ∈ InFwdEdges (∈ OutFwdEdges) of a basic block BBi. Please note that a label
L_{i,j} consists of a set of one or more intervals or CFIs.
Listing 6 Label edges(): The PCFG edge-labeling algorithm
Input: A basic block BBi
Output: A label for each outgoing forward edge of BBi
begin
1 if BBi = start block then
2 L_{i,1}^{out} = [0, 2^t − 1]
3 else
4 {[x_1,y_1], . . . , [x_n,y_n]} = the union of the labels of the incoming forward edges of BBi
5 for k = 1 to n do
6 if |OutFwdEdges(BBi)| = 2 then
7 L_{i,1}^{out} ∪= [x_k, (x_k + y_k + 1)/2 − 1]
8 L_{i,2}^{out} ∪= [(x_k + y_k + 1)/2, y_k]
9 else if |OutFwdEdges(BBi)| = 1 then
10 L_{i,1}^{out} ∪= [x_k, y_k]
11 end if
12 end for
13 end if
end
Label edges() assigns to the
outgoing edge of the start block the label [0, 2^t − 1] which indicates all of the bounded
nonnegative integer number axis (line 2 of Listing 6). If BBi is not the start block,
Label edges() first calculates the union of the labels of the incoming forward edges of BBi where
the union operation is performed on the intervals indicated by the labels of the incoming
forward edges (line 4 of Listing 6). After the union operation, if BBi has two outgoing
forward edges, Label edges() divides each interval designated by the union of the incoming
forward edge labels into two equal portions. Then, Label edges() assigns the union of the
lower portions (coming from each interval) as a label Louti,1 to the outgoing forward edge on
the fall-through path (lines 5, 6, 7 of Listing 6). The union of the upper portions, on the
other hand, is assigned as a label Louti,2 to the outgoing forward edge on the target path (lines
5, 6, 8 of Listing 6). If BBi has only one outgoing forward edge, Label edges() assigns the
union of the incoming forward edge labels to that edge without any change (lines 5, 9, 10 of
Listing 6). A small sketch of this interval splitting is given below, and the following example then illustrates the edge-labeling algorithm.
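The sketch is in Python (ours); a label is modeled as a list of disjoint CFIs [x, y], and the interval union of line 4 is simplified to a concatenation, which suffices when the incoming labels are disjoint.

def label_outgoing(in_labels, two_out):
    """in_labels: labels of the incoming forward edges of BBi. Returns
    (fall_through_label, target_label) if BBi has two outgoing forward
    edges, else the single unchanged label (lines 9 and 10)."""
    union = sorted(iv for lab in in_labels for iv in lab)     # line 4
    if not two_out:
        return union
    low, high = [], []
    for x, y in union:
        mid = (x + y + 1) // 2
        low.append([x, mid - 1])                              # line 7
        high.append([mid, y])                                 # line 8
    return low, high

start = [[0, 255]]                        # t = 8, as in Example 5.2
bb1 = label_outgoing([start], two_out=False)        # [[0, 255]]
bb2_ft, bb2_tgt = label_outgoing([bb1], two_out=True)
bb3_ft, bb3_tgt = label_outgoing([bb2_ft], two_out=True)
print(bb2_ft, bb2_tgt, bb3_ft, bb3_tgt)
# [[0, 127]] [[128, 255]] [[0, 63]] [[64, 127]]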
Example 5.2 Edge-labeling algorithm: Figure 13 shows the PCFG of Figure 12 with its edges
labeled. For this example, the parameter t shown in Listing 6 is chosen as 8. Therefore, the outgoing
edge of the start block is given the label [0,255]. Since BB1 has only one outgoing forward edge, [0,255]
is assigned to BB1’s outgoing forward edge without any change. BB2 has two outgoing forward edges;
therefore, [0,255] is divided into two equal portions [0,127] and [128,255] and each portion is assigned
to one of the outgoing edges. BB3 has two outgoing forward edges as well. Therefore, Label edges()
divides the label [0,127] of the incoming edge of BB3 into two equal portions [0,63] and [64,127] and
assigns each portion to one of the outgoing forward edges of BB3. Since BB4 has no outgoing forward
edge, no labeling occurs for BB4. All the CFIs formed are shown in Figure 14. Note that in this example,
the backward edge of the loop (“b loop”) is not labeled. □












[Figure content: the PCFG of Figure 12 (the code of BB1 through BB4, ending with “b loop” and the exit block) with its forward edges labeled: [0,255] on the outgoing edges of the start block and of BB1, [0,127] and [128,255] on the outgoing edges of BB2, and [0,63] and [64,127] on the outgoing edges of BB3.]
Figure 13: An example PCFG with labeled edges.
 




[Figure content: the bounded number axis divided into the control flow intervals CFI 1, CFI 2 and CFI 3, marked with the true and false regions of the conditional branches (cb) cb1 and cb2.]
Figure 14: The control flow intervals for the PCFG in Figure 13.
5.2 Predicate expression determination
Predicate expression determination is performed by the function Find CF() which is called
by the main function of the RCG algorithm (see line 6 of Listing 1). We gave an overview
of this function in Section 4.2. Now, we will give the details behind the predicate expression
determination.
A confluence point P in a PCFG is dynamically reached via an incoming edge e if
the innermost control flow region in which e resides is dynamically visited. Therefore, the
predicate expression Υ which, when true, causes P to be reached via an incoming edge e will
simply be an appropriate combination of the predicates of the relevant conditional branch
instructions which cause the innermost control flow region which contains e to be visited.
However, a simplification can be made in Υ in certain cases: Suppose that a particular
conditional branch instruction, say cb, defines two control flow regions Rtrue (that is under
the true branch of cb) and Rfalse (that is under the false branch of cb). Suppose further that
Rtrue (or Rfalse) encapsulates the innermost control flow region in which a particular edge
e coming to P resides. Therefore, in order for e to be visited passing through cb during a
35
specific execution of the program under consideration, the predicate of cb must take the true
(or false) value. However, if (1) no other edge coming to P is reached through cb or (2) if
the other edges coming to P that are also reached through cb reside only in Rtrue (or Rfalse)
as well, then the predicate of cb does not play a role in the separation of the condition that
causes P to be visited via e from the conditions that cause P to be visited via the other
incoming edges. This is because of the following reason: in both cases (1) and (2) above,
if P is reached via an incoming edge that is reached through cb, then we definitely know
that the predicate of cb is true (or false); otherwise, we definitely know that cb has not been
evaluated at all. Therefore, in either case (1) or (2), the predicate of cb can be removed from
the predicate expressions determined for the incoming edges that are reached through cb.
Since the edge labels encode control flow regions, determination of the hierarchy of the
control flow regions in which e resides and thus the relevant conditional branch instructions
to use can be accomplished very easily by using the edge labels. Consider the following
example.
Example 5.3 Control flow predicate determination: Suppose that we want to find the predicate
expressions which determine via which incoming edge the exit block in Figure 13 will be reached dynami-
cally. The incoming edge labels of the exit block are [64,127] and [128,255] for the left and the right
incoming edges, respectively. As seen in Figure 14, [64,127] corresponds to the CFI where the predicate,
r4 < 97, of the conditional branch cb1 (“blt exit” in Figure 13) is false and the predicate, r4 > 122,
of the conditional branch cb2 (“bgt exit” in Figure 13) is true. On the other hand, within [128,255],
only the predicate r4 < 97 is true. Therefore, the exit block will dynamically be reached via the left
incoming edge if the predicate r4 < 97 is false and the predicate r4 > 122 is true. On the other hand,
the exit block will dynamically be reached via the right incoming edge if the predicate r4 < 97 is true.
Since the CFI which corresponds to the false value of the predicate r4 > 122 is not spanned by any of
the incoming edge labels, the value of predicate r4 > 122 is irrelevant in this case (i.e., a simplification
in the predicate expressions can be applied here). Therefore, the predicate expression associated with
the right incoming edge will be r4 < 97, and the predicate expression associated with the left incoming
edge will be the complementary predicate expression r4 ≥ 97. □
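The simplification rule lends itself to a direct implementation. In the following Python sketch (ours), edge labels and the control flow regions of each conditional branch are expanded to explicit point sets for clarity; reproducing Example 5.3 yields exactly the simplified predicate expressions derived above.

def predicate_for_edge(e, edge_labels, branches):
    """edge_labels: {edge: set of axis points}; branches: {name:
    (predicate, true_region, false_region)}. Returns the predicate
    expression under which the confluence point is reached via edge e."""
    others = set().union(*(s for k, s in edge_labels.items() if k != e))
    conj = []
    for name, (pred, t_reg, f_reg) in branches.items():
        if edge_labels[e] <= t_reg and others & f_reg:
            conj.append(pred)              # cb must have been true
        elif edge_labels[e] <= f_reg and others & t_reg:
            conj.append(f"not ({pred})")   # cb must have been false
        # otherwise cb does not separate e from the other edges
    return " and ".join(conj) or "true"

# Example 5.3: the exit block of Figure 13.
labels = {"left": set(range(64, 128)), "right": set(range(128, 256))}
cbs = {"cb1": ("r4 < 97",  set(range(128, 256)), set(range(0, 128))),
       "cb2": ("r4 > 122", set(range(64, 128)),  set(range(0, 64)))}
print(predicate_for_edge("right", labels, cbs))   # r4 < 97
print(predicate_for_edge("left",  labels, cbs))   # not (r4 < 97)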
Note that since backward edges are not labeled, the predicate expressions which deter-
mine whether a loop will dynamically be reached via an incoming backward edge or a forward
edge cannot be found by the method explained. Therefore, in such a case, we follow another
approach which we call the loop counter approach. Before explaining the loop counter ap-
proach, it would be appropriate to make the definitions of some basic terms about a loop as
we will use these terms during the explanation of the loop counter approach.
Definition 5.1 Loop: In a PCFG, a loop is a strongly connected component (SCC) of the PCFG. A
SCC is a subgraph GS = (NS ,ES) of the PCFG, such that there exists a path from every node in NS
to every other node in NS. A loop header Nlh ∈ NS is a node with an incoming edge from a node
which is not in NS . Since a loop may be entered from multiple points, a loop may have more than
one loop header. A loop preheader Nlp /∈ NS is a node which is an immediate predecessor of a loop
header node. A loop is associated with a unique backward edge eb ∈ ES which defines that loop. This
backward edge is the outermost backward edge within the loop. A loop tail Nlt ∈ NS is a node which
is the source of the backward edge eb. □
36
Since a loop can only be entered through a loop header, finding how a loop is reached
is equivalent to finding how a loop header is reached. Thus, we only consider loop header
blocks. We assume that a loop header block has n incoming backward edges designated as
Eb = {eb1 , eb2 , . . . , ebn} (n ≥ 1) and m incoming forward edges designated as
Ef = {ef1, ef2, . . . , efm} (m ≥ 1). Since each loop is associated with a unique backward
edge, each incoming backward edge in Eb belongs to a different loop.
Then, the loop counter approach works as follows. Find CF() assigns a dedicated loop
counter to each loop defined by each backward edge in Eb. We will designate these loop
counters as LC = {LC1, LC2, LC3, . . . , LCn}. At each preheader of each loop, Find CF()
inserts an instruction which initializes the corresponding loop counter to zero; and at the loop
tail block of each loop, Find CF() inserts an instruction which increments the corresponding
loop counter by one. Furthermore, at the reverse version of the loop tail block of each loop,
Find CF() inserts an instruction which decrements the corresponding loop counter by one.
Therefore, during forward execution, if a loop header block is reached along an incoming
forward edge of that block, all the loop counters in LC must have a value of zero; otherwise,
at least one of the loop counters in LC must have a value which is greater than zero. Thus,
if there is only one forward edge coming to the loop header block, that forward edge is
associated with the predicate expression Υf = (LC1 == 0 ∧ LC2 == 0 ∧ . . . ∧ LCn == 0);
and if there is more than one forward edge coming to the loop header block, then each
predicate expression that is associated with each forward edge (by the explained method
in the beginning of this subsection) is ANDed with the predicate expression Υf . Each
backward edge ebi ∈ Eb (1 ≤ i ≤ n) of the loop header block, on the other hand, is
associated with a predicate expression LCi > 0 where LCi is the loop counter dedicated to
the loop containing ebi .
Note that a loop counter LC associated with a loop L is preferably kept as a register in
order to minimize memory and time overheads during forward execution. If a free register
cannot be found to keep LC, an occupied register which is not used within L is freed up
by spilling the value in the register into memory at each preheader of L (i.e., just before
L is entered). Then, at the beginning of each BB to which there is an exit from L, the
spilled value is written back to the register used as LC. However, if a suitable occupied
register cannot be found, LC is kept in memory. We illustrate the loop counter approach in
Example 5.7 in Section 5.4.
5.3 Details of reverse instruction group generation
In this section, we will give a detailed description of how, given an instruction α, Gen RIG()
generates a RIG able to reverse the effects of instruction α. Listing 7 shows the pseudo code
of Gen RIG(). Throughout this section, we will explain the functions that are called by the
Gen RIG() function.
We mentioned in Section 4.3 that in order to recover the value of a variable destroyed by
an instruction α, the first thing to do is to find out the reaching definitions for the variable
at the program point just before α (line 12 of Listing 7). This is because the definition
destroyed by α is indeed equal to one of the reaching definitions at a specific execution of the
program under consideration. We also mentioned in Section 4.3 that Gen RIG() applies a
technique called value renaming to find the reaching definitions easily (line 11 of Listing 7).
Listing 7 Gen RIG(): Generating a reverse instruction group
Input: An instruction α
Output: A RIG for α
begin
1 if α is the first instruction of a basic block BBi then
2 n = |IncomingFwdEdges(BBi)|
3 if n > 1 then
4 Pseudo defns = Generate Pseudo Defns()
5 if Pseudo defns ≠ NULL then
…
10 OT = target operand(s) of α
11 Rename(OT)
12 RdT = Find Reaching Defn(OT)
13 Grow DAG(RdT)
14 Recover(RdT)
…
end
Therefore, let us now explain the value renaming operation of Gen RIG() and then explain
how reaching definitions are determined by Gen RIG().
5.3.1 Value renaming
Value renaming is the assignment of a different name to every definition of a variable (i.e., a
directly modified register or memory location). Value renaming is handled by the Rename()
function called by Gen RIG() on line 11 of Listing 7. By value renaming, Gen RIG() can
easily distinguish different definitions reaching a particular point in a PCFG.
In our approach, different renamed values are designated by r_i^j and m_k^j for registers and
memory locations, respectively. Here, i (i = 0, 1, 2, . . . ) and k (k = 0, 1, 2, . . . ) indicate
the physical locations, and j (j = 0, 1, 2, . . . ) indicates the unique index of a particular
renamed value (renamed during program analysis). Index j = 0 is always used to refer to
the initial value of a register or a memory location. Let us give an example of how register
values are renamed in our approach:
Example 5.4 Value renaming for registers: Consider the following instruction sequence:
addi r2, r1, 8 //r2 = r1 + 8
addi r2, r2, 4 //r2 = r2 + 4
The initial values of the registers are given the names r_1^0 and r_2^0 for r1 and r2, respectively. Then,
the first instruction generates a new value designated by r_2^1 by using the values r_1^0 and ‘8’. After that,
the second instruction generates another value designated by r_2^2 using the values r_2^1 and ‘4’. □















[Figure content: a typical memory layout in which the addresses of global stores are formed from the fixed base address of the global data section plus an offset, and the addresses of local stores from the frame pointer plus an offset.]
Figure 15: A typical memory organization made by a compiler.
Renaming memory values is not as easy as renaming register values. This is because a
memory location being written by an instruction is not always apparent within the instruc-
tion encoding, which is the case for indirect addressing (please note that even if a memory
location being written by an instruction is not apparent within the encoding of the instruc-
tion, that memory location is still directly modified by the instruction if the instruction encoding
includes at least one operand which is used to point to the modified memory location). Con-
sequently, it might be hard to determine whether two memory stores made by two different
instructions are to a same location or not. Fortunately, there is a way to distinguish the
target memory location of an unambiguous memory store from the other stores even if the
written location is not apparent within the instruction encoding.
Figure 15 shows a memory organization made by a typical compiler. The addresses of
all local stores within a procedure/function can be expressed as a summation of the value of
the frame pointer (or the stack pointer if the frame pointer is not available as a dedicated
register) and the offset used for the store. The addresses of global stores in a program can
be expressed in a similar way, but by using the base address of the global data section of
the executable code in place of the frame pointer [3]. The important point here is that the
base address of the global data section is fixed throughout the execution of a program and
the value of the frame pointer is fixed throughout the execution of a procedure/function.
Therefore, knowing the offset value used for a memory store is sufficient for distinguishing
the target location of that memory store from the target locations of the other memory stores
in the intra-procedural analysis of the RCG algorithm.
The offset values used for unambiguous memory stores (e.g., those for ordinary variables,
pointers with statically known targets and arrays with statically known indices) are statically
apparent in an executable code. This means that the locations of unambiguous memory
stores can be determined statically and, thus, value renaming can be done for those memory
stores without any problem. However, the offset values of ambiguous memory stores (e.g.,
those for pointers aliased to statically unknown variables or arrays with statically unknown
indices) are not statically apparent. If an offset value in a memory store cannot be found
statically, Rename() still assigns a distinct name to the stored value as if that value were
written into a physical memory location that had never been accessed before; however, to
be conservative, we assume that the memory store is capable of changing the value of any
memory location. The following example illustrates how value renaming is performed for
memory locations.
Example 5.5 Value renaming for memory locations: Consider the following instruction sequence:
stw r2, 4(r4) //mem[r4 + 4] = r2
stw r5, 8(r5) //mem[r5 + 8] = r5
The first instruction writes the contents of r2 into the memory location at the address r4 + 4 and
the second instruction writes the contents of r5 into the memory location at the address r5 + 8. If
these are local accesses, r4 will be sp+offset1 and r5 will be sp+offset2 where sp is the stack pointer
and offset1 and offset2 are the offsets of r4 and r5 from the stack pointer. Therefore, the first memory
store will be to the address sp+offset1+4, while the second will be to the address sp+offset2+8. If,
for instance, offset1 and offset2 are found to be '12' and '8', respectively, then the two renamed values
for the target operands will be m_0^1 and m_0^2, respectively; in other words, both memory stores will be to
the same memory location. On the other hand, if, for instance, offset1 and offset2 are found to be '12'
and '4', respectively, then the two renamed values will be m_0^1 and m_1^1, instead. In other words, the two
memory stores will be to distinct memory locations. However, if the values of offset1 and offset2 cannot
be determined statically, then the two renamed values will be m_0^1 and m_1^1 (i.e., Rename() will name the
written values as if they were written into distinct memory locations) and the physical locations indexed
as m0 and m1 will be treated as if they might coincide with any physical memory location. □
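To make this bookkeeping concrete, the following C sketch mimics the renaming performed in
Examples 5.4 and 5.5. It is our illustration rather than the RCG implementation; the function
names, the fixed table sizes and the printed output are assumptions made for the sketch.

    #include <stdio.h>

    #define NUM_REGS  32
    #define MAX_SLOTS 64

    static int  reg_version[NUM_REGS];   /* renaming r2 twice yields r2^1, r2^2 */
    static long slot_addr[MAX_SLOTS];    /* static address behind each m-index  */
    static int  slot_version[MAX_SLOTS];
    static int  num_slots;

    /* Rename a register definition: bump the register's version counter. */
    static void rename_reg(int r)
    {
        printf("r%d^%d\n", r, ++reg_version[r]);
    }

    /* Rename a memory store. A store whose address (frame pointer or global
     * base plus offset) is statically known reuses the m-index of that address;
     * an ambiguous store (addr_known == 0) gets a fresh m-index and must
     * additionally be treated as possibly overlapping every other location. */
    static void rename_store(int addr_known, long addr)
    {
        int i;
        if (addr_known)
            for (i = 0; i < num_slots; i++)
                if (slot_addr[i] == addr) {        /* same static location */
                    printf("m%d^%d\n", i, ++slot_version[i]);
                    return;
                }
        i = num_slots++;                           /* first (or unknown) location */
        slot_addr[i] = addr_known ? addr : -1;
        printf("m%d^%d\n", i, ++slot_version[i]);
    }

    int main(void)
    {
        rename_reg(2); rename_reg(2);   /* Example 5.4: prints r2^1, then r2^2  */
        rename_store(1, 12 + 4);        /* Example 5.5: sp+offset1+4 -> m0^1    */
        rename_store(1, 8 + 8);         /* sp+offset2+8, same address -> m0^2   */
        return 0;
    }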
5.3.2 Determination of reaching definitions
Reaching definitions at a procedure/function point are determined by Find Reaching Defn()
which is called by Gen RIG() on line 12 of Listing 7. Find Reaching Defn() finds reaching
definitions in a procedure/function using the labels on the forward edges of the PCFG of
that procedure/function. Therefore, the RCG algorithm labels all the forward edges of the
PCFG under consideration prior to reaching definition determination.
Since Find Reaching Defn() determines reaching definitions while the instructions are
read one by one in the main loop of the RCG algorithm shown in Listing 1, loop-carried
definitions cannot be determined before the whole loop is read, which requires at least
one complete pass over the loop body. Therefore, during the first traversal of the loop,
Gen RIG() generates the RIGs by using only the definitions that come from outside of the
loop (i.e., reverse code is generated for the first iteration of the loop only), and during the
next traversal, the loop-carried definitions are used. However, passes over the loop body
might not be limited to two due to a loop constraint explained in Section 4.3.1.
To determine reaching definitions, Find Reaching Defn() should associate all the defini-
tions encountered during the analysis of a procedure/function with the locations where those
definitions are encountered. Since at most one definition can reach a point from an innermost
control flow region, it is sufficient to associate a definition with the innermost control flow
region in which that definition is made. For this purpose, a table called the renaming table
Fields/Records    r1    r5    m1    m2
CFI 1
CFI 2

Figure 16: The renaming table structure.
is kept by Find Reaching Defn() (Figure 16). The renaming table has a record for every
physical location (e.g., r1, r2, m1, . . . ) that has been modified in a procedure/function up to
the instruction currently being analyzed. As more locations are modified, more records are
added to the renaming table dynamically. Every record in the renaming table has a field for
each CFI produced in a procedure/function body. Initially, all the fields in a newly added
record in the renaming table contain the initial value of the corresponding physical location.
The fields to be used for an entry when analyzing a basic block BBi are determined
by applying the following rule:

    Fields ↦ { c | x_k ≤ L(c) ∧ U(c) ≤ y_k, 1 ≤ k ≤ n, c ∈ CFIs }        (1)

where [x_k, y_k], 1 ≤ k ≤ n, are the labels on the n incoming forward edges of BBi.
L(c) and U(c) designate, respectively, the lower and upper bounds of a CFI (as stated at
the beginning of this section, CFI calculation has been done already by an initial pass over
the procedure/function). According to the above rule, a renamed value generated within
BBi is written into the renaming table fields that correspond to the CFIs spanned by the
labels on all incoming forward edges of BBi.
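As a concrete reading of rule (1), the C fragment below marks the fields that receive a renamed
value generated within BBi. This is our sketch only; flattening one record of the renaming table
into an array of fields, one per CFI, is an assumption made for the illustration.

    typedef struct { int lo, hi; } Interval;  /* a CFI [L(c),U(c)] or a label [x,y] */

    /* Enter renamed value val into the field of every CFI that is spanned by at
     * least one label [x_k, y_k] on an incoming forward edge of BB_i (rule (1)). */
    static void enter_value(const char *val,
                            const char *field[], const Interval cfi[], int ncfi,
                            const Interval label[], int nlabel)
    {
        int c, k;
        for (c = 0; c < ncfi; c++)
            for (k = 0; k < nlabel; k++)
                if (label[k].lo <= cfi[c].lo && cfi[c].hi <= label[k].hi) {
                    field[c] = val;            /* x_k <= L(c) and U(c) <= y_k */
                    break;
                }
    }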
However, applying rule (1) alone does not handle everything necessary for the determination
of reaching definitions. In addition to rule (1), Gen RIG() performs
two more actions: First, as stated in Section 5.3.1, we assume that an ambiguous memory
store (e.g., using an ambiguous pointer) may change any memory location. Due to this as-
sumption, a renamed value generated for an ambiguous memory store and entered into some
renaming table field(s) according to rule (1) deletes the entries in the same field(s) of the
records belonging to other memory locations. Second, as mentioned before in Section 5.1,
the edge-labeling algorithm allows the assignment of the same labels to distinct edges in a
PCFG. This happens when distinct edges merge together at a confluence point in the PCFG,
and after that, they diverge again. If there are two renamed values of a variable where one
of the renamed values is given before a confluence point in the PCFG and the other is given
after the confluence point, the latter may overwrite the former in the renaming table. This
is because both of the renamed values might have to be entered into the same fields due to
the assignment of the same labels to the edges before and after the confluence point. Consequently,
at a point which both definitions statically reach, the latter definition might
hide the former definition. In order to prevent this situation, when the analysis reaches a
confluence point P in the PCFG of a procedure/function, Gen RIG() combines the distinct
definitions of a variable reaching P under a new pseudo definition by calling a function Gen-
erate Pseudo Defns() (see lines 1 to 4 of Listing 7). The pseudo definition is renamed as
any other ordinary definition and is entered into the renaming table fields that correspond
to the CFIs spanned by the labels on all the forward edges joining at P. However, as will
be described in the next subsection, the combined reaching definitions are not completely
thrown away but are represented by the pseudo definition in another data structure instead.
At a loop header block where a backward edge joins with another incoming edge, Gen RIG()
delays the generation of the pseudo definitions due to the confluence of these edges until the
whole loop is analyzed by Gen RIG(). However, since backward edges are not labeled, edge
labels cannot be used directly to find the loop carried definitions. Therefore, at the end of
each pass over a loop body, Gen RIG() carries the definitions reaching the end of the loop
tail block to the target of the backward edge of the loop (lines 15 to 17 of Listing 7). The
pseudo definitions are similar in concept to the pseudo assignments of φ-functions in the
SSA form generation; however, in the RCG algorithm, no prior search for the places of the
φ-functions takes place [22].
Finally, reaching definitions at a point P during the analysis can be determined simply
by querying the renaming table fields at P. If P is the entrance of a basic block BBi, the
statically reaching definition of a variable V along an incoming forward edge ej of BBi is
the definition in the renaming table fields corresponding to the CFIs that are spanned by
the label on ej. If P is inside BBi, on the other hand, the statically reaching definition of V
is the definition in the renaming table fields which correspond to the CFIs spanned by the
labels on all of the incoming forward edges of BBi (we speak of a unique statically reaching
definition of V along an ej or within a BBi because multiple definitions are merged under
a pseudo definition at confluence points and are represented by that pseudo definition). Let
us now give an example of how reaching definitions are determined using edge labels and the
renaming table.
Example 5.6 Determination of reaching definitions: Consider the PCFG in Figure 17(a). Suppose
that the RCG analysis is currently at the program point shown as P2 in Figure 17(a) and we want to
determine reaching definitions of register r1 at P2. The CFIs and the renaming table generated for
this PCFG are shown in Figures 17(b) and 17(c), respectively (note that the renaming table shows the
entries that are generated up until the current point P2). For the sake of explanation, overwritten
entries are also shown in the renaming table. When the analysis reaches the definition in BB2, a new
value, r_1^1, is generated for r1 and is entered into the renaming table field which corresponds to the CFI
spanned by the label on the incoming edge of BB2: CFI 1. The same operation is repeated for the
definitions in BB3 and BB5 with the corresponding renamed values r_1^2 and r_1^4, respectively. When
the confluence point P1 is reached, Gen RIG() combines the definitions of r1 reaching P1 under a
pseudo definition that is renamed as r_1^3, and then Gen RIG() enters r_1^3 into the renaming table
fields which correspond to the CFIs spanned by the labels on the joining edges at P1: CFI 1 and CFI 2.
Figure 17: (a) A simple PCFG (cb: conditional branch). (b) Corresponding CFIs: CFI 1 and
CFI 2 partition [0,255] at 127/128. (c) Corresponding renaming table.
When P2 is reached, the reaching definitions of r1 are determined by querying the renaming table
fields which correspond to the CFIs spanned by the labels on the incoming edges at point P2. The
entry corresponding to the left incoming edge (that of CFI 1) designates that the reaching definition
of r1 via the left incoming edge is r_1^4. On the other hand, the entry corresponding to the right
incoming edge (that of CFI 2) designates that the reaching definition of r1 via the right incoming
edge is r_1^3, the pseudo definition which represents r_1^1 and r_1^2 together. These definitions had
to be merged: if r_1^1 and r_1^2 were not combined under the pseudo definition r_1^3, r_1^4 would hide
the reaching definition r_1^1 at P2 since r_1^1 would have already been overwritten by r_1^4. □
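In the same illustrative setting, the query step of Find Reaching Defn() reduces to reading the
field of a CFI spanned by the relevant edge label; a NULL result corresponds to an entry deleted
by an ambiguous store, for which Recover() falls back to state saving (Section 5.3.3). The sketch
below is ours, under the same assumptions as the previous fragment.

    #include <stddef.h>

    typedef struct { int lo, hi; } Interval;

    /* Statically reaching definition of a variable at the entrance of BB_i
     * along an incoming forward edge labeled [x, y]. */
    static const char *find_reaching_defn(const char *field[],
                                          const Interval cfi[], int ncfi,
                                          Interval label)
    {
        int c;
        for (c = 0; c < ncfi; c++)
            if (label.lo <= cfi[c].lo && cfi[c].hi <= label.hi)
                return field[c];  /* all spanned fields hold the same definition */
        return NULL;
    }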
5.3.3 Recovery of a destroyed variable
After finding the reaching definition for a variable that is modified by an instruction α,
Gen RIG() generates a RIG which reverses the effect of α by recovering the reaching defini-
tion found for the variable. This recovery is handled by the function Recover() (called from
line 14 of Listing 7). However, before Gen RIG() calls Recover(), Gen RIG() first prepares a
directed acyclic graph (DAG), DAG = (N, E), for Recover() to use. For constructing a DAG,
Gen RIG() calls the Grow DAG() function (see lines 6 and 13 of Listing 7). Grow DAG()
adds nodes and edges to the DAG both for the renamed values of the operands of α (or for
the definitions made and used by α) and for the pseudo definitions which are generated at
the confluence points. These nodes and edges together specify the relationship (or the data
dependency) of a destroyed reaching definition with the other definitions generated in the
procedure/function. Using this relationship, Recover() can recover the reaching definitions
of the variable modified by α. The sets N and E of the DAG include the following:
• N={R,M} where R and M are the sets of renamed register and memory values,
respectively.
• There is a directed edge eij ∈ E from node ni ∈ N to node nj ∈ N designated
by ni → nj if (1) ni and nj are the renamed values for target and source operands
of an instruction α, respectively, or (2) ni is a renamed memory value and nj is a
renamed register value determining the location of ni, or (3) ni and nj are the renamed
values for a pseudo definition and a combined definition under that pseudo definition,
respectively.
Therefore, a node is inserted into the DAG for each definition in the procedure/function
under consideration. Multiple definitions of a variable statically reaching a confluence point
are merged under another node in the DAG: the node of the pseudo definition that repre-
sents those multiple statically reaching definitions. At a later confluence point in the proce-
dure/function, a pseudo definition of a variable may again be merged with other pseudo or
normal definitions of that variable reaching that confluence point.
Grow DAG() also applies some annotations on particular nodes and edges in the DAG
to provide the necessary information for the recovery of a destroyed value: in cases (1) and
(2) above, node ni is annotated with the address of α to show for which instruction ni is
generated. In case (3) above, node ni is annotated by a special select (S) operator to show
that ni is generated for a pseudo definition. Also, since a pseudo definition cannot be directly
used to recover a destroyed value (but one of the combined definitions represented by that
pseudo definition can be), in case (3) above, the condition (or the predicate expression)
under which the pseudo definition ni will be equal to the renamed value nj is attached as an
annotation to the edge eij from node ni to node nj.
A node ni in the DAG can have at most one of the following attributes at a point P:
killed, available and partially-available. Node ni is killed at P if the value of ni does not reach
P ; ni is available at P if the value of ni reaches P along all paths; and ni is partially available
at P if the value of ni reaches P along some path controlled by a predicate expression (i.e.,
ni is the value of a combined definition).
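The DAG bookkeeping just described can be pictured with the following C declarations; these
are our sketch of one possible layout, and the field names (and the textual representation of
predicate expressions) are assumptions:

    typedef enum { KILLED, AVAILABLE, PARTIALLY_AVAILABLE } Attr;

    typedef struct Edge Edge;
    typedef struct Node Node;

    struct Edge {
        Node       *to;          /* an edge n_i -> n_j                           */
        const char *predicate;   /* case (3) only: the condition under which the */
                                 /* pseudo definition equals this combined one   */
        Edge       *next;        /* next outgoing edge of the same node          */
    };

    struct Node {
        const char   *name;      /* renamed value, e.g. "r2^1" or "m0^2"         */
        unsigned long insn;      /* cases (1)-(2): address of the instruction    */
        int           is_select; /* case (3): node carries the S operator        */
        Attr          attr;      /* killed / available / partially-available     */
        Edge         *out;       /* outgoing edge list                           */
    };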
Suppose that an instruction αdest destroys the value D of a variable V at a proce-
dure/function point. Let us name the points just before and after αdest as P and P′,
respectively. In order to recover D, Gen RIG() tries to find the reaching definition of V at
point P by calling Find Reaching Defn() on line 12 of Listing 7 (remember that in case
there are multiple reaching definitions of V at point P, these definitions are represented by
a unique pseudo definition due to the merging operation). A definition cannot be found only
if the corresponding entry/entries was/were deleted in the renaming table due to an am-
biguous memory store (see Section 5.3.2). In this case, Recover() recovers D by generating
state saving instructions. If a definition can be found, on the other hand, Recover() finds
in the DAG the node that corresponds to the found reaching definition. Suppose that the
found node is ni. Since D is destroyed by αdest, node ni is killed at point P′. Now, if one or
both of the following are true at P′, Recover() can recover ni by generating the appropriate
instructions.
(a) All nj’s, where there exists an edge ni → nj, are available and ni and nj’s are the
values of the operands of an instruction α.
(b) An nj, for which there exists an edge nj → ni, is available and all nk's, nk ≠ ni, for
which there exists an edge nj → nk, are available as well. Moreover, ni, nj and all nk’s
are the values of the operands of an instruction β which allows ni to be extracted out
of β.
If (a) holds, ni can be recovered at P′ by executing α without any change (i.e., by the
redefine technique). On the other hand, if (b) holds, ni can be recovered at P′ by extracting
ni out of β (i.e., by the extract-from-use technique). In addition, if any node nj that is
needed for recovering ni is partially-available (i.e., nj is the value of a combined definition),
controlled by a predicate expression Υ, then ni might be partially recovered at P′ (the
predicate expression Υ is obtained by the annotations on the edges coming to nj in the
DAG). To recover ni totally, ni must be partially-recoverable for all values of Υ. In this
case, the reverse code for recovering ni will be gated by Υ. If Υ is destroyed itself, the nodes
determining Υ’s value must be recovered as well. Finally, note that these actions can be
applied recursively, that is, if a node nj that is required to recover ni is killed, then ni might
still be recovered by recovering nj first. If the number of recursions exceeds a predetermined
number which is set by the programmer, or the recovery of a node requires the knowledge
of the value of an external input of the procedure/function under consideration, Recover()
generates state saving instructions to recover the killed node.
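The recovery search can then be sketched as the following recursive C routine. This is our
simplification: it ignores predicates and partial availability, the printf calls stand in for actual
reverse instruction generation, and a real implementation would emit code only after the whole
search has succeeded.

    #include <stdio.h>

    typedef enum { KILLED, AVAILABLE } Attr;

    typedef struct Node {
        const char   *name;   /* renamed value                                   */
        Attr          attr;
        struct Node **opnd;   /* edges n_i -> n_j: operands that produced n_i    */
        int           nopnd;
        struct Node **use;    /* nodes n_j with an edge n_j -> n_i (uses of n_i) */
        int           nuse;
    } Node;

    /* Returns 1 (and prints reverse code) if n_i is recoverable within the
     * recursion budget, 0 otherwise. */
    static int recover(Node *ni, int depth, int max_depth)
    {
        int i, j, ok;
        if (ni->attr == AVAILABLE)
            return 1;                            /* value still live           */
        if (depth >= max_depth)
            return 0;                            /* recursion budget exhausted */
        /* (a) redefine: make every operand n_j available, re-execute alpha.  */
        if (ni->nopnd > 0) {
            for (ok = 1, i = 0; i < ni->nopnd && ok; i++)
                ok = recover(ni->opnd[i], depth + 1, max_depth);
            if (ok) { printf("redefine %s\n", ni->name); return 1; }
        }
        /* (b) extract-from-use: find an available use n_j of n_i whose other */
        /* operands n_k are available, then solve n_i out of that instruction.*/
        for (i = 0; i < ni->nuse; i++) {
            Node *nj = ni->use[i];
            if (nj->attr != AVAILABLE)
                continue;
            for (ok = 1, j = 0; j < nj->nopnd && ok; j++)
                if (nj->opnd[j] != ni)
                    ok = recover(nj->opnd[j], depth + 1, max_depth);
            if (ok) { printf("extract %s from %s\n", ni->name, nj->name); return 1; }
        }
        return 0;
    }

    /* Gen_RIG() falls back to state saving when the search fails. */
    static void recover_or_save(Node *ni, int max_depth)
    {
        if (!recover(ni, 0, max_depth))
            printf("state-save %s\n", ni->name);
    }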
5.4 Putting it all together
In this section, we will summarize the detailed operations of the RCG algorithm presented
throughout Section 5.
To determine the intra-procedural control flow, the RCG algorithm uses predicate ex-
pressions which are determined for each edge coming to a confluence point (see Section 4.2).
Predicate expressions for forward edges coming to a confluence point are determined by using
the labels assigned to those edges (see Section 5.1). The predicate expressions for backward
edges coming to a confluence point, on the other hand, are determined by the loop counter
approach introduced in Section 5.2.
The RCG algorithm generates a RIG for an instruction α such that the RIG recovers the
variable(s) directly modified by α. In order to recover a directly modified variable, the RCG
algorithm uses a DAG (see Section 5.3.3). First, the RCG algorithm finds in the DAG the
node for the reaching definition of the directly modified variable (for the determination of
this node, see Section 5.3.2). Since α overwrites this definition, the node of this definition is
killed. Then, in the DAG, the RCG algorithm constructs nodes and edges for the operands
of α. Finally, the RCG algorithm tries to recover the killed node by using the other available
nodes in the DAG (i.e., the nodes that have been constructed for the instructions scanned
before α). For a loop, the RCG algorithm should not use the available nodes that are
constructed for the instructions outside of the loop. If the only available nodes that can be
used for the recovery of the killed node are the nodes that are constructed for the instructions
outside of the loop, the RCG algorithm postpones the recovery to the next iteration of the
loop provided that the total passes over the loop will not exceed three. If the loop has already
been traversed three times, the RCG algorithm generates a RIG which employs state saving
(see Section 4.3.1).
Let us illustrate the generation of an instruction-level reverse program with the example
PCFG shown in Figure 18.

          addi r2, r1, 8
    loop: lwz  r4, 0(r2)
          cmpi r4, 97
          blt  exit
          cmpi r4, 122
          bgt  exit
          subi r4, r4, 32
          stw  r4, 0(r2)
          addi r2, r2, 4
          b    loop
    exit:

Figure 18: An example PCFG ((#): a timestamp in the analysis).
Example 5.7 Instruction-level reverse program generation: Figures 19 and 20 show the renaming
table and the DAG, respectively, that are constructed after two passes over the loop body (excluding
the first pass over the whole program to generate the PCFG, the CFIs and the CG) in the PCFG of
Figure 18. The renaming table shows the analysis timestamps adjacent to a renamed value when that
renamed value is generated (timestamps are shown in parentheses in Figure 18). The timestamp value
increments by one after each instruction in a procedure/function is scanned. For clarity, the overwritten
entries are again shown in the renaming table (Figure 19).
As an example, consider the analysis point reached after scanning "lwz r4, 0(r2)" at timestamp '2'.
The analysis first finds the reaching definition of r4, r_4^0, by querying the renaming table fields which
correspond to the CFIs spanned by the incoming edge label [0,255]. Then, the newly generated value of
r4, r_4^1, is entered into the same fields according to the rule described in Section 5.3.2: the result can be
seen in all the r_4^1(2) entries in Figure 19. Next, a node for r_4^1 is constructed in the DAG and is connected
to the node m_0^0 (m0 designates the memory location at r1+8). Finally, r_4^0 should be recovered. Since
r_4^0 is the initial value of r4 coming into the procedure/function (an external input), r_4^0 has to be
recovered by state saving.
Therefore, r_4^0 can be recovered by the load instruction "lwz r4, mem2" where mem2 is the location
where r_4^0 is saved in F. However, since "lwz r4, mem2" is not an instruction within the loop, the
loop condition mentioned in Section 4.3.1 is violated (i.e., r4 is recovered only for the first iteration of
the loop); therefore, another pass over the loop body is necessary. When the analysis reaches the same
instruction in the next pass over the loop body, the loop-carried reaching definition of r4, r_4^2, can
be recovered from the available node m_0^1, but the node r_2^1, the other node m_0^1 is connected to,
is killed. However, condition (b) given in Section 5.3.3 holds
Figure 19: The renaming table for the PCFG of Figure 18 (records for r1, r2, r4, m0 and m1).
S: Annotation for the select operator (address annotations are not shown); the edges leaving the
S nodes are annotated with the predicates rLC == 0 and rLC > 0. The node r_2^2 is drawn as two
separate nodes for clarity (i.e., those two nodes are the same node).
Figure 20: The DAG for the PCFG of Figure 18.
for r_2^1 and thus r_2^1 can be recovered into a temporary register rt by using the available node r_2^2
while staying in the loop. The instruction for recovering r_2^1 will then be "subi rt, r2, 4" which extracts
r_2^1 out of the addition instruction "addi r2, r2, 4" (the instruction "addi r2, r2, 4" is found by the
address annotation on r_2^2). Now, condition (b) given in Section 5.3.3 holds for r_4^2 as well, and r_4^2 can
be recovered for the rest of the iterations of the loop by executing the instruction "lwz r4, 0(rt)." A
loop counter (rLC) inserted into the original code is used for differentiating between the loop iterations
as explained in Section 4.3.1. Similar steps are followed for the generation of the RIGs for the other
instructions as well.
After generating a RIG, the RCG algorithm connects the RIG to the previously generated RIGs by
the function Combine RIGs() (see lines 10, 20 and 27 of Listing 1). Figure 21 shows the modified code
on the left and the corresponding reverse code on the right. As explained before in Section 4.4, RIGs
are placed in bottom-up order, and at the boundaries of the BBs, the edges of the original PCFG are
simply inverted by generating the appropriate branch instructions in the reverse code. Consequently, a
join point of edges in the original PCFG typically becomes a fork point of edges in the reverse PCFG,
and vice versa. Note, however, that in this example, since the reverse of BB3 in Figure 21 happens to
be empty (since BB3 does not include any instruction which directly modifies a register or a memory
location), the inverted edges around the reverse of BB3 join at
Modified code (left):

          stw  r2, mem1
          addi r2, r1, 8
          li   rLC, 0
          stw  r4, mem2
    loop: lwz  r4, 0(r2)
          cmpi r4, 97
          blt  exit
          cmpi r4, 122
          bgt  exit
          subi r4, r4, 32
          stw  r4, 0(r2)
          addi r2, r2, 4
          addi rLC, rLC, 1
          b    loop
    exit:

Reverse code (right):

          subi rLC, rLC, 1
          subi r2, r2, 4
          addi rt, r4, 32
          stw  rt, 0(r2)
          addi r4, r4, 32
          cmpi rLC, 0
          bne  L1
          lwz  r4, mem2
          b    L2
    L1:   subi rt, r2, 4
          lwz  r4, 0(rt)
    L2:
          lwz  r2, mem1

bne: branch if not equal
Figure 21: The original PCFG (left) and the reverse PCFG (right).
the same point in the reverse PCFG. Therefore, these inverted edges are merged together into a single
edge. If this were not the case, a conditional branch instruction whose predicate is determined as
explained in Section 5.2 would be inserted at the end of the start block in the reverse code. □
6 Experimental Results
We tested the RCG algorithm on an evaluation board with a PowerPC (MPC860) proces-
sor. In order to test reverse execution in a debugging session, we implemented a low-level
debugger tool with a graphical user interface (GUI) that provides debugging capabilities
such as breakpoint insertion, single stepping, register and memory display (Figure 22). The
debugger runs on a PC with Windows 2000. The PC is connected to the PowerPC board
via a Background Debug Mode (BDM) interface [21].
Figure 22: The GUI of the debugger tool.
The benchmark programs we used for our experimentation are a Fibonacci number gen-
erator (FNG) with 100 iterations, a selection sort (SS) with 10 inputs, a 3 by 3 matrix
multiplication (MM) and a random number generator (RNG) with 100 iterations. FNG
generates a Fibonacci series which includes 100 numbers. FNG does not write the generated
numbers into memory, but calculates the numbers in a local variable. SS sorts 10 numbers
which are input to SS in an array. SS is an in-place sort algorithm. In other words, the
numbers in the input array are sorted within the input array. This means that SS does not
allocate any additional temporary storage to sort the input data. MM multiplies two 3 by 3
integer matrices that are input to MM as arrays and writes the resulting matrix into another
array. Finally, RNG generates 100 pseudo random numbers in a sequence. Similar to FNG,
RNG does not use main memory to keep the generated numbers. All of the benchmarks
are written in the C programming language. In order to compile the benchmarks for the
PowerPC 860, we used a compiler from Tasking, Inc. [27]. Note that we compiled each
benchmark using standard optimizations such as common subexpression elimination, con-
stant propagation, constant folding, dead code elimination, strength reduction and global
register allocation. Therefore, in our experimentation, the RCG algorithm generates an
instruction-level reverse program for each benchmark by using optimized assembly code as
input to the RCG algorithm. In this way, we also show that compiler optimizations do not
limit the applicability of the RCG algorithm. Table 1 depicts the size of each benchmark in
terms of the total number of lines of C and the total number of assembly instructions.
Table 1: The sizes of the benchmarks.
                         FNG   SS   MM   RNG
 #C lines                 12   16   18    14
 #assembly instructions   15   37   59    35
In order to compare the performance of the RCG algorithm against the previous state
saving techniques, we had to expand the previous techniques to support instruction-level
reverse execution. Some of the previous techniques introduced in Section 3 are not applicable
for instruction-level reverse execution at all (e.g., program animation does not work on compiled
code); and the applicable ones, once expanded to support instruction-level reverse execution,
are converted into either saving the modified processor state before each instruction (i.e.,
incremental state saving) or saving the modified processor state before each destructive
instruction (i.e., incremental state saving for destructive instructions).
Tables 2 and 3 show memory and time overhead results of the RCG algorithm, the
ordinary incremental state saving (ISS) and incremental state saving for only destructive
instructions (ISSDI). The memory overhead measurements were performed in the following
way: for ISS and ISSDI, we calculated the program points where state saving is needed
in each benchmark, and we instrumented each benchmark with memory store instructions
which save state at the calculated points. Then, we applied the RCG algorithm to each orig-
inal benchmark in order to obtain the modified benchmarks instrumented with state saving
instructions at necessary points for the RCG algorithm as well. For the three sets of the
Table 2: Memory overheads.
                FNG       SS      MM      RNG
 ISS            1.6 kB    1.9 kB  1.9 kB  8.8 kB
 ISSDI          1.2 kB    1.5 kB  1.1 kB  5.6 kB
 RCG algorithm  0.004 kB  0.6 kB  0.2 kB  0.8 kB
Table 3: Time overheads.
                FNG      SS       MM       RNG
 ISS            109 %    107.3 %  132.4 %  146.4 %
 ISSDI          85.4 %   90.7 %   84.3 %   100.8 %
 RCG algorithm  13.4 %   38.9 %   28.6 %   20.6 %
instrumented benchmarks (i.e., for the benchmarks instrumented according to ISS, ISSDI
and the RCG algorithm), we also inserted just after each added memory store instruction
an addition instruction which increments a counter C. An addition instruction inserted at
a location increments C by the number of bytes the memory store instruction just before
that addition instruction is storing. For example, if a memory store instruction stores four
bytes, C is incremented by four. On the other hand, if a memory store instruction stores two
bytes, C is incremented by two. In this way, we could calculate the total number of bytes
the memory store instructions required for state saving. For time overhead measurement,
we used the decrementer counter of the PowerPC 860 processor (the PowerPC 860 provides
a decrementer counter which is decremented by one at a certain number of processor cycles).
First, we ran each benchmark without any instrumentation and noted the execution time (in
number of processor cycles) using the change in the decrementer counter. Then, we ran mod-
ified benchmarks instrumented only with the necessary state saving instructions and noted
the execution time (in number of processor cycles) in the same way. Finally, we calculated
the time overhead for each benchmark by taking the differences between the noted execution
times with instrumentation and the noted execution times without instrumentation.
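For instance, a four-byte state-saving store and its counter update would look as follows (the
counter register rC is an illustrative name):

    stw  r4, mem2   //state saving store (4 bytes)
    addi rC, rC, 4  //increment the byte counter C by 4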
Note that as mentioned in Section 4.3, the programmer chooses the total number of
additional recoveries the RCG algorithm is allowed to make to recover a variable. For the
measurements, we specified this number as three. Therefore, if the RCG algorithm cannot
recover a variable even after recovering three additional variables, the RCG algorithm saves
state to recover the variable.
Figures 23 and 24 show memory and time overhead comparisons, respectively, between
the RCG algorithm, ISS and ISSDI. The results indicate that the RCG algorithm achieves
from 3.17X to 400X and from 2.5X to 300X reduction in memory overhead as compared
to ISS and ISSDI, respectively (Figure 23). Furthermore, the RCG algorithm achieves an
average of 4.1X to 5.7X reduction in time overhead as compared to ISS and ISSDI, respectively
(Figure 24).

Figure 23: Memory overhead comparison.

Figure 24: Time overhead comparison.

For the RCG algorithm, the relatively higher
memory and time overheads that result from the measurements with SS as compared to
measurements with the other benchmarks are mainly due to ambiguous memory stores SS
uses. These ambiguous memory stores happen because the individual array elements that SS
overwrites during a sort operation depend on the initial ordering of the array elements. On
the other hand, memory and time overheads encountered with the RCG algorithm during
the execution of FNG only come from the loop counter inserted within the FNG loop. For
this reason, a much bigger (300X to 400X) overhead reduction is achieved with the RCG
algorithm as compared to previous approaches.
7 Conclusion
In this report, a new reverse execution methodology for programs is introduced. To realize
reverse execution, the methodology generates a reverse program from an input program by
a static analysis at the assembly level. The methodology is new because state saving can
be largely avoided even with programs including many destructive instructions. This cuts
down memory and time overheads introduced by state saving during forward execution of
programs. Moreover, the methodology provides instruction-by-instruction reverse execution
at the assembly instruction level without ever requiring any forward execution of the program.
In this way, a program can be run backwards to a state as close as one assembly instruction
before the current state.
Since generation of the reverse program is performed from the assembly instructions of
a program, the methodology introduced in this report provides reverse execution capability
for programs without source code. Also, since both the forward code and the reverse code
are executed in native machine instructions, these executions can be performed at full speed
of the underlying hardware.
References
[1] A. Adl-Tabatabai and T. Gross. Detection and recovery of endangered variables caused
by instruction scheduling. In Proceedings of the ACM SIGPLAN’93 Conference on
Programming Language Design and Implementation, pages 13–25, 1993.
[2] H. Agrawal, R. A. DeMillo, and E. H. Spafford. An execution backtracking approach
to program debugging. IEEE Software, 8(3):21–26, May 1991.
A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools.
Addison-Wesley, MA, 1986.
[4] D. F. Bacon and S. C. Goldstein. Hardware-assisted replay of multiprocessor programs.
Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, pub-
lished in ACM SIGPLAN Notices, 26(12):194–206, December 1991.
[5] B. K. Bhargava. Concurrency control in database systems. Knowledge and Data Engi-
neering, 11(1):3–16, 1999.
[6] M. R. Birch, C. M. Boroni, F. W. Goosey, S. D. Patton, D. K. Poole, C. M. Pratt, and
R. J. Ross. Dynalab. ACM SIGCSE Bulletin, 27(1):29–33, March 1995.
[7] C. Carothers, K. Perumalla, and R. Fujimoto. Efficient optimistic parallel simulations
using reverse computation. ACM Transactions on Modeling and Computer Simulation,
9(3), July 1999.
[8] K. M. Chandy and C. V. Ramamoorthy. Rollback and recovery strategies for computer
programs. IEEE Transactions on Computers, 21(6):546–556, June 1972.
[9] P. Crescenzi, C. Demetrescu, I. Finocchi, and R. Petreschi. Reversible execution and
visualization of programs with Leonardo. Journal of Visual Languages and Computing
(JVLC), 11(2):125–150, April 2000.
[10] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently com-
puting static single assignment form and the control dependence graph. ACM Trans-
actions on Programming Languages and Systems, 13(4):451–490, October 1991.
[11] S. I. Feldman and C. B. Brown. Igor: A system for program debugging via reversible
execution. In Workshop on Parallel and Distributed Debugging, pages 112–123, 1988.
[12] J. Fleischmann and P. A. Wilsey. Comparative analysis of periodic state saving tech-
niques in time warp simulators. In Proceedings of the Ninth Workshop on Parallel and
Distributed Simulation, pages 50–58, 1995.
[13] R. W. Floyd. Nondeterministic algorithms. Journal of the ACM, 14(4):636–644, October
1967.
[14] R. M. Fujimoto. Time warp on a shared memory multiprocessor. Transactions of the
Society for Computer Simulation International, 6(3):211–239, July 1989.
[15] F. Gomes. Optimizing Incremental State Saving and Restoration. PhD thesis, Depart-
ment of Computer Science, University of Calgary, 1996.
[16] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan
Kaufmann, 1993.
[17] J. Hennessy. Symbolic debugging of optimized code. ACM Transactions on Program-
ming Languages and Systems, 4(3):323–344, July 1982.
[18] D. A. Jefferson. Virtual time. ACM Transactions on Programming Languages and
Systems, 7(3):404–425, July 1985.
[19] Y.-H. Lee and K. G. Shin. Design and evaluation of a fault tolerant multiprocessor using
hardware recovery blocks. IEEE Transactions on Computers, 33(2):113–124, February
1984.
[20] B. P. Miller and J. Choi. A mechanism for efficient debugging of parallel programs.
In Proceedings of the SIGPLAN’88 Conference on Programming Language Design and
Implementation, pages 135–144, 1988.
[21] Motorola Inc. MPC860 PowerQUICC Users Manual, 1998.
http://e-www.motorola.com/brdata/PDFDB/docs/MPC860UM.pdf.
[22] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann,
San Francisco, CA, 1997.
[23] R. H. B. Netzer and M. H. Weaver. Optimal tracing and incremental reexecution for
debugging long-running programs. In Proceedings of the ACM SIGPLAN’94 Conference
on Programming Language Design and Implementation, pages 313–325, 1994.
[24] D. Z. Pan and M. A. Linton. Supporting reverse execution of parallel programs. In
Workshop on Parallel and Distributed Debugging, pages 124–129, 1988.
[25] R. Sosic. History cache: Hardware support for reverse execution. Computer Architecture
News, 22(5):11–18, December 1994.
[26] S. Stolper. Questions and answers about the Mars Pathfinder, October 1997.
http://www.quest.arc.nasa.gov/mars/ask/about-mars-path/.
[27] Tasking Inc. Tasking C/C++ Compiler Datasheet, 2001.
http://www.tasking.com/products/PPC/ppc-ds21.pdf.
[28] D. West and K. S. Panesar. Automatic incremental state saving. In Proceedings of the
Tenth Workshop on Parallel and Distributed Simulation, pages 78–85, 1996.
[29] R. Wismuller. Debugging of globally optimized programs using data flow analysis. In
Proceedings of the ACM SIGPLAN’94 Conference on Programming Language Design
and Implementation, pages 278–289, 1994.
[30] M. V. Zelkowitz. Reversible Execution as a Diagnostic Tool. PhD thesis, Department
of Computer Science, Cornell University, 1971.
APPENDIX
We have already stated that the generated reverse code only recovers memory or register
values that are directly modified by the instructions and have explained how this is performed
in the report. The remaining memory and register values that are not recovered by the
generated reverse code are those values that are indirectly modified by the instructions. In
this Appendix, we will answer how we take care of the effects of indirectly modified memory
and register values.
A value modified by an instruction is directly modified if the memory location or the
register holding the value appears as an operand of the instruction; otherwise, the value
modified by the instruction is indirectly modified. As an example, while the target operand
of an "xor" instruction is directly modified by the "xor" instruction, a branch condition
register may be indirectly modified by a "compare" instruction even though the branch condition
register is not an operand of the "compare" instruction.
Let us designate the set of instructions of a processor P with I. We define a set, say E
(E ⊂ I), of instructions of P such that the outcome of an instruction in E does not depend
on any indirectly modified memory location or register but only on that instruction’s source
operands which are directly modified by other instructions in I. The instructions outside
of E, forming the set E′ (E′ ⊂ I), on the other hand, are affected by indirectly
modified memory and/or register values.
Example A.1 Consider ordinary integer addition and conditional branch instructions. The outcome
of an ordinary integer addition instruction such as “add r1, r2, r3” in a program is only affected by the
values of r2 and r3, both of which are directly modified by other instructions in the program. Therefore,
an “add” instruction is an element of E. On the other hand, the outcome of a conditional branch
instruction such as “bne target” may depend on the value of a branch condition register which might
be indirectly modified by a compare instruction. Therefore, a conditional branch instruction may be
included in E′. □
Thus, we can omit the reverse code generation for the recovery of indirectly modified
values if we can correctly undo the instructions in E′.
Therefore, the instructions in E′ are specially treated as follows: Let us assume that
the outcome of an instruction α depends on the value V of an indirectly modified memory
location or register. Also, assume that an instruction, say β, computes V indirectly. Then,
whenever α is to be reverse executed alone, the debugger tool reevaluates V by re-executing
the instruction β in the background. If the instruction β has dependencies on other memory
and/or register values, those register and memory values are recovered into temporaries prior
to reevaluation of V . Let us explain this in the following example:
Example A.2 Re-execution of a compare instruction: Consider the following instruction sequence
(the operands shown are representative):

         (1)
    cmpi r4, 0      //compare instruction
         (2)
    bne  target     //conditional branch instruction
         (3)
The outcome of the conditional branch instruction depends on the value of the branch condition
register which does not appear as a source operand of the conditional branch instruction and which
is indirectly modified by the compare instruction. Whenever the programmer reverse executes the
program from point (3) to point (2) (i.e., the conditional branch instruction is reverse executed but the
compare instruction is not), the debugger tool re-executes the compare instruction in the background.
This guarantees that when the program is forward executed from point (2) on, the outcome of the
conditional branch instruction will always be the same, even if the value of the branch condition register
has been modified prior to reverse execution. □
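In debugger terms, this special treatment can be sketched as follows. The routine below is our
illustration only: read_insn, is_compare and exec_on_target are assumed back-end primitives
(e.g., over the BDM interface), the backward linear scan for the compare instruction is a
simplification that ignores control flow, and the recovery of β's own dependencies into
temporaries is omitted.

    typedef unsigned long addr_t;

    /* Assumed debugger back-end primitives. */
    extern unsigned read_insn(addr_t pc);        /* fetch an instruction word    */
    extern int      is_compare(unsigned insn);   /* does insn indirectly set the */
                                                 /* branch condition register?   */
    extern void     exec_on_target(unsigned insn);

    /* Called after a conditional branch at branch_pc has been reverse executed
     * alone: re-execute the compare instruction beta that computed the branch
     * condition register, so that a later forward run from this point takes
     * the same direction as before. */
    static void reevaluate_condition(addr_t branch_pc)
    {
        addr_t pc = branch_pc;
        while (pc > 0 && !is_compare(read_insn(pc)))
            pc -= 4;                  /* PowerPC: fixed 4-byte instructions     */
        if (pc > 0)
            exec_on_target(read_insn(pc)); /* re-execute beta in the background */
    }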
