Abstract. Translation validation (TV) is the process of proving that the execution of a translator has generated an output that is a correct translation of the input. When applied to optimizing compilers, TV is used to prove that the generated target code is a correct translation of the source program being compiled. This is in contrast to verifying a compiler, i.e. ensuring that the compiler will generate correct target code for every possible source program, which is generally a far more difficult endeavor. This paper reviews the TVOC framework developed by Amir Pnueli and his colleagues for translation validation for optimizing compilers, where the program being compiled undergoes substantial transformation for the purposes of optimization. The paper concludes with a discussion of how recent work on the TV of software pipelining by Tristan & Leroy can be incorporated into the TVOC framework.
Introduction
Verifying a compiler to ensure that it will produce correct target code every time it compiles a source program is a very difficult undertaking. First, compilers are large pieces of software and, given the current state of the art, verifying large pieces of software is still generally computationally intractable. Second, compilers tend to undergo updates and new releases, which would require re-verification each time.
As a solution to the difficulty of verifying that a compiler will produce correct target code for any possible source program, Amir Pnueli and his colleagues [10, 9, 11, 8, 12] proposed, starting in 1998, translation validation (TV), which is the process of verifying, for a given run of the compiler, that the target code produced during the run is a correct translation of the source program being compiled. Initially, the TV work was performed for a compiler that translated SIGNAL, a reactive language with very simple program structure (a single outer loop), into C. This work was followed up by Pnueli and various colleagues (including this author), as well as by many other researchers, who developed TV methods for industrial-strength optimizing compilers (see, e.g., [7, 2, 15, 17], among too many to list).
Performing TV for optimizing compilers is especially challenging because the optimizations performed by the compiler can significantly change the structure of a program. The TV for optimizing compilers work performed by Pnueli and colleagues resulted in a framework and implementation called TVOC [2], for Translation Validation for Optimizing Compilers, which partitions compiler optimizations into two categories:
- Structure-preserving optimizations: These are optimizations that do not radically change the structure of the program, so that a mapping between states of the target program and states of the source program is still possible. Examples of such optimizations include the so-called "global optimizations", such as dead-code elimination, constant folding and propagation, and common subexpression elimination.
- Structure-modifying optimizations: These are optimizations that radically change the structure of a program (or at least parts of the program, such as loops), so that there is no useful mapping between states of the target program and states of the source program. Examples of these optimizations include loop optimizations such as loop interchange, tiling, reversal, fusion, and distribution.
These two categories are treated differently within the TVOC framework. In both cases, based on the source and target programs and the optimizations performed, the TVOC system generates verification conditions that are then checked by a theorem prover. The theorem prover that TVOC uses is CVC [14, 1], an automatic theorem prover for proving the validity of first-order formulas, with a large number of built-in theories that are useful for TV (e.g. integers, arrays, and bitvectors). The latest instantiation of CVC is CVC3 [1]. If CVC determines that the verification conditions that TVOC generates are valid, then the optimizations applied by the compiler were correct. Otherwise, the TVOC system indicates that the compilation was invalid.
Figure 1 shows a simple schematic of the TVOC system, as applied to the Intel Open Research Compiler a few years ago. After parsing and type checking (which the TVOC system does not validate), the compiler performs loop optimizations, global optimizations, and some machine-dependent optimizations prior to code generation. Each optimization phase comprises one or more IR-to-IR transformations, taking the program in an intermediate representation (IR) and producing a new program represented in the same IR language. Based on these transformations, TVOC produces the set of verification conditions that are fed to the CVC theorem prover.
Figure 2 shows a slightly more detailed schematic of the TVOC system. There are separate components of TVOC for validating loop optimizations (structure-modifying) and for validating global optimizations (structure-preserving). Loop optimizations are generally performed earlier in the compilation process than global optimizations, since loop optimizations often expose opportunities for global optimization. In any case, these optimization processes tend to be iterative. The input (source) and output (target) of each optimization are fed to the appropriate module (loop TV or global TV) of TVOC, which generates the verification conditions to be fed to CVC.
We have recently begun to extend the TVOC framework, although not yet the implementation, to handle machine-dependent optimizations. One such optimization that has not been handled by TVOC, although it was addressed in other work by Pnueli, is software pipelining. Recent work by Tristan & Leroy [16] on validating software pipelining using symbolic evaluation is being adapted for the TVOC framework (i.e. using CVC). We describe here how software pipelining fits into the TVOC framework.
Validating global optimizations in TVOC
Global optimizations are structure preserving in the sense that they preserve the structure of a program sufficiently to permit a mapping between states of the target program (i.e. the IR representation of the program after an optimization) and states of the source (i.e. the IR representation of the program before the optimization). Although a detailed explanation of how validation of global optimizations is performed in TVOC is beyond the scope of this paper, we provide a brief description here. We refer the reader to [18] for more details and examples.
In order to validate a translation from a source program S to a target program T , where the transformations applied to S are structure-preserving, TVOC represents each program as a transition system [10] (TS), which is a state machine consisting of a set V of state variables, a set O ⊆ V of observable variables, an initial condition Θ characterizing the initial states of the system, and a transition relation ρ relating each state to its possible successors. The variables are typed, and a state of a TS is a type-consistent interpretation of the variables. A computation of a TS is defined to be a maximal finite or infinite sequence of states starting with a state that satisfies the initial condition such that every two consecutive states are related by the transition relation.
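The definition above can be made concrete with a small data structure. The following is only an illustrative sketch, not TVOC's internal representation: the class name, the field names, and the `counter` example are all assumptions introduced here, and the checker recognises only prefixes of computations because maximality is not tested.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

State = Dict[str, int]  # a state interprets every variable in V


@dataclass(frozen=True)
class TransitionSystem:
    variables: FrozenSet[str]             # V
    observables: FrozenSet[str]           # O, a subset of V
    initial: Callable[[State], bool]      # Theta, the initial condition
    rho: Callable[[State, State], bool]   # transition relation


def is_computation_prefix(ts: TransitionSystem, states: List[State]) -> bool:
    """Check the initial condition and that consecutive states are rho-related.

    Maximality is not checked, so this only recognises prefixes.
    """
    if not states or not ts.initial(states[0]):
        return False
    # Every state must interpret exactly the variables in V.
    if any(set(s) != set(ts.variables) for s in states):
        return False
    return all(ts.rho(s, t) for s, t in zip(states, states[1:]))


# Hypothetical example: a counter that starts at 0 and increments x.
counter = TransitionSystem(
    variables=frozenset({"x"}),
    observables=frozenset({"x"}),
    initial=lambda s: s["x"] == 0,
    rho=lambda s, t: t["x"] == s["x"] + 1,
)
```

Representing Θ and ρ as predicates mirrors the logical definition; a validator would instead represent them as formulas that a prover can manipulate.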
In order to establish that $P_T$, the TS representing the target program T, is a correct translation of $P_S$, the TS representing the source program S, we use a proof rule, Val, which is inspired by the computational induction approach [3], originally introduced for proving properties of a single program. Rule Val provides a proof methodology by which one can prove that one program refines another. This is achieved by establishing a control mapping from target to source locations, a data abstraction mapping from source variables to expressions over the target variables, and proving that these abstractions are maintained along basic execution paths of the target program.
In Val, each TS is assumed to have a cut-point set, i.e., a set of blocks that includes all initial and terminal blocks, as well as at least one block from each of the cycles in the program's control flow graph. A simple path is a path connecting two cut-points and containing no other cut-point as an intermediate node. For each simple path, we can (automatically) construct the transition relation of the path. Typically, such a transition relation contains the condition that enables this path to be traversed and the data transformation effected by the path.
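To make the notion of a simple path concrete, here is a minimal sketch that enumerates the simple paths of a control flow graph for a given cut-point set. The graph encoding, the function name, and the four-block example are assumptions made for illustration; TVOC's own path construction is not shown in this paper.

```python
from typing import Dict, List, Set

CFG = Dict[str, List[str]]  # block name -> successor block names


def simple_paths(cfg: CFG, cut_points: Set[str]) -> List[List[str]]:
    """Enumerate paths that start and end at cut-points with none in between."""
    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        for succ in cfg.get(path[-1], []):
            if succ in cut_points:
                paths.append(path + [succ])
            elif succ in path:
                # A cycle that avoids every cut-point: the set is not valid.
                raise ValueError(f"cycle through {succ} has no cut-point")
            else:
                extend(path + [succ])

    for start in sorted(cut_points):
        extend([start])
    return paths


# Hypothetical CFG: entry -> head; head -> body -> head; head -> exit.
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
paths = simple_paths(cfg, {"entry", "head", "exit"})
```

Because every cycle must contain a cut-point, the search through non-cut-point blocks always terminates; the `ValueError` branch flags a cut-point set that violates this requirement.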
Rule Val constructs a set of verification conditions, one for each simple target path, which together constitute an inductive proof of the correctness of the translation between source and target. Roughly speaking, each verification condition states that if the target program can execute a simple path, starting with some conditions correlating the source and target programs, then at the end of the execution of the simple path, the conditions correlating the source and target programs still hold. The conditions consist of the control mapping, the data mapping, and, possibly, some invariant assertion holding at the target code.
Validating loop optimizations
The Val rule discussed above relies on there being a mapping between the states of the source and target programs. However, there is a class of loop optimizations performed by optimizing compilers that modify the structure of loops to such an extent that no such mapping is possible. Thus, Pnueli and his colleagues, including this author, developed and implemented in TVOC a method for validating loop optimizations that does not rely on such a mapping. We describe this method briefly here, and refer the reader to [2] for details.
The loop optimizations that TVOC handles fall under the category of reordering transformations, which are transformations that change the order of execution of statements in the body of a loop, but do not change the number of times each statement is executed. Reordering transformations cover many of the loop optimizations performed by optimizing compilers, including fusion, distribution, reversal, interchange, and tiling.
To illustrate TVOC's validation of loop optimizations, we consider loop interchange. The loop interchange optimization reorders the nesting of a nested loop. Figure 3 shows an example of a loop interchange on a doubly-nested loop. For this example, the transformation may provide several performance benefits. First, since the expression Y[i2] is loop invariant in the inner loop of the transformed code, the computation of the address denoted by Y[i2] and the fetching of its value can be moved outside the inner loop. Second, if the array A is arranged in row-major form, where adjacent elements in a row occupy consecutive locations in memory, then cache performance is likely to be improved. The illustration below shows the transformation of the access pattern over the array A caused by the interchange.
In order to define a single rule for validating all reordering loop transformations, we represent a loop in the form
$\mathbf{for}\ i \in I\ \mathbf{by} \prec_I \mathbf{do}\ B(i)$
where $i = (i_1, \ldots, i_m)$ is the list of nested loop indices, $I$ is the set of the values assumed by $i$ through the different iterations of the loop, and $B$ represents the entire body of the loop. The set $I$ can be characterized by a set of linear inequalities. For example, for the above loop, $I$ is defined by
The relation $\prec_I$ is the ordering by which the various points of $I$ are traversed. For example, for the loop above, this ordering is the lexicographic order on $I$.
In general, a loop transformation has the form:
$\mathbf{for}\ i \in I\ \mathbf{by} \prec_I \mathbf{do}\ B(i) \;\Longrightarrow\; \mathbf{for}\ j \in J\ \mathbf{by} \prec_J \mathbf{do}\ B(F(j))$
Such a transformation may change the domain of the loop indices from $I$ to $J$, change the loop indices from $i$ to $j$, and possibly introduce an additional linear transformation in the loop's body, changing it from the source loop body $B(i)$ to the target body $B(F(j))$. The rule used in TVOC to validate loop transformations is the Permute rule shown in Figure 4, where $F$ is a bijection (i.e. it is one-to-one and onto) mapping iterations in the transformed loop back to iterations in the original loop.
Fig. 4: Permutation Rule Permute for reordering transformations
Intuitively, the Permute rule says that a reordering transformation is correct if the following holds: whenever the transformation switches the relative order of two iterations i 1 and i 2 between the source and target code, executing the body B in iteration i 1 followed by executing B in iteration i 2 is equivalent to executing B in iteration i 2 followed by executing B in iteration i 1.
In order to apply rule Permute to a given case, it is necessary to identify the function $F$ (and $F^{-1}$) and verify that the antecedent of rule Permute is satisfied. The identification of $F$ can be provided by the compiler, once it determines which of the relevant loop optimizations it chooses to apply. Intel's ORC compiler generates a file containing a description of the loop optimizations applied in the current phase of optimization. TVOC extracts this information (identified as "optimization spec" in Figure 2), verifies that the optimized code has resulted from the indicated optimization, and constructs the verification conditions. These conditions are then passed to CVC, which checks them automatically.
Consider the interchange example shown in Figure 3 . The loop interchange transformation for that example can be characterized as follows:
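and $\prec_I$ and $\prec_J$ are the lexicographic orderings on their respective iteration spaces. The functions $F$ and $F^{-1}$ associated with loop interchange simply swap the two loop indices: $F(j_1, j_2) = (j_2, j_1)$ and $F^{-1}(i_1, i_2) = (i_2, i_1)$.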
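In order to determine if loop interchange is valid on the example loop, the definitions of $I$, $J$, $\prec_I$, $\prec_J$, $F$, $F^{-1}$, and the loop body $B$ are plugged into the antecedent of the Permute rule, namely (writing $\mathbf{i}_1$ and $\mathbf{i}_2$ for two iterations, i.e. index vectors)
$\forall\, \mathbf{i}_1, \mathbf{i}_2 \in I :\ \mathbf{i}_1 \prec_I \mathbf{i}_2 \,\wedge\, F^{-1}(\mathbf{i}_2) \prec_J F^{-1}(\mathbf{i}_1) \;\Rightarrow\; B(\mathbf{i}_1); B(\mathbf{i}_2) \sim B(\mathbf{i}_2); B(\mathbf{i}_1)$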
The resulting formula is then fed to CVC to determine if it is valid. If it is valid, then the loop interchange optimization is correct for this example. For those cases where the compiler does not indicate the loop transformations that were applied, TVOC uses a set of heuristics to figure out which transformations were used.
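The commutation condition behind Permute can be explored on small examples by brute force. The sketch below is not how TVOC works (TVOC asks CVC to prove the condition for all iterations and all states); it merely tests the condition on one concrete initial state over a bounded rectangular iteration space. The two loop bodies, the bounds, and all function names are hypothetical and are not taken from Figure 3.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Index = Tuple[int, int]
State = Dict[Index, int]
Body = Callable[[Index, State], None]  # B(i): mutates the state in place


def f_inv(i: Index) -> Index:
    """F^-1 for interchange: maps a source iteration to its target iteration."""
    return (i[1], i[0])


def run(body: Body, order: Iterable[Index], init: State) -> State:
    """Execute the body for each iteration in the given order on a copy of init."""
    state = dict(init)
    for i in order:
        body(i, state)
    return state


def permute_premise_holds(body: Body, n: int, m: int, init: State) -> bool:
    """Test the commutation premise on one concrete initial state."""
    space = list(product(range(n), range(m)))  # the iteration space I
    for p, q in product(space, repeat=2):
        # Tuples compare lexicographically, matching both loop orders here.
        if p < q and f_inv(q) < f_inv(p):
            if run(body, [p, q], init) != run(body, [q, p], init):
                return False
    return True


def independent(i: Index, a: State) -> None:
    a[i] = a.get(i, 0) + i[1]  # touches only its own cell


def dependent(i: Index, a: State) -> None:
    a[i] = a.get((i[0] - 1, i[1] + 1), 0) + 1  # reads a diagonal neighbour
```

With these definitions, `permute_premise_holds(independent, 3, 3, {})` succeeds, while the `dependent` body fails: iterations (0, 1) and (1, 0) are swapped by the interchange, and the second reads the cell written by the first.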
Validating Software Pipelining
Machine-dependent optimizations, such as software pipelining, are not yet handled by the TVOC implementation. In this section, we discuss how TV for software pipelining can be incorporated into TVOC, based on recent work by Tristan & Leroy [16]. We start, however, with an intuitive explanation of the software pipelining optimization.
A gentle introduction to software pipelining
Software pipelining [13, 5] refers to a class of optimizations that improve program performance by overlaying iterations of a loop, essentially allowing an iteration to start before the previous iteration has completed, even if there are dependences between iterations that prohibit the iterations from executing fully in parallel. Software pipelining can be viewed schematically as:
The benefits of software pipelining include 1) exploiting instruction-level parallelism by allowing instructions from different iterations to execute simultaneously on VLIW or superscalar machines, 2) filling delay slots in one iteration with instructions from other iterations, and 3) other improvements (register allocation, cache performance, etc.) that can be made during instruction scheduling by being able to select among instructions from several overlapping iterations.
Although software pipelining generally occurs at the instruction-scheduling phase of compilation, where the optimization is applied to machine instructions, for clarity we will show the examples in this paper in an intermediate representation (IR) that is fairly close to the source.
Consider the following simple loop:
We assume that the load instruction, x = a[i-3], takes an extra cycle due to the memory fetch; thus a NOP ("no-op") is inserted to ensure that x is not referenced too early. In the sequential execution of the loop, a cycle is wasted by the NOP during every iteration.
The figure below illustrates the execution of overlaid iterations in a software pipeline. These iterations continue executing as long as specified by the loop bounds.
As can be seen by close examination of the above figure, the actual pipeline code is produced by replicating the body of the loop four times, creating a total of four instances of the variables i and x, and then overlaying the four iterations.
During execution, these four iterations are repeatedly executed, as implied by the figure above.
In a software pipeline, such as the one illustrated above, the instructions appearing on the same horizontal level, despite being from different iterations, can be executed simultaneously or in any order chosen by the compiler. Thus, although the NOP appears in the figure, it does not consume a cycle since there are other instructions that can be executed in that same cycle.
Upon further examination of the above figure, it can be seen that horizontal blocks of code are repeated in the execution of the overlaid iterations. This is shown in the figure below, where the code within the first large rectangle is repeated in the second rectangle (which is only partially visible) and many times subsequently.
The horizontal block of code within the large rectangle is called the "kernel" of the pipeline. Only one instance of the kernel code is actually generated, and is then executed in a loop. The figure below illustrates the repeated execution of the kernel code, preceded by a set of instructions called the "prologue" and followed by the set of instructions called the "epilogue". The prologue can be thought of as a "ramping up" of the pipeline and the epilogue as a "ramping down" of the pipeline.
For clarity, the above figure doesn't show the number of times that the kernel is executed. It can be seen from inspection that, together, the prologue and epilogue correspond to executing three iterations of the original loop body (note the three assignments to x, the three writes to a[ ], etc.) and that the kernel code corresponds to four iterations of the original loop body. Thus, if the original loop executes N times, it must be the case that N is at least 3, since the prologue and epilogue always execute once the pipelined code is entered. Furthermore, the value of N − 3, i.e. the number of iterations of the original loop that are executed by iterating over the kernel, must be divisible by four, since each iteration of the kernel corresponds to four iterations of the original loop.
Using this logic, and the notation from [16] , it is clear that, in general, if the prologue and epilogue together execute µ iterations of the original loop and each iteration of the kernel executes δ iterations of the original loop, then we require that N ≥ µ and that (N − µ) is a multiple of δ. To enforce these requirements, the pipeline code is generally preceded by a conditional that tests the value of N , unless N can be determined statically. If N < µ, then the pipeline code will not be entered at all. If N −µ is not a multiple of δ, then the appropriate number (i.e. (N −µ) MOD δ ) of iterations of the loop are peeled off and executed separately, so that the remaining iterations of the loop can be pipelined.
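The arithmetic in this paragraph can be sketched as a small planning function. This is an illustration only: `plan_pipeline` and `PipelinePlan` are hypothetical names, and the convention that all N iterations count as "peeled" when the pipeline is skipped is an assumption made here, not something stated in [16].

```python
from typing import NamedTuple


class PipelinePlan(NamedTuple):
    use_pipeline: bool
    peeled: int        # iterations executed by the original, unpipelined body
    kernel_trips: int  # number of times the kernel is executed


def plan_pipeline(n: int, mu: int, delta: int) -> PipelinePlan:
    """Split n loop iterations into peeled iterations and kernel trips."""
    if n < 0 or mu < 0 or delta <= 0:
        raise ValueError("need n >= 0, mu >= 0 and delta > 0")
    if n < mu:
        # Too few iterations: the pipelined code is not entered at all.
        return PipelinePlan(False, n, 0)
    peeled = (n - mu) % delta
    return PipelinePlan(True, peeled, (n - mu - peeled) // delta)
```

For the running example (µ = 3, δ = 4), `plan_pipeline(12, 3, 4)` peels one iteration and runs the kernel twice, which accounts for 1 + 3 + 2·4 = 12 iterations of the original loop.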
Validating a software pipeline
In [6] , Pnueli and Leviathan described a method for validating software pipelining using an extension of the Val rule described above. This work used a mapping between transition systems resulting in a fairly complicated method.
In a recent POPL paper [16], Tristan & Leroy describe a less complicated approach, defining a simple rule to be satisfied in order to deem that the translation from the original loop into a pipeline is correct. As their paper discusses, given a source loop with a body B that is translated into the pipeline consisting of a prologue P, a kernel S, and an epilogue E, where E and P together represent µ iterations of B and S represents δ iterations of B, the translation is correct iff
$B^{N} \sim P; S^{(N-\mu)/\delta}; E$
That is, executing the body B of a loop N times is equivalent to executing the prologue P, followed by iterating over the kernel S for (N − µ)/δ times, followed by the epilogue E. As discussed above, it is assumed (and enforced by other code) that N ≥ µ and that (N − µ) is a multiple of δ. Tristan & Leroy noted, though, that without knowledge of N, which is a run-time value, proving the above equivalence for all possible N is very difficult. Thus, they proposed a simple rule that is sound but not complete: if the rule is satisfied, then the translation is correct, but there may be correct translations that do not satisfy the rule. However, their paper states that such cases don't arise in practice. The Tristan & Leroy rule can be specified as follows: Suppose a source loop whose body is B is translated into the pipeline consisting of a prologue P, a kernel S, and an epilogue E, where E and P together represent µ iterations of B and S represents δ iterations of B. Then,
$(B^{\mu} \sim P; E) \,\wedge\, (E; B^{\delta} \sim S; E) \;\Rightarrow\; B^{N} \sim P; S^{(N-\mu)/\delta}; E$
where it is assumed that N ≥ µ and (N − µ) is a multiple of δ.
As shown in their POPL paper, the Tristan & Leroy rule is easy to prove inductively (once a framework, such as their symbolic evaluation, is developed for reasoning about equivalence, which is not so easy). Informally, the induction proceeds as follows. Since N − µ is divisible by δ, N = µ + mδ for some m ≥ 0. The induction is on m.
Base Case m = 0:
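The base case and the inductive step can be written out as follows. This is a reconstruction from the two premises of the rule, $B^{\mu} \sim P; E$ and $E; B^{\delta} \sim S; E$, and it assumes that $\sim$ is transitive and is preserved by sequential composition; the formal argument is given in [16]. The block uses the `align*` environment from the amsmath package.

```latex
\begin{align*}
\text{Base case } (m = 0):\quad
  & B^{\mu} \sim P; E = P; S^{0}; E
  && \text{(first premise)} \\
\text{Inductive step}:\quad
  & B^{\mu + (m+1)\delta} = B^{\mu + m\delta}; B^{\delta} \\
  & \sim P; S^{m}; E; B^{\delta}
  && \text{(induction hypothesis)} \\
  & \sim P; S^{m}; S; E
  && \text{(second premise)} \\
  & = P; S^{m+1}; E
\end{align*}
```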
Thus, for any m, $B^{\mu + m\delta} \sim P; S^{m}; E$, and since $N = \mu + m\delta$, i.e. $m = (N - \mu)/\delta$, we have $B^{N} \sim P; S^{(N-\mu)/\delta}; E$.
The intuition behind the Tristan & Leroy rule can be seen in Figure 5, which is adapted (with permission) from Figure 3 in [16]. The horizontal sequence at the top of the figure represents the execution of the original code and the sequence at the bottom is the execution of the pipelined code.
In their POPL paper, Tristan & Leroy describe a symbolic evaluation method for proving the equivalences $(B^{\mu} \sim P; E)$ and $(E; B^{\delta} \sim S; E)$ for a particular source loop body B and target pipeline components P, S, and E. Instead, we have incorporated the Tristan & Leroy rule into the TVOC framework, where it is used to generate two verification conditions, simply $(B^{\mu} \sim P; E)$ and $(E; B^{\delta} \sim S; E)$, that are fed to the CVC theorem prover, along with the code for $B^{\mu}$, $B^{\delta}$, P, S, and E. If CVC finds the two conditions valid, then the pipelining is correct. Figure 6 shows the original and pipelined loops of our example program, above, along with the verification conditions, encoded for CVC. P, S, and E in the CVC code resulted from an SSA transformation applied to the pipeline code. $B^{3}$ and $B^{4}$, corresponding to $B^{\mu}$ and $B^{\delta}$, respectively, were generated by static loop unrolling followed by an SSA transformation. Equivalence between $B^{3}$ and P; E and between E; $B^{4}$ and S; E is checked in CVC by asserting that their inputs (the initial values of a, the i's, and the x's) are equal and querying CVC about the equality of their outputs (i.e. the final values of a, the i's, and the x's).
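The idea of checking two straight-line blocks by equating inputs and comparing outputs can be illustrated with a tiny symbolic evaluator. This sketch is neither TVOC's CVC encoding nor Tristan & Leroy's implementation: the instruction format, the operation names, and the example blocks are assumptions, array stores are not modelled, and terms are compared purely syntactically, so the check can reject blocks that are in fact equivalent.

```python
from typing import Dict, List, Sequence, Tuple, Union

Term = Union[str, int, tuple]        # input name, constant, or (op, *args)
Instr = Tuple[str, str, tuple]       # (destination, operation, operands)


def sym_eval(code: List[Instr]) -> Dict[str, Term]:
    """Map each assigned variable to a term over the initial values."""
    env: Dict[str, Term] = {}
    for dest, op, args in code:
        # Operands not yet assigned stand for their initial (input) values.
        vals = tuple(env.get(a, a) if isinstance(a, str) else a for a in args)
        env[dest] = (op, *vals)
    return env


def equivalent(c1: List[Instr], c2: List[Instr], outputs: Sequence[str]) -> bool:
    """Blocks agree if every output variable ends with the same symbolic term."""
    e1, e2 = sym_eval(c1), sym_eval(c2)
    return all(e1.get(v, v) == e2.get(v, v) for v in outputs)


# Hypothetical blocks: a load, a use of the loaded value, an index increment.
seq = [("x", "load", ("a", "i")), ("y", "add", ("x", 1)), ("i", "add", ("i", 1))]
reordered = [("x", "load", ("a", "i")), ("i", "add", ("i", 1)), ("y", "add", ("x", 1))]
broken = [("i", "add", ("i", 1)), ("x", "load", ("a", "i")), ("y", "add", ("x", 1))]
```

Here `equivalent(seq, reordered, ["x", "y", "i"])` holds because the increment of i is independent of the use of x, whereas `broken` moves the increment above the load and therefore changes the address being read. A theorem prover such as CVC can additionally reason about arithmetic and arrays, which this syntactic comparison cannot.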
Software pipelining, although an optimization that can be complicated to perform, lends itself nicely to simple translation validation rules, such as the Tristan & Leroy rule, because none of the pipeline prologue, kernel, or epilogue themselves contain loops or branches. Although the compiler has freedom to rearrange instructions within each of these blocks, the resulting code will still be amenable to equivalence checking by a theorem prover.
Future Work: Validating pipelining that uses hardware support
In practice, compilers that perform software pipelining often generate code for machines, such as the Intel IA64, that provide substantial hardware support for pipelining. This hardware support includes rotating registers to provide automatic renaming of variables (such as the loop index i in our example above) across iterations, thus avoiding the replication of identical code in overlapping iterations and reducing the size of the kernel code. Another form of hardware support for software pipelining is predication, which is the ability to turn off the execution of certain instructions at run time. Predication, in this case, supports the execution of prologue and epilogue code (which are subsets of the kernel instructions) by turning off certain instructions in the kernel during the ramp-up and ramp-down phases of the pipeline. As described in [4], predication can also be used to dynamically alter the software pipeline in order to preserve loop-carried dependences that can only be computed at run time.
Techniques for translation validation of software pipelining that use such hardware support have not yet been developed. As with performing TV for other kinds of machine-dependent optimizations, it will involve encoding the hardware features of the machine in a logical framework (e.g. as a set of CVC assertions).
