Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses required by compilers have increasingly shifted to understand data dependencies and work at multiple abstraction layers at the same time. This is partially evidenced in recent developments such as the MLIR proposed by Google for deep learning. However, rigorous use of data flow centric IRs in general purpose compilers has not been evaluated for feasibility and usability as previous works provide no practical implementations.
Introduction
Intermediate representations (IRs) are at the heart of every modern compiler. These data structures represent programs throughout compilation, connect individual compiler stages, and provide abstractions to facilitate the implementation of analyses, optimizations, and program transformations. A suitable IR highlights and exposes program properties that are important to the transformations in a specific compiler stage. This reduces the complexity of optimizations and simplifies their implementation.
Modern computer systems have become increasingly parallel and specialized as system designers strive to improve their computational power. In order to take full advantage of these systems, optimizing compilers need to expose a program's available parallelism and be able to work at multiple abstraction layers. This has led to an emerging interest in developing more efficient IRs for exposing the necessary information, as exemplified by MLIR [38] , an IR proposed by Google for deep learning.
Data flow centric IRs, such as the Value (State) Dependence Graph (V(S)DG) [45, 18, 21], have emerged as a promising class of IRs for optimizing compilers. These IRs are based on the observation that many optimizations require data flow rather than control flow information, and shift the focus to explicitly expose data instead of control flow. They represent programs in demand-dependence form, encode structured control flow, and explicitly model data flow between operations. This raises the IR's abstraction level, permits simple and powerful implementations of data flow optimizations, and helps to expose the inherent parallelism in programs [21, 18, 40]. However, the shift in focus from explicit control flow to only structured and implicit control flow requires more sophisticated construction and destruction methods [45, 21, 39]. In this context, Bahmann et al. [3] present the Regionalized Value State Dependence Graph (RVSDG) and conclusively address the problem of intra-procedural control flow recovery for demand-dependence graphs. They show that the RVSDG's restricted control flow constructs do not limit the complexity of the recoverable control flow.
In this work, we are concerned with the aspects of whole-program representation in the RVSDG. We present the required RVSDG constructs, consider construction and destruction at the program level, and show feasibility and practicality of this IR for optimizations by providing a practical compiler implementation. Specifically, we make the following contributions:
1. A complete RVSDG specification, including intra- and inter-procedural constructs.
2. A complete description of RVSDG construction and destruction, augmenting the previously proposed algorithms with the construction and destruction of inter-procedural constructs, as well as the handling of intra-procedural dependencies during construction.
3. A presentation of Dead Node Elimination (DNE) and Common Node Elimination (CNE) optimizations to demonstrate the RVSDG's utility. DNE combines dead and unreachable code elimination, as well as dead function removal. CNE permits the removal of redundant computations by detecting congruent operations.
4. A publicly available [32] prototype compiler that implements the discussed concepts. It consumes and produces LLVM IR, and is to our knowledge the first optimizing compiler that uses a demand-dependence graph as IR.
5. An evaluation of the RVSDG in terms of performance and size of the produced code, as well as compile time and representational overhead.
Our results show that the RVSDG can produce competitive code and that it can serve as the IR in a compiler's optimization stage. This work paves the way for further exploration of the RVSDG's properties and their effect on optimizations and analyses, as well as its usability in code generation for dataflow and parallel architectures.
Motivation
Contemporary optimizing compilers are predominantly based on imperative program representations in the form of control flow graphs or variants. These representations preserve the sequential nature of the input program and implicitly convey some of the semantics associated with the sequential execution model (e.g., order of access through potentially aliased references). In case of LLVM, the representation is based on the instruction set of a virtual CPU with operation semantics closely matching that of real CPUs. This choice of representation is somewhat at odds with the requirements of code optimization analysis, which is often focused on data dependence instead: As Table 1 illustrates, the majority of optimization passes executed are concerned with data flow analysis (in the form of SSA construction and interpretation, or in-memory data structures in the form of alias analysis and/or memory SSA).
We propose the (data-)dependence centric RVSDG as an alternative. While it requires considerably more effort to construct the RVSDG from an imperative program as well as to recover the necessary control flow for code generation, we believe that this cost is more than offset by the benefits provided to the analyses and optimization stages. The following sections illustrate this hypothesis by examples.

Table 1: Thirteen most invoked LLVM 7.0.1 passes at O3.

    Optimization                                            # Invocations
    1. Alias Analysis (-aa)                                 19
    2. Basic Alias Analysis (-basicaa)                      18
    3. Optimization Remark Emitter (-opt-remark-emitter)    15
    4. Natural Loop Information (-loops)                    14
    5. Lazy Branch Probability Analysis (-lazy-branch-prob) 14
    6. Lazy Block Frequency Analysis (-lazy-block-freq)     14
    7. Dominator Tree Construction (-domtree)               13
    8. Scalar Evolution Analysis (-scalar-evolution)        10
    9. CFG Simplifier (-simplifycfg)
Simplified Compilation by Strong Representation Invariants
The Control Flow Graph (CFG) in Static Single Assignment (SSA) form [10] is the predominant IR for optimizations in modern imperative language compilers [41] . Its nodes represent a list of totally ordered operations and its edges a program's possible control flow paths, permitting efficient control flow optimizations and simple code generation. The CFG's translation to SSA form improves the efficiency of many data flow optimizations [34, 44] . Figure 1a shows a function with a simple loop and a conditional, and Figure 1b shows the corresponding CFG in SSA form. This form is however not an intrinsic property of the CFG, but a specialized variant that needs to be actively maintained. Various compiler passes, such as jump threading or live-range splitting, may perform transformations that cause the CFG to no longer satisfy this form. As shown in Table 1 , LLVM requires SSA restoration [7] in 14 different passes.
Moreover, CFG-based compilers must constantly (re-)discover and canonicalize loops, or establish various invariants besides SSA form. Table 1 shows that six of the 13 most invoked passes in LLVM are helper passes only performing such tasks. They amount to 23% of all pass invocations. This lack of enforced invariants complicates the implementation of optimizations and analyses, increases engineering effort, unnecessarily prolongs compilation time, and leads to compiler bugs [22, 23, 24] .
In contrast, the RVSDG is always in strict SSA form as edges connect each operand input to only one output. It explicitly exposes desirable program structures, such as loops, in a tree structure (Section 4), similarly to the Program Structure Tree [19] . This eliminates the need for SSA restoration and the other helper passes from Table 1 . Figure 1c shows the RVSDG corresponding to Figure 1a . It is an acyclic demand-dependence graph where nodes represent simple operations or control flow constructs, and edges represent the dependencies between computations (see Section 4). In Figure 1c , simple operations are colored yellow, conditionals are green, loops are red, and functions are blue. 
Unified Representation of Different Levels of Program Structures
While the CFG can represent a single procedure, representation of programs as a whole requires additional data structures such as call graphs. The RVSDG can represent an entire program as a single data structure where a def-use dependency of one function on another is modeled the same way as the def-use dependency of scalar quantities. This makes it possible to apply the same program transformation at multiple levels resulting in a considerably smaller number of transformation passes and algorithms, e.g., unreachable code and dead function analysis turns out to be essentially the same as dead variable analysis (Section 6.1).
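To make the claim concrete, the following Python sketch treats functions, global variables, and scalar values uniformly as nodes of one def-use graph and computes liveness with a single worklist traversal. It is a simplified illustration, not the DNE algorithm of Section 6.1; the flat `operands` dictionary and the names `f`, `max`, `msg`, `g`, and `tmp` are assumptions made for this example.

```python
def live_nodes(operands, roots):
    """Return every node reachable from the roots via def-use dependencies.

    operands: dict mapping a node to the nodes it depends on. Functions,
              globals, and scalar values are all plain nodes here.
    roots:    nodes that must be kept, e.g. exported functions.
    """
    live, worklist = set(), list(roots)
    while worklist:
        node = worklist.pop()
        if node in live:
            continue
        live.add(node)
        worklist.extend(operands.get(node, ()))
    return live


# Hypothetical translation unit: f is exported and uses max and msg;
# g is a function nobody calls; tmp is a value nobody uses.
operands = {
    "f": ["max", "msg"],
    "max": [],
    "msg": [],
    "g": ["max"],
    "tmp": ["msg"],
}
live = live_nodes(operands, roots=["f"])
dead = set(operands) - live  # the unused function and the unused value
```

The same traversal removes the unused function `g` and the unused value `tmp`, because neither is distinguished from the other in the graph.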
Strongly Normalized Representation
The RVSDG program representation is much more strongly normalized than control flow representations. Programs differing only in the ordering of (independent) operations result in the same RVSDG representation, and loops and conditionals always take a single canonical form. This normalization already simplifies the implementation of transformations [45, 18, 21] and eliminates the need for (repeated) compiler analysis passes such as loop detection. Some common program optimizing transformations take a particularly simple form in the RVSDG representation. For example, Figure 1d shows the optimized RVSDG of Figure 1c, illustrating some of these optimizations: The inputs to the "upper left" plus operation are easily recognized as loop invariant because their "loop entry ports" connect directly to the corresponding "loop exit ports" (operations, ports, and edges highlighted in purple). A simple push strategy then recursively identifies data dependent operations as invariant and hoists them out of the loop: The addition and subtraction computing li1 and li2 are moved out of the loop (theta) as their operands, i.e., b, c, and d, are loop invariant (all three of them connect the entry of the loop to the exit). Similarly, the shift operation common to both conditional branches is hoisted and combined, while the division operation is moved into the conditional as it is only used in one alternative. In contrast to CFG-based compilers, all these optimizations are performed directly on the unoptimized RVSDG of Figure 1c and can be performed in a single regular pass. No additional data structures or helper passes are required. See also Section 6 for further details.
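The invariance check and push strategy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the prototype: the loop body is given as a dictionary in topological order, all operations are assumed to be free of side-effects, and the concrete operands of `li1`, `li2`, and `i_next` are hypothetical since the exact expressions of Figure 1 are not reproduced here.

```python
def invariant_names(loop_vars, body_nodes):
    """Identify loop-invariant values in a theta-like loop body.

    loop_vars:  dict mapping each loop variable (its argument name) to the
                origin of its result. A variable is invariant when the
                result is connected directly to its own argument.
    body_nodes: dict mapping each operation to the origins of its operands,
                listed in topological order (an assumption of this sketch).
    """
    invariant = {v for v, origin in loop_vars.items() if origin == v}
    # Push invariance forward: an operation whose operands are all
    # invariant is itself invariant and can be hoisted.
    for node, operands in body_nodes.items():
        if all(o in invariant for o in operands):
            invariant.add(node)
    return invariant


# Hypothetical loop: b, c, d pass straight through, i changes each iteration.
loop_vars = {"i": "i_next", "b": "b", "c": "c", "d": "d"}
body_nodes = {
    "li1": ["b", "c"],       # addition of two invariant values
    "li2": ["c", "d"],       # subtraction of two invariant values
    "i_next": ["i", "li1"],  # depends on the varying i, so not invariant
}
hoistable = invariant_names(loop_vars, body_nodes) & set(body_nodes)
```

No dominator tree or loop detection is needed in this sketch; the direct connection between a loop variable's argument and result is the only seed information.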
Exposing Independent Computations
CFGs implicitly represent a single global machine state by sequencing all operations that could affect it. While RVSDG can follow the same model, it is not actually limited to this interpretation. The RVSDG can instead model the system as consisting of multiple independent states. The code in Figure 1e is used to illustrate this concept: The depicted function contains two non-aliasing store operations (pointing to memory objects of incompatible types) and two independent loops.
In a CFG, both stores and loops are strictly ordered. Their mutual independence needs to be established by explicit compiler passes (and may need to be re-established multiple times during the compilation process, as the number of alias analysis passes in Table 1 illustrates) and represented using auxiliary data structures and/or annotations. In contrast, the RVSDG permits the encoding of such information directly in the graph, as shown in Figure 1f. Disjoint memory regions (consisting of int-typed and float-typed memory objects) are modeled as disjoint states, exposing the independence of affecting operations in the representation. The RVSDG can in principle go even further in representing a memory SSA form that is not formally any different from value SSA form, enabling the same kind of optimizations to be applied to both.
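The effect of splitting one global state into disjoint states can be illustrated with a small Python sketch. Two operations are independent when neither transitively depends on the other through value or state edges. The node names (`store_x`, `store_y`, `s0`, `s_int`, `s_float`) are assumptions chosen for this example and do not reproduce Figure 1f.

```python
def depends_on(deps, node, target):
    """True if `node` transitively depends on `target`."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, ()))
    return False


def independent(deps, a, b):
    """Two operations are independent if no dependency path links them."""
    return not depends_on(deps, a, b) and not depends_on(deps, b, a)


# One global state: the second store is chained after the first.
single_state = {"store_x": ["s0"], "store_y": ["store_x"]}

# Disjoint states for int-typed and float-typed memory objects.
split_states = {"store_x": ["s_int"], "store_y": ["s_float"]}
```

With a single state chain the two stores are ordered; with disjoint states the graph itself records that they may be reordered or executed in parallel.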
Summary
The RVSDG raises the IR abstraction level by enforcing desirable properties, such as SSA form, explicitly encoding important structures, such as loops, and relaxing the overly strict order of the input program. This leads to a more normalized program representation and avoids many idiosyncrasies and artifacts from other IRs, such as the CFG, and further helps to expose parallelism in programs.
Related Work
A cornucopia of IRs has been presented in the literature to better expose desirable program properties for optimizations. For the sake of brevity, we restrict our discussion to the most prominent IRs, only highlighting their strengths and weaknesses in comparison to the RVSDG, and refer the reader to Stanier et al. [41] for a more complete overview.
Control (Data) Flow Graph
The Control Flow Graph (CFG) [1] exposes the intra-procedural control flow of a function. Its nodes represent basic blocks, i.e., an ordered list of operations without branches or branch targets, and its edges represent the possible control flow paths between these nodes. This explicit exposure of control flow simplifies certain analyses, such as loop identification or irreducibility detection, and enables simple target code generation. The CFG's translation to SSA form [10], or one of its variants, such as gated SSA [43], thinned gated SSA [14], or future gated SSA [12], additionally improves the efficiency of data flow optimizations [44, 34]. These properties along with its simple construction from a language's abstract syntax tree made the CFG in SSA form the predominant IR for imperative language compilers [41], such as LLVM [20] and GCC [9]. However, the CFG has also been criticized as an IR for optimizing compilers [13, 17, 18, 21, 45, 47, 46]:
1. It is incapable of representing inter-procedural information. It requires additional IRs, e.g., the call graph, to represent such information.
2. It provides no structural information about a procedure's body. Important structures, such as loops, and their nesting need to be constantly (re-)discovered for optimizations, as well as normalized to make them amenable to transformations.
3. It emphasizes control dependencies, even though many optimizations are based on the flow of data. This is somewhat mitigated by translating it to SSA form or one of its variants, but in turn requires SSA restoration passes [7] to ensure SSA invariants.
4. It is an inherently sequential IR. The operations in basic blocks are listed in a sequential order, even if they are not dependent on each other. Moreover, this sequentialization also exists for structures such as loops, as two independent loops can only be represented in sequential order. Thus, the CFG is by design incapable of explicitly encoding independent operations.
5. It provides no means to encode additional dependencies other than control and true data dependencies. Other information, such as loop-carried dependencies or alias information, must regularly be recomputed and/or memoized in addition to the CFG.
The Control Data Flow Graph (CDFG) [27] tries to mitigate the sequential nature of the CFG by replacing the sequence of operations in basic blocks with the Data Flow Graph (DFG) [11] , an acyclic graph that represents the flow of data between operations. This relaxes the strict ordering within a basic block, but does not expose instruction level parallelism beyond basic block boundaries or between program structures.
Program Dependence Graph/Web
The Program Dependence Graph (PDG) [13, 15] combines control and data flow within a single representation. It features data and control flow edges, as well as statement, predicate, and region nodes. Statement nodes represent operations, predicate nodes represent conditional choices, and region nodes group nodes with the same control dependency. If a region's control dependencies are fulfilled, then its children can be executed in parallel. Horwitz et al. [16] extended the PDG to model inter-procedural dependencies by incorporating procedures into the graph.
The PDG improves upon the CFG by employing region nodes to relax the overly restrictive sequence of operations. This relaxed sequence combined with the unified representation of data and control dependencies simplifies complex optimizations, such as code vectorization [4] or the extraction of thread-level parallelism [29, 36]. However, the unified data and control flow representation results in a large number of edge types, five in Ferrante et al. [13] and four in Horwitz et al. [15], which need to be maintained to ensure the graph's invariants. The PDG suffers from aliasing and side-effect problems, as it supports no clear distinction between data held in registers and data held in memory. This complicates or can even preclude its construction altogether [18]. Moreover, program structure and SSA form still need to be discovered and maintained.
The Program Dependence Web (PDW) [28] extends the PDG and gated SSA [43] to provide a unified representation for the interpretation of programs using control-, data-, or demand-driven execution models. This simplifies the mapping of programs written in different paradigms, such as the imperative or functional paradigm, to different architectures, such as Von-Neumann and dataflow architectures. In addition to the elements of the PDG, the PDW adds µ nodes to manage initial and loop-carried values and η nodes to manage loop-exit values. Campbell et al. [5] further refined the definition of the PDW by replacing µ nodes with β nodes and eliminating η nodes. As the PDW is based on the PDG, it suffers from the same aliasing and side-effect problems. PDW's additional constructs further complicate graph maintenance and its construction is elaborate, requiring three additional passes over a PDG, and is limited to programs with reducible control flow.
Value (State) Dependence Graph
The Value Dependence Graph (VDG) [45] abandons the explicit representation of control flow and only models the flow of values using ports. Its nodes represent simple operations, the selection between values, or functions, using recursive functions to model loops. The VDG is implicitly in SSA form and abandons the sequential order of operations from the CFG, as each node is only dependent on its values. However, modeling only data flow between stateful computations raises a significant problem in terms of preservation of program semantics, as the "evaluation of the VDG may terminate even if the original program would not..." [45] .
The Value State Dependence Graph (VSDG) [17, 18] addresses the VDG's termination problem by introducing state edges. These edges are used to model the sequential execution of stateful computations. In addition to nodes for representing simple operations and selection, it introduces nodes to explicitly represent loops. Like the VDG, the VSDG is implicitly in SSA form, and nodes are solely dependent on required operands, avoiding a sequential order of operations. However, the VSDG supports no interprocedural constructs, and its selection operator is only capable of selecting between two values based on a predicate. This complicates destruction, as selection nodes must be combined to express conditionals. Even worse, the VSDG represents all nodes as a flat graph, which simplifies optimizations [18] , but has a severe effect on evaluation semantics. Operations with side-effects are no longer guarded by predicates, and care must be taken to avoid duplicated evaluation of these operations. In fact, for graphs with stateful computations, lazy evaluation is the only safe strategy [21] . The restoration of a program with an eager evaluation semantics complicates destruction immensely, and requires a detour over the PDG to arrive at a unique CFG [21] . Zaidi et al. [46, 47] adapted the VSDG to spatial hardware and sidestepped this problem by introducing a predication-based eager/dataflow semantics. The idea is to effectively enforce correct evaluation of operations with side-effects by using predication. While this seems to circumvent the problem for spatial hardware, it is unclear what the performance implications would be for conventional processors.
The RVSDG solves the VSDG's eager evaluation problem by introducing regions. These regions enable the modeling of control flow constructs as nested nodes, and the guarding of operations with side-effects. This avoids any possibility of duplicated evaluation, and in turn simplifies RVSDG destruction. Moreover, nested nodes permit the explicit encoding of a program's hierarchical structure into the graph, further simplifying optimizations.
The Regionalized Value State Dependence Graph
A Regionalized Value State Dependence Graph (RVSDG) is an acyclic hierarchical multigraph consisting of nested regions. A region $R = (A, N, E, R)$ represents a computation with argument tuple $A$, nodes $N$, edges $E$, and result tuple $R$, as illustrated in Figure 2a. A node can be either simple, i.e., it represents a primitive operation, or structural, i.e., it contains regions. Each node $n \in N$ has a tuple of inputs $I$ and outputs $O$. In the case of simple nodes, they correspond to arguments and results of the represented operation, whereas for structural nodes, they map to arguments and results of the contained regions. For nodes $n_1, n_2 \in N$, an edge $(g, u) \in E$ connects either output $g \in O_{n_1}$ or argument $g \in A$ to either input $u \in I_{n_2}$ or result $u \in R$ of matching type. We refer to $g$ as the origin of an edge, and to $u$ as the user of an edge. Every input or result is the user of exactly one edge, whereas outputs or arguments can be the origins of multiple edges. All inputs or results of an origin are called its users. The corresponding node of an origin is called its producer, whereas the corresponding node of a user is called its consumer. Correspondingly, the set of nodes of all users of an origin is referred to as its consumers. The types of inputs and outputs are either values, representing arguments or results of computations, or states, used to impose an order on operations with side-effects. A node's signature consists of the types of its inputs and outputs, whereas a region's signature consists of the types of its arguments and results. Throughout this paper, we use $n$, $e$, $i$, $o$, $a$, and $r$ with sub- and superscripts to denote individual nodes, edges, inputs, outputs, arguments, and results, respectively. We use $g$ and $u$ to denote an edge's origin and user, respectively. An edge $e$ from origin $g$ to user $u$ is also denoted as $e : (g, u)$, or in short $(g, u)$.
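The definitions above map naturally onto a small data structure. The following Python sketch is one possible encoding, not the layout used by the prototype compiler; the class names `Port`, `Node`, and `Region` and the choice to store edges as a user-to-origin dictionary are assumptions of this example. Storing edges keyed by user enforces the rule that every input or result is the user of exactly one edge.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Port:
    """An origin (output or argument) or a user (input or result)."""
    name: str
    kind: str = "value"  # "value" or "state"


@dataclass(eq=False)
class Node:
    op: str
    inputs: list = field(default_factory=list)   # users
    outputs: list = field(default_factory=list)  # origins
    regions: list = field(default_factory=list)  # non-empty for structural nodes

    @property
    def is_structural(self):
        return bool(self.regions)


@dataclass(eq=False)
class Region:
    arguments: list = field(default_factory=list)  # origins
    results: list = field(default_factory=list)    # users
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)      # user -> origin

    def connect(self, origin, user):
        """Add edge (origin, user); a user may have only one origin."""
        if user in self.edges:
            raise ValueError(f"user {user.name} already has an origin")
        if origin.kind != user.kind:
            raise TypeError("edge must connect ports of matching type")
        self.edges[user] = origin

    def users(self, origin):
        """All inputs or results fed by the given origin."""
        return [u for u, g in self.edges.items() if g is origin]


# A region computing a + a: argument a is the origin of two edges.
a, b, r = Port("a"), Port("b"), Port("r")
add = Node("add", inputs=[Port("i0"), Port("i1")], outputs=[Port("o0")])
region = Region(arguments=[a, b], results=[r], nodes=[add])
region.connect(a, add.inputs[0])
region.connect(a, add.inputs[1])
region.connect(add.outputs[0], r)
```

Because each user appears at most once as a key, the strict SSA property holds by construction in this encoding rather than being restored by a separate pass.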
The RVSDG can model programs at different abstraction levels. It can represent simple data-flow graphs such as those used in machine learning frameworks, but it can also represent programs at the machine level as used in compiler back-ends for code generation. This flexibility makes it possible to use the RVSDG for the entire compilation pipeline. In this paper, we target an abstraction level similar to that of LLVM IR. This permits us to illustrate all of the RVSDG's features without involving architecture-specific details. The rest of this section defines the necessary constructs.
Nodes
Simple nodes model primitive operations such as addition, subtraction, load, and store. They have an operator associated with them, and a node's signature must correspond to the signature of its operator. Simple nodes map their input value tuple to their output value tuple by evaluating their operator with the inputs as arguments, and associating the results with their outputs. Figure 2b illustrates the use of simple nodes as well as value and state edges. Solid lines represent value edges, whereas dashed lines represent state edges. Nodes have as many value inputs and outputs as their corresponding operations demand. The ordering of the load and store nodes is preserved by sequentializing them with the help of a state edge.
Structural nodes contain regions and can model structural program behavior such as the conditional or repeated evaluation of computations. We present six different kinds of structural nodes: γ-nodes, which represent conditionals, θ-nodes, which represent tail-controlled loops, λ-nodes for procedures and functions, δ-nodes for global variables, φ-nodes for mutually recursive environments, and ω-nodes for translation units. The rest of this section discusses each structural node in detail and illustrates their usage.
Gamma-Nodes
A γ-node models a decision point and contains regions $R_0, ..., R_k$ with $k > 0$ of matching signature. Its first input is a predicate, which determines the region under evaluation. It evaluates to an integer $v$ with $0 \le v \le k$. The values of all other inputs are mapped to the corresponding arguments of region $R_v$, $R_v$ is evaluated, and the values of its results are mapped to the outputs of the γ-node.
γ-nodes represent conditionals with symmetric control flow splits and joins, such as if-then-else or switch statements without fall-throughs. Figure 2c shows a γ-node. It contains three regions: one for each case, and a default region. The map node takes the value of x as input and maps it to zero, one, or two, determining the region under evaluation. This region is evaluated and its result is mapped to the γ-node's output.
We define the entry variable of a γ-node as a pair of an input and the arguments the input maps to during evaluation, as well as the exit variable of a γ-node as a pair of an output and the results the output could receive its value from:
Definition 1
The pair $ev_l = (i_l, A_{l-1})$ is the l-th entry variable of a γ-node with k regions. It consists of the l-th input and tuple $A_{l-1} = \{a_{l-1}^{R_0}, ..., a_{l-1}^{R_k}\}$ with the (l − 1)-th argument from each region. We refer to the set of all entry variables as EV.
Definition 2
The pair $ex_l = (R_l, o_l)$ is the l-th exit variable of a γ-node with k regions. It consists of a tuple $R_l = \{r_l^{R_0}, ..., r_l^{R_k}\}$ of the l-th result from each region and the l-th output they would map to. We refer to the set of all exit variables as EX.
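The evaluation rule for γ-nodes can be sketched as a short Python function. This is an illustration only: regions are modeled as plain callables from an argument tuple to a result tuple, and the `match` function and the three region bodies are hypothetical stand-ins in the spirit of Figure 2c, not its actual content.

```python
def eval_gamma(predicate, inputs, regions):
    """Evaluate a gamma-node.

    predicate: integer v selecting region R_v
    inputs:    values of all inputs other than the predicate
    regions:   list of callables mapping arguments to a result tuple
    """
    if not 0 <= predicate < len(regions):
        raise ValueError("predicate does not select a region")
    # Only the selected region is evaluated; its results become the outputs.
    return tuple(regions[predicate](*inputs))


def match(x):
    """Hypothetical map node: case 1 -> 0, case 2 -> 1, default -> 2."""
    return {1: 0, 2: 1}.get(x, 2)


regions = [
    lambda y: (y + 1,),  # case 1
    lambda y: (y * 2,),  # case 2
    lambda y: (y,),      # default
]


def switch(x, y):
    return eval_gamma(match(x), (y,), regions)
```

All regions share one signature, so the caller does not need to know which region was taken to consume the outputs.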
Theta-Nodes
A θ-node models a tail-controlled loop. It contains one region that represents the loop body. The length and signature of its input tuple equal those of its output tuple and of the region's argument tuple. The first region result is a predicate. Its value determines the continuation of the loop. When a θ-node is evaluated, the values of all its inputs are mapped to the corresponding region arguments and the body is evaluated. When the predicate is true, all other results are mapped to the corresponding arguments for the next iteration. Otherwise, the result values are mapped to the corresponding outputs. The loop body of an iteration is always fully evaluated before the evaluation of the next iteration. This avoids "deadlock" problems between computations of the loop body and the predicate, and results in well-defined behavior for non-terminating loops that update external state.
θ-nodes permit the representation of do-while loops. In combination with γ-nodes, it is possible to model head-controlled loops, i.e., for and while loops. Thus, employing tail-controlled loops as basic loop construct enables us to express more complex loops as a combination of basic constructs. This normalizes the representation and reduces the complexity of optimizations as there exists only one construct for loops. Another benefit of tail-controlled loops is that their body is guaranteed to execute at least once, enabling the unconditional hoisting of invariant code with side-effects. Figure 2d shows a θ-node with two loop variables, n and r, and an additional result for the predicate. When the predicate evaluates to true, the results for n and r of the current iteration are mapped to the region arguments to continue with the next iteration. When the predicate evaluates to false, the loop exits and the results are mapped to the node's outputs. We define a loop variable as a quadruple that represents a value routed through a θ-node:
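Definition 3
The quadruple $lv_l = (i_l, a_l, r_{l+1}, o_l)$ is the l-th loop variable of a θ-node. It consists of the l-th input $i_l$, argument $a_l$, and output $o_l$, and the (l + 1)-th result $r_{l+1}$ of a θ-node. We refer to the set of all loop variables as LV.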
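The θ-node semantics, and the construction of a head-controlled loop from a γ-node wrapped around a θ-node, can be sketched in Python as follows. The loop body is modeled as a callable returning the predicate followed by the new loop variable values. The function names and the example body over `n` and `r` are assumptions for illustration and do not reproduce the body shown in Figure 2d.

```python
def eval_theta(inputs, body):
    """Evaluate a theta-node (tail-controlled loop).

    body maps the argument tuple to (predicate, *results), where the
    results have the same length as the arguments.
    """
    args = tuple(inputs)
    while True:
        predicate, *results = body(*args)
        if len(results) != len(args):
            raise ValueError("loop body must return one result per argument")
        args = tuple(results)
        if not predicate:
            return args  # results are mapped to the outputs


def eval_head_controlled(inputs, cond, step):
    """A while loop expressed as a gamma-node guarding a theta-node."""
    def body(*args):
        new = step(*args)
        return (cond(*new), *new)

    if cond(*inputs):  # gamma region containing the theta-node
        return eval_theta(inputs, body)
    return tuple(inputs)  # gamma region that passes the values through


def product_body(n, r):
    """Hypothetical do-while body: r *= n; n -= 1; continue while n > 1."""
    r = r * n
    n = n - 1
    return (n > 1, n, r)


do_while_result = eval_theta((5, 1), product_body)
ran_once_result = eval_theta((0, 1), product_body)
while_result = eval_head_controlled(
    (0, 1), lambda n, r: n > 1, lambda n, r: (n - 1, r * n)
)
```

The second call shows that the body of a tail-controlled loop runs once even when the condition is initially false, which is why the guarding γ-node is needed to recover `while` semantics.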
Lambda-Nodes
A λ-node models a function and contains a single region representing a function's body. It features a tuple of inputs and a single output. The inputs refer to external variables the λ-node depends on, and the output represents the λ-node itself. The region has a tuple of arguments comprised of a function's external dependencies and its arguments, and a tuple of results corresponding to a function's results.
An apply-node represents a function invocation. Its first input takes a λ-node's output as origin, and all other inputs represent the function arguments. In the rest of the paper, we refer to an apply-node's first input as its function input, and to all its other inputs as its argument inputs. Invocation maps the values of a λ-node's input k-tuple to the first k arguments of the λ-region, and the values of the function arguments of the apply-node to the rest of the arguments of the λ-region. The function body is evaluated and the values of the λ-region's results are mapped to the outputs of the apply-node. Figure 3a shows an RVSDG with two λ-nodes. Function f calls functions puts and max with the help of apply-nodes. The function max is part of the translation unit, while puts is external and must be imported (see the paragraph about ω-nodes for more details). We further define the context variable of a λ-node. A context variable provides the corresponding input and argument for a variable a λ-node depends on.
Definition 4
The pair $cv_l = (i_l, a_l)$ is a λ-node's l-th context variable. It consists of the l-th input and argument. We refer to the set of all context variables as CV.
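The invocation rule for apply-nodes can be sketched in Python. The dictionary encoding of a λ-node and the helper names `make_lambda` and `eval_apply` are assumptions of this example. The values of the context variables are bound to the first arguments of the λ-region, followed by the argument inputs of the apply-node.

```python
def make_lambda(context, body):
    """Model a lambda-node: context variable values plus a body callable."""
    return {"context": tuple(context), "body": body}


def eval_apply(function, *arguments):
    """Model an apply-node: context values first, then the call arguments."""
    return tuple(function["body"](*function["context"], *arguments))


# max has no external dependencies.
max_fn = make_lambda((), lambda x, y: (x if x > y else y,))

# f depends on max through a context variable, so the def-use dependency
# between the two functions is an ordinary edge.
f = make_lambda((max_fn,), lambda callee, a, b: eval_apply(callee, a, b))

result = eval_apply(f, 3, 7)
```

Passing `max_fn` as a context value mirrors how a function's dependency on another function is modeled the same way as a dependency on a scalar value.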
Definition 5
The λ-node connected to a function input is the callee of an apply-node, and an apply-node is the caller of a λ-node. We refer to the set of all callers of a λ-node as CLL.
Delta-Nodes
A δ-node models a global variable. Its inputs refer to external variables the δ-node depends on, and the output represents the δ-node itself. The region has a tuple of arguments representing a global variable's external dependencies and a single result corresponding to its right-hand side value. Figure 3a shows an RVSDG with a δ-node. Function puts takes a string as argument that is the right-hand side of a global variable. Similarly to λ-nodes, we define the context variable of a δ-node. It provides the corresponding input and argument for a variable a δ-node depends on.
Definition 6
The pair $cv_l = (i_l, a_l)$ is a δ-node's l-th context variable. It consists of the l-th input and argument. We refer to the set of all context variables as CV.
Phi-Nodes
A φ-node models an environment with mutually recursive functions, and contains a single region with λ-nodes. Each single output of these λ-nodes serves as origin to a single result in the φ-region. A φ-node's outputs expose the individual functions to callers outside the φ-region, and must therefore have the same arity and signature as the results of the φ-region. The first input of an apply-node from outside the φ-region takes these outputs as origin to invoke one of the functions.
The inputs of a φ-node refer to variables that the contained functions depend on and are mapped to corresponding arguments in the φ-region when a function is invoked. In addition, a φ-region has arguments for each contained function. An apply-node from inside a φ-region takes these as origin to its function input.
φ-nodes permit a program's mutually recursive functions to be expressed in the RVSDG without the introduction of cycles. Figure 3b shows an RVSDG with a φ-node. The function f calls itself, and therefore needs to be in a φ-node to preserve the RVSDG's acyclicity. The region in the φ-node has one argument, representing the declaration of f, and one result, representing the definition of f. The φ-node has one output so that f can be called from outside the recursive environment.
We define context variables and recursion variables. Context variables provide corresponding inputs and arguments for variables the λ-nodes from within a φ-region depend on. Recursion variables provide the argument and output an apply-node's function input connects to.
Definition 7
The pair $cv_l = (i_l, a_l)$ is the $l$-th context variable of a φ-node. It consists of the $l$-th input and argument. We call the set of all context variables $CV$.
Definition 8
For a φ-node with $n$ context variables, the triple $rv_l = (r_l, a_{l+n}, o_l)$ is the $l$-th recursion variable. It consists of the $l$-th result and $(l+n)$-th argument of the φ-region as well as the $l$-th output of the φ-node. We refer to the set of all recursion variables as $RV$.
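The indexing in Definitions 7 and 8 can be illustrated with a minimal Python sketch. The names `ContextVar`, `RecursionVar`, and `phi_variables` are hypothetical and not part of any implementation described here, and the sketch uses 0-based list positions, whereas the definitions count from one.

```python
from typing import NamedTuple


class ContextVar(NamedTuple):
    input: str     # i_l: l-th input of the phi-node
    argument: str  # a_l: l-th argument of the phi-region


class RecursionVar(NamedTuple):
    result: str    # r_l: l-th result of the phi-region
    argument: str  # a_{l+n}: (l+n)-th argument of the phi-region
    output: str    # o_l: l-th output of the phi-node


def phi_variables(inputs, arguments, results, outputs):
    """Group the ports of a phi-node into context and recursion variables."""
    n = len(inputs)  # number of context variables
    if not (len(results) == len(outputs) == len(arguments) - n):
        raise ValueError("port counts do not match Definitions 7 and 8")
    cvs = [ContextVar(inputs[l], arguments[l]) for l in range(n)]
    rvs = [RecursionVar(results[l], arguments[l + n], outputs[l])
           for l in range(len(results))]
    return cvs, rvs
```

For a φ-node with one context variable and one contained function, the region has two arguments: the first belongs to the context variable and the second to the recursion variable.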
Omega-Nodes
An ω-node models a translation unit. It is the top-level node of an RVSDG and has no inputs or outputs. It contains exactly one region. This region's arguments represent entities that are external to the translation unit and therefore need to be imported. Its results mark all exported entities in the translation unit. Figures 3a and 3b illustrate the usage of ω-nodes. The ω-region in Figure 3a has one argument, representing the import of function g, and one result, representing the export of function f. The ω-region in Figure 3b has only one export for function f.
Edges
Edges connect node outputs or region arguments to a node input or region result, and are either value typed, i.e., represent the flow of data between computations, or state typed, i.e., impose an ordering on operations with side-effects. State edges are used to preserve the observational semantics of the input program by ordering its side-effecting operations. Such operations include memory reads and writes, as well as exceptions.
In practice, a richer type system permits further distinction between different kinds of values or states. For example, different types for fixed- and floating-point values help to distinguish between these arithmetics, and a type for functions makes it possible to correctly specify the output types of λ-nodes and the function input of apply-nodes.
Construction & Destruction
RVSDG construction and destruction are responsible for generating an RVSDG from an input program and reestablishing control flow for code generation, respectively. We present both stages with an Inter-Procedure Graph (IPG) and a CFG as input and output. The IPG is an extension of a call graph and captures all static dependencies between functions, incorporating not only those originating from (direct) calls, but also those from other references within a function. In the IPG, an edge from node n1 to node n2 exists, if the body of the function corresponding to n1 references the function represented by n2. The utilization of an IPG and a CFG permits a language-independent presentation of RVSDG construction and destruction.
Construction
RVSDG construction is responsible for mapping all constructs, concepts, and abstractions of an input language to the RVSDG. The mapping is language-specific and depends on the language's concrete features. For example, languages with possibly unstructured control flow, such as C or C++, cannot be mapped directly to the RVSDG and require the CFG as a stepping stone, while other languages, such as Haskell, permit a direct construction [31]. In this section, we present RVSDG construction for the former case as it supersedes the latter. Conceptually, RVSDG construction can be split into two phases: Inter-Procedural Translation (Inter-PT) and Intra-Procedural Translation (Intra-PT). Inter-PT invokes Intra-PT for each function's body. Both phases interact with each other through a common symbol table. This table maps function and CFG variables to the corresponding RVSDG arguments or outputs, and every creation of a node or region triggers updates to this table. We omit these updates in our algorithm descriptions to avoid unnecessary clutter.
Inter-Procedural Translation
Inter-PT converts all functions from the Inter-Procedure Graph (IPG) of a translation unit to λ-nodes. Figure 4b shows the IPG for the code in Figure 4a . The code consists of four functions, with function sum performing two indirect calls. The corresponding IPG consists of four nodes and three edges. All edges originate from node tot, as it is the only function that explicitly references other functions, i.e. sum for a direct call, and f and g to pass as argument. No edge originates from node sum, as the corresponding function does not explicitly reference any other functions, and the functions for the indirect calls are provided as arguments.
The RVSDG puts two constraints on the translation from an IPG. Firstly, mutually recursive functions are required to be created within φ-nodes to preserve the RVSDG's acyclicity. Secondly, Inter-PT must respect the calling dependencies of functions to ensure that λ-nodes are created before their apply-nodes. In order to embed mutually recursive functions into φ-nodes, we need to identify the strongly connected components (SCCs) in the IPG. We consider an SCC trivial if it consists only of a single node with no self-referencing edges. Otherwise, it is non-trivial. Moreover, a trivial SCC might not have a CFG associated with it, in which case the corresponding function is defined in another translation unit.
Algorithm I outlines the RVSDG construction from an IPG. It finds all SCCs and converts trivial SCCs to individual λ-nodes, while the λ-nodes created from non-trivial SCCs are embedded in φ-nodes. This satisfies the first constraint. The second constraint is satisfied by processing SCCs in topological order, creating λ-nodes before their apply-nodes. The identification and ordering of SCCs can be performed in a single step with Tarjan's algorithm [42], which returns the identified SCCs in reverse topological order. Figure 4c shows the RVSDG after the application of Algorithm I to the IPG in Figure 4b. In addition to a function's arguments, Algorithm I adds a state argument and result to λ-regions (the red dashed line in Figure 4c). This state is used to sequentialize stateful computations. Nodes representing operations with side-effects consume this state and produce a new state for the next node.
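The SCC step can be sketched in a few lines of Python. This is a minimal illustration and not Algorithm I itself: the IPG is assumed to be a dictionary from function names to the functions they reference, and `tarjan_sccs`, `is_trivial`, and `plan_translation` are hypothetical names. The sketch only decides, for each SCC, whether a plain λ-node or an enclosing φ-node would be needed, in the order in which Tarjan's algorithm emits the SCCs.

```python
def tarjan_sccs(graph):
    """Return the SCCs of `graph` (node -> list of successors).

    SCCs are emitted in reverse topological order of the condensation,
    i.e., referenced functions come before the functions referencing them.
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def is_trivial(scc, graph):
    """A trivial SCC is a single node without a self-referencing edge."""
    return len(scc) == 1 and scc[0] not in graph.get(scc[0], ())


def plan_translation(graph):
    """List ("lambda" | "phi", functions) pairs in creation order."""
    return [("lambda" if is_trivial(scc, graph) else "phi", sorted(scc))
            for scc in tarjan_sccs(graph)]


# IPG shaped like the one described for Figure 4b: only tot references others.
ipg = {"tot": ["sum", "f", "g"], "sum": [], "f": [], "g": []}
plan = plan_translation(ipg)
```

For this IPG, every SCC is trivial and `tot` is planned last, so the λ-nodes it references already exist when its body is translated. The recursive variant is simple and readable, but an iterative formulation would be preferable for very deep IPGs.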
Intra-Procedural Translation
The RVSDG puts several constraints on the translation of intra-procedural control and data flow. Firstly, it requires that the control flow only consists of constructs that can be translated to γ- and θ-nodes, i.e., it can only consist of tail-controlled loops and conditionals with symmetric control flow splits and joins. Secondly, the nesting of these constructs and their relation to each other must be known, as the RVSDG is a hierarchical representation. Thirdly, it is necessary to know the data dependencies of these structures in order to construct γ- and θ-nodes. While these constraints are beneficial for optimizations by substantially simplifying their implementation, they render RVSDG construction non-trivial. This section's construction algorithm enables the translation of any data and control flow, regardless of its complexity, to the RVSDG. It creates a λ-region from a function's body in four stages:
1. Control Flow Restructuring (CFR) restructures a function's CFG to make it amenable to RVSDG construction.
2. Structural Analysis constructs a control tree [26] from the restructured CFG, discovering the CFG's individual control flow regions.
3. Demand Annotation annotates the discovered control flow regions with the variables that are demanded by the instructions within these regions.
4. Control Tree Translation converts the annotated control tree into a λ-region.
CFR ensures the first requirement by translating a function's control flow to a form that is amenable to RVSDG construction. It restructures control flow to a form that enables the direct mapping of a CFG's control flow regions to the RVSDG's γ- and θ-nodes. CFR can be omitted for languages with limited control flow structures, such as Haskell or Scheme. Structural analysis ensures the second requirement by constructing a control tree from the CFG, exposing the control regions' nesting and their relation to each other. Demand annotation fulfills the third requirement by annotating the control tree's nodes with their data dependencies. Finally, the annotated control tree can be translated to a λ-region. The rest of this section covers the four stages in detail.
Control Flow Restructuring: CFR converts a CFG to a form that only contains tail-controlled loops and conditionals with properly nested splits and joins. This stage is only necessary for languages that support more complex control flow constructs, such as goto statements or short-circuit operators, but can be omitted for languages with more limited control flow. CFR consists of two interlocked phases: loop restructuring and branch restructuring. Loop restructuring transforms all loops to tail-controlled loops, while branch restructuring ensures conditionals with symmetric control flow splits and joins.
We omit an extensive discussion of CFR as it is detailed in Bahmann et al. [3]. In contrast to node splitting approaches [48], CFR avoids the possibility of exponential code blowup [6] by inserting additional predicates and branches instead of cloning nodes. Moreover, it does not require a CFG in SSA form as this form is automatically established throughout construction.
-Branch Region: A subgraph with the entry and exit node representing the control flow split and join, respectively, and each branch alternative consisting of a single node.
-Loop Region: A single node with an edge that both originates from and targets this node.
These control flow regions and their corresponding nesting structure can be exposed by performing an interval [26] or structural [37] analysis. The analysis result is a control tree [26] with basic blocks as leaves and abstract nodes representing the control flow regions as branches.
A linear region maps to a linear node in the control tree with the linear subgraph's entry and exit node as the node's leftmost and rightmost child, respectively. A branch region maps to two control tree nodes: a branch node and a linear node. The branch node represents the region's alternatives with the corresponding nodes as its children. A linear node with three children can then be used to capture the rest of the branch region. Its first child is the region's entry node, the second child the branch node representing the alternatives, and the third child the region's exit node. Finally, a loop region maps to a loop node with the region's single node as its child. Figure 5a shows Euclid's algorithm as a CFG, and Figure 5b shows the same CFG after CFR, which restructured the head-controlled loop to a tail-controlled loop. The left of Figure 5c shows the corresponding control tree.
Demand Annotation: Structural analysis exposes the necessary control flow regions for a direct translation to an RVSDG. A control tree's branch and loop nodes can directly be mapped to γ- and θ-nodes, and individual instructions to simple nodes. However, a further necessity for the efficient generation of these RVSDG nodes is the exposure of their data dependencies. This is the task of demand annotation. It exposes these data dependencies by annotating control tree nodes with the variables that are demanded by the instructions within control flow regions. It accomplishes this using a read-write and a demand-set annotation pass. The read-write pass annotates each control tree node with the set of read and written variables of the corresponding control flow region, while the demand-set pass uses these variables to annotate each control tree node with the set of demanded variables, i.e., variables that are necessary to fulfill the dependencies of the instructions within a control flow region.
Algorithm II shows the details of the two passes. The read-write pass annotates each node with the read set $R$ and write set $W$. It processes the tree in post-order, building up the two sets from the innermost to the outermost nested control flow region. For linear nodes, the children are processed from right to left, i.e., bottom-up in the restructured CFG, to create the two sets. For branch nodes, a variable is only considered to be written if it is in the write set of all the node's children, i.e., it was written in all alternatives of a conditional. The demand-set pass uses the read set $R$ and write set $W$ to construct a demand set $D$ for each node. The algorithm is initialized with an empty set $D_t$, which is used to keep track of demanded variables during traversal. The demand-set pass traverses the tree such that it follows a bottom-up traversal of the restructured CFG, adding and removing variables from $D_t$ during this traversal according to each node's rules. For branch nodes, each child is processed with a copy of $D_t$, as the corresponding alternatives of the conditional are independent of one another. For loop nodes, the θ-node's requirement that inputs and outputs must have the same signature necessitates that $R$ is added to $D_t$ before the loop's body is processed. The right of Figure 5c shows the traversal order for the two passes along with the read, write, and demand set for each node of the control tree on the left.
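The read-write pass can be sketched as follows. This is a hedged illustration rather than Algorithm II: the `Node` class and `annotate_rw` are hypothetical, instructions are modeled as `(written variable or None, [read variables])` pairs, and the rules for basic blocks and loop nodes are assumptions. Only the right-to-left processing of linear nodes and the intersection rule for branch nodes are taken from the description above. The demand-set pass is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                     # "block", "linear", "branch", "loop"
    children: list = field(default_factory=list)
    instrs: list = field(default_factory=list)    # blocks only: (dest or None, [sources])
    R: set = field(default_factory=set)           # read set
    W: set = field(default_factory=set)           # write set


def annotate_rw(node):
    """Annotate `node` and its descendants with R and W in post-order."""
    for child in node.children:
        annotate_rw(child)

    if node.kind == "block":
        # Assumption: walk the instructions bottom-up; a write hides later reads.
        R, W = set(), set()
        for dest, srcs in reversed(node.instrs):
            if dest is not None:
                R.discard(dest)
                W.add(dest)
            R.update(srcs)
    elif node.kind == "linear":
        # Children are processed from right to left (bottom-up in the CFG).
        R, W = set(), set()
        for child in reversed(node.children):
            R = (R - child.W) | child.R
            W = W | child.W
    elif node.kind == "branch":
        # A variable counts as written only if every alternative writes it.
        R = set().union(*(c.R for c in node.children))
        W = set.intersection(*(c.W for c in node.children))
    elif node.kind == "loop":
        # Assumption: a loop node inherits the sets of its body.
        (body,) = node.children
        R, W = set(body.R), set(body.W)
    else:
        raise ValueError(f"unknown node kind: {node.kind}")

    node.R, node.W = R, W
    return node


# Hypothetical conditional: only one alternative writes y, both write x.
entry = Node("block", instrs=[("c", ["a", "b"]), (None, ["c"])])
alt1 = Node("block", instrs=[("x", ["a"])])
alt2 = Node("block", instrs=[("y", ["b"]), ("x", ["b"])])
exit_ = Node("block", instrs=[("r", ["x", "y"])])
tree = annotate_rw(Node("linear", children=[entry, Node("branch", children=[alt1, alt2]), exit_]))
```

In this example, `y` stays in the read set of the enclosing linear node because only one alternative writes it, whereas `x` is written on both paths and is therefore no longer read from outside the region.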
Control Tree Translation: After demand annotation, each node of the control tree is annotated with the set of variables that its instructions require, i.e., their data dependencies. Finally, the control tree translation constructs a λ-region from the control tree along with its annotated demand sets. Algorithm III shows the details. The algorithm processes each node in the control tree, creating γ- and θ-nodes for all branch and loop nodes, respectively. For the outputs of γ-nodes, the algorithm uses the demand set of the right sibling, which corresponds to the branch region's join node in the CFG. Figure 5d shows the resulting RVSDG nodes for the example.
Modeling Stateful Computations
Algorithm I adds an additional state argument and result to every λ-node. This state is used to sequentialize all stateful computations within a function. Nodes with side-effects consume this state and produce a new state for consumption by the next node. This single state ensures that the order of operations with side-effects in the RVSDG is according to the total order specified in the original program, ensuring correct observable behavior. Specifically, the use of a single state for sequentializing stateful operations ensures that the order of these operations in the RVSDG is equivalent to the order in the restructured CFG.
The utilization of a single state is, however, overly conservative, as different computations can have mutually exclusive side-effects. For example, the side-effect of a non-terminating loop is unrelated to a non-dereferenceable load. These stateful computations can be modeled independently with the help of distinct states, as depicted in Figure 1f. This results in the explicit exposure of more concurrent computations, as loops with no memory operations would become independent from other loops with memory operations. Moreover, the possibility of encoding independent states can also be leveraged by analyses and optimizations. For example, alias analysis can directly encode independent memory operations into the RVSDG by introducing additional memory states. Pure functions could be easily recognized and optimized, as they would contain no operations that use the added states and therefore would only pass them through, i.e., the origin of each state result would be the corresponding λ-region argument.
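The effect of distinct states can be sketched with a small Python model. It is an illustration under simplifying assumptions, not jlm's representation: a function body is a list of `(operation, state kind or None)` pairs, and `thread_states` and `is_pure` are hypothetical helpers. Each side-effecting operation consumes the current state of its kind and produces a new one, so operations of different kinds never end up on the same state chain.

```python
def thread_states(ops, kinds):
    """Thread one state chain per kind through `ops`.

    Returns the state edges as (origin, consumer) pairs and, per kind,
    the origin of the state result of the enclosing lambda-region.
    """
    current = {kind: f"{kind}_arg" for kind in kinds}  # region arguments
    edges = []
    for name, kind in ops:
        if kind is None:
            continue  # no side-effect: the operation does not touch any state
        edges.append((current[kind], name))  # consume the current state
        current[kind] = f"{name}.{kind}"     # produce a new state
    return edges, current


def is_pure(ops, kinds):
    """A function is pure if every state result originates from its argument."""
    _, results = thread_states(ops, kinds)
    return all(results[kind] == f"{kind}_arg" for kind in kinds)


# One shared state: the load is ordered after the loop.
single, _ = thread_states([("loop", "state"), ("load", "state")], ["state"])

# Two distinct states: the loop and the load become independent.
split, _ = thread_states([("loop", "loop"), ("load", "mem")], ["loop", "mem"])
```

With a single state, the load's state input originates from the loop. With separate states, both operations take their state directly from a region argument, which exposes that they can be reordered.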
Destruction
The destruction stage reestablishes control flow by extracting an IPG from an RVSDG as well as generating CFGs from individual λ-regions. Inter-Procedural Control Flow Recovery (Inter-PCFR) creates an IPG from λ-nodes, while Intra-Procedural Control Flow Recovery (Intra-PCFR) extracts control flow from γ- and θ-nodes and generates basic blocks with corresponding operations for primitive nodes. A λ-region without γ- and θ-nodes is trivially transformed into a linear CFG, whereas λ-regions with these nodes require the construction of branches and/or loops. The rest of this section discusses Inter-PCFR in detail. We refrain from an in-depth discussion of Intra-PCFR as it is covered in Bahmann et al. [3].
Inter-Procedural Control Flow Recovery
Inter-PCFR recovers an IPG from an RVSDG. IPG nodes are created for λ-nodes as well as arguments of the ω-region, while IPG edges are inserted to capture the dependencies between λ-nodes. Algorithm IV starts by creating IPG nodes for all arguments of the ω-region, i.e., all external functions. It continues by recursively traversing the region tree, creating IPG nodes for encountered λ-nodes and IPG edges for their dependencies. For the region of every λ-node, it invokes Intra-PCFR to create a CFG.
Intra-Procedural Control Flow Recovery
Bahmann et al. [3] explored two different approaches for CFG generation: Structured Control Flow Recovery (SCFR) and Predicative Control Flow Recovery (PCFR). SCFR uses the region hierarchy within a λ-region to recover control flow, while PCFR generates branches for predicate producers and follows the predicate consumers to the eventual destination. Both schemes reestablish evaluation-equivalent CFGs, but differ in the recoverable control flow. SCFR recovers only control flow that resembles the structural nodes in λ-regions, i.e., control flow equivalent to if-then-else, switch, and do-while statements, while PCFR can recover arbitrarily complex control flow, i.e., control flow that is not restricted to RVSDG constructs. PCFR reduces the number of static branches in the resulting control flow [3], but might also result in undesirable control flow for certain architectures, such as graphics processing units [33]. For the sake of brevity, we omit a discussion of SCFR and PCFR as the algorithms are extensively described by Bahmann et al. [3].
Optimizations
The properties of the RVSDG make it an appealing IR for optimizing compilers. Many optimizations can be expressed as simple graph traversals, where subgraphs are rewritten, nodes are moved between regions, nodes or edges are marked, or edges are diverted. In this section, we present Dead and Common Node Elimination optimizations that exploit the RVSDG's properties to unify traditionally distinct transformations. 
Dead Node Elimination
Dead Node Elimination (DNE) is a combination of dead and unreachable code elimination, and removes all nodes that do not contribute to the result of a computation. Dead nodes are generated by unreachable and dead code from the input program, as well as by other optimizations such as Common Node Elimination. An operation is considered dead code when its results are either not used or used only by other dead operations. Thus, an output of a node is dead if it has no users or all its users are dead. We consider a node to be dead if all its outputs are dead. It follows that a node's inputs are dead if the node itself is dead. We call all inputs, outputs, or nodes that are not dead alive. The implementation of DNE consists of two phases: mark and sweep. The mark phase identifies all outputs and arguments that are alive, while the sweep phase removes all dead entities. The mark phase traverses RVSDG edges according to the rules in Algorithm V. If a structural node is dead, the mark phase skips the traversal of its subregions as well as all of the contained computations, as it never reaches the node in the first place. The mark phase is invoked for all result origins of the ω-region.
The sweep phase performs a simple bottom-up traversal of an RVSDG, recursively processing subregions of structural nodes as long as these nodes are alive. A dead structural node is removed with all its contained computations. The RVSDG's uniform representation of all computations as nodes permits DNE to not only remove simple computations, but also compound computations such as conditionals, loops, or even entire functions. Moreover, its nested structure avoids the processing of entire branches of the region tree if they are dead. Figure 6d shows the RVSDG from Figure 6c after the mark phase. Grey colored entities are dead. The mark phase traverses the graph's edges, marking the γ-node's leftmost output alive. This renders the corresponding result origins of the γ-regions alive, then the leftmost output of the θ-node, and so forth. After the mark phase has annotated all outputs and arguments as alive, the sweep phase removes all dead entities.
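The two phases can be sketched for a single region that contains only simple nodes. This is a simplified illustration and not Algorithm V: structural nodes and their subregions are left out, and the graph encoding is an assumption. `nodes` maps a node identifier to the list of its input origins, and an origin is either `("arg", k)` for a region argument or `(node_id, output_index)` for a node output. The names `mark` and `sweep` merely mirror the phase names.

```python
def mark(nodes, results):
    """Return the set of alive origins reachable from the region results."""
    alive = set()
    worklist = list(results)
    while worklist:
        origin = worklist.pop()
        if origin in alive:
            continue
        alive.add(origin)
        producer = origin[0]
        if producer != "arg":
            # A node with an alive output is alive, so its inputs are alive too.
            worklist.extend(nodes[producer])
    return alive


def sweep(nodes, alive):
    """Keep only nodes that have at least one alive output."""
    alive_nodes = {origin[0] for origin in alive if origin[0] != "arg"}
    return {n: inputs for n, inputs in nodes.items() if n in alive_nodes}


# Hypothetical region: mul2 only feeds a node whose output is never used.
region = {
    "mul1": [("arg", 0), ("arg", 1)],
    "mul2": [("arg", 0), ("arg", 1)],
    "add": [("mul1", 0), ("mul1", 0)],
    "unused": [("mul2", 0)],
}
alive = mark(region, results=[("add", 0)])
remaining = sweep(region, alive)
```

Because marking starts at the results and follows edges backwards, `unused` and `mul2` are never reached and are removed together, which matches the rule that an output whose users are all dead is itself dead.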
Common Node Elimination
Common Node Elimination (CNE) permits the removal of redundant computations by detecting congruent nodes. These nodes always produce the same results, enabling the redirection of their result edges to a single node. This renders the other nodes dead, permitting DNE to remove them. CNE is similar to common subexpression elimination and value numbering [2] in that it detects equivalent computations, but since the RVSDG represents all computations uniformly as nodes, it can be extended to conditionals [35], loops, and functions. We consider two simple nodes $n_1$ and $n_2$ congruent, or $n_1 \cong n_2$, if they represent the same computation, have the same number of inputs, i.e., $|I_{n_1}| = |I_{n_2}|$, and the inputs $i^k_{n_1}$ and $i^k_{n_2}$ are congruent for every input index $k$.
Two inputs are congruent if their respective origins $g^k_{n_1}$ and $g^k_{n_2}$ are congruent, i.e., $g^k_{n_1} \cong g^k_{n_2}$. By definition, the origins of inputs are either outputs of simple or structural nodes, or arguments of regions. Origins from simple nodes are only equivalent when their respective producers are computationally equivalent, whereas for the other cases, it must be guaranteed that they always receive the same value.
The implementation of CNE consists of two phases: mark and divert. The mark phase identifies congruent simple nodes, while the divert phase diverts all edges of their origins to a single node, rendering all other nodes dead. Both phases of Algorithm VI perform a simple top-down traversal, recursively processing subregions of structural nodes and annotating inputs, outputs, arguments, and results, as well as simple nodes, as congruent. For γ-nodes, the algorithm marks only computations within a single region as congruent and performs no analysis between regions. In the case of θ-nodes, computations are only congruent when they are congruent before and after the loop execution, i.e., the inputs and results of two loop variables must be congruent. Figure 6b shows the RVSDG for the code in Figure 6a, and Figure 6c the RVSDG after CNE. Two of the four multiplications take the same inputs and are therefore congruent to each other. Thus, their result edges are redirected and they become dead. DNE can then remove both multiplications as shown in Figure 6d.
For simple nodes, the algorithm marks all nodes within a region that are congruent to a node n. In order to avoid costly traversals of all nodes for every node n, the mark phase takes the candidates from the users of the origin of n's first input. If there is another input from a simple node n′ with the same operation and number of inputs among them, the other inputs from both nodes can be compared for congruence. Moreover, a region must store constant nodes, i.e., nodes without inputs, separately from other nodes so that the candidate nodes for constants are available. For commutative simple nodes, the inputs should be sorted before their comparison.
Algorithm VI (excerpt)
Mark phase:
-θ-node: [...] Recursively process the θ-region.
-λ-node: For all context variables $cv_1, cv_2 \in CV$ where $i_{cv_1} \cong i_{cv_2}$, mark $a_{cv_1} \cong a_{cv_2}$. Recursively process the λ-region.
-φ-node: For all context variables $cv_1, cv_2 \in CV$ where $i_{cv_1} \cong i_{cv_2}$, mark $a_{cv_1} \cong a_{cv_2}$. Recursively process the φ-region.
-ω-node: Recursively process the ω-region.
Divert phase:
-θ-node: [...] to $a_{lv_1}$ and from $o_{lv_2}$ to $o_{lv_1}$. Recursively process the θ-region.
-λ-node: For all context variables $cv_1, cv_2 \in CV$ where $i_{cv_1} \cong i_{cv_2}$, divert all edges from $a_{cv_2}$ to $a_{cv_1}$. Recursively process the λ-region.
-φ-node: For all context variables $cv_1, cv_2 \in CV$ where $i_{cv_1} \cong i_{cv_2}$, divert all edges from $a_{cv_2}$ to $a_{cv_1}$. Recursively process the φ-region.
-ω-node: Recursively process the ω-region.
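Congruence detection for simple nodes within one region can be sketched as follows. Note that this sketch swaps in a hash table keyed on the operation and the representatives of the input origins, instead of the candidate search through the users of the first input's origin described above. The encoding is an assumption as well: nodes have a single output, are given in topological order as `(node_id, operation, [input origins])`, and origins are node identifiers or argument names. `COMMUTATIVE`, `mark_congruent`, and `divert` are hypothetical names.

```python
COMMUTATIVE = {"add", "mul"}  # assumed set of commutative operations


def mark_congruent(nodes):
    """Map every node to the representative of its congruence class."""
    representative = {}
    table = {}
    for node_id, op, inputs in nodes:
        # Compare inputs by the representatives of their origins.
        key_inputs = [representative.get(origin, origin) for origin in inputs]
        if op in COMMUTATIVE:
            key_inputs.sort()  # operand order is irrelevant for these operations
        key = (op, tuple(key_inputs))
        # The first node seen with this key becomes the representative.
        representative[node_id] = table.setdefault(key, node_id)
    return representative


def divert(nodes, representative):
    """Redirect all input edges to the representative of their origin."""
    return [(node_id, op, [representative.get(o, o) for o in inputs])
            for node_id, op, inputs in nodes]


# Hypothetical region: m2 recomputes m1 with swapped operands.
region = [
    ("m1", "mul", ["a", "b"]),
    ("m2", "mul", ["b", "a"]),
    ("s", "add", ["m1", "m2"]),
    ("d", "sub", ["m2", "a"]),
]
rep = mark_congruent(region)
diverted = divert(region, rep)
```

After diverting, no node uses `m2` any more, so a subsequent DNE pass can remove it. Constant nodes fit the same scheme if the constant's value is part of the operation name, since their input tuple is empty.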
The presented algorithm only detects simple nodes as congruent within a region. For γ-nodes, congruence can also exist between nodes of different γ-regions and extending the algorithm would eliminate these redundancies. Another extension would be to permit congruence detection for structural nodes to implement conditional fusion [35] and loop fusion [25] . In the case of γ-nodes, it is sufficient to ensure that two nodes have congruent predicates, whereas for θ-nodes it would be necessary to permit congruence detection between different θ-regions to ensure that their predicates are the same.
Implementation and Evaluation
This section's goal is to demonstrate that the RVSDG has no inherent impediment that prevents it from producing competitive code and that it can serve as the IR in a compiler's optimization stage. The goal is not to outperform mature compilers, such as LLVM or GCC. This would require a significant engineering effort, which is outside the scope of this article. In light of this goal, we evaluate the RVSDG in terms of performance and size of produced code, as well as compilation time and representational overhead.
Implementation
We have implemented jlm, a publicly available [32] prototype compiler that uses the RVSDG for optimizations. Its compilation pipeline is outlined in Figure 7a. Jlm takes LLVM IR as input, constructs an RVSDG, transforms and optimizes this RVSDG, and destructs it again to LLVM IR. The SSA form of the input is destructed before RVSDG construction proceeds with Inter- and Intra-PT. This additional step is required due to the control flow restructuring phase of Intra-PT. Destruction discovers control [...]
-Node Reduction (RED): Performs simplifications, such as constant folding or strength reduction, similarly to LLVM's redundant instruction combinator (-instcombine), albeit far fewer of them.
-Loop Unrolling (URL): Unrolls all inner loops by a factor of four. Higher factors gave no significant performance improvements in return for the increased code size.
-θ-γ Inversion (IVT): Inverts γ- and θ-nodes where both nodes have the same predicate origin. This replaces a loop containing a conditional with a conditional that has a loop in its then-case.
We use the following optimization order: [...]
Evaluation Setup
Figure 7b outlines our evaluation setup. We use clang 7.0.1 [8] to convert C files to LLVM IR, pre-optimize the IR with LLVM's opt, and then optimize it either with jlm, or opt using different optimization levels. The optimized output is converted to an object file with LLVM's llc. The pre-optimization step is necessary to avoid a re-implementation of LLVM's mem2reg pass, since clang allocates all values on the stack by default. We use the polybench 4.2.1 beta benchmark suite [30] to evaluate the RVSDG's usability and efficacy. This benchmark suite provides structurally small benchmarks, and therefore reduces the implementation effort for the construction and destruction phases, as well as the number and complexity of optimizations.
The experiments are performed on an Intel Xeon E5-2695v4 running CentOS 7.4. The core frequency is pinned to 2.0 GHz to avoid performance variations and thermal throttling effects. All outputs of the benchmark runs are verified to equal the corresponding outputs of the executables produced by clang.
Performance
Figure 8 shows the speedup at five different optimization levels. The O0 optimization level serves as baseline. The O3-no-vec optimization level is the same as O3, but without slp- and loop-vectorization. Optimization level O3-no-vec-stripped is the same as O3-no-vec, but the IR is stripped of named metadata and attribute groups before invoking llc. Since jlm does not support metadata and attributes yet, this optimization level permits us to compare the pure optimized IR against jlm without the optimizer providing hints to llc. We omit optimization level O2 as it was very similar to O3. The gmean column in Figure 8 shows the geometric mean of all benchmarks.
The results show that the executables produced by jlm (gmean 2.58) are faster than O1 (gmean 2.49), but slower than O3 (gmean 3.21), O3-no-vec (gmean 2.95), and O3-no-vec-stripped (gmean 2.92). Optimization level O3 tries to vectorize twenty benchmarks, but only produces measurable improvements for eight of them, namely atax, durbin, fdtd-2d, gemm, gemver, heat-3d, jacobi-1d, and jacobi-2d. Jlm would require a vectorizer to achieve similar speedups.
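For reference, the geometric mean used in the gmean column, and the relative standing of jlm that follows from the reported values, can be reproduced with a few lines of Python. The helper `gmean` is a hypothetical name, and only the aggregate gmean values quoted above are used, since per-benchmark speedups are not listed in the text.

```python
import math


def gmean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Ratios of the reported geometric-mean speedups (values taken from the text).
jlm_vs_o1 = 2.58 / 2.49  # slightly above 1: jlm is faster than O1 on average
jlm_vs_o3 = 2.58 / 3.21  # roughly 0.80: jlm reaches about 80% of O3's speedup
```

The geometric mean is the customary choice for averaging speedup ratios, because it treats a slowdown by a factor of two and a speedup by a factor of two symmetrically.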
Disabling vectorization with O3-no-vec and O3-no-vec-stripped shows that jlm achieves similar speedups for fdtd-2d, gemm, heat-3d, jacobi-1d, and jacobi-2d. The metadata transferred between the optimizer and llc only makes a significant difference for durbin, floyd-warshall, gesummv, jacobi-1d, and nussinov. In the case of gesummv and jacobi-1d, performance drops below jlm. Jlm is outperformed by optimization level O1 at six benchmarks: adi, durbin, floyd-warshall, nussinov, seidel-2d, and syrk. We inspected the output files and found the following causes:
adi: Jlm fails to eliminate load instructions from the two innermost loops. These loads have loop-carried dependences with a distance of one to store instructions in the same loop, and can be eliminated by propagating the stored value to the users of the load's output. The LLVM pass that performs this optimization is loop load elimination (-loop-load-elim). If this transformation is performed by hand on the two loops, then jlm achieves the same performance as O1.
durbin: Jlm fails to transform a loop that copies values between arrays to a memcpy intrinsic. This prevents LLVM's code generator from producing better code. The LLVM pass responsible for this transformation is the loop-idiom pass (-loop-idiom). If the loop is replaced with a call to memcpy, then jlm is better than O1.
floyd-warshall: Jlm fails to move instructions out of the innermost loop due to loads and stores impeding their hoisting. Currently, all loads and stores are sequentialized using a single state. This has the consequence that invariant loads/stores might not appear as invariant due to their state edge originating from another non-invariant load or store. This in turn also renders other instructions non-hoistable as they might be dependent on one of these non-hoistable loads. An alias analysis pass would resolve this problem as it would render loop-invariant loads/stores independent of non-invariant ones.
nussinov: Similarly to floyd-warshall, the overly strict sequentialization of load and store instructions impedes further optimizations. In this case, it is the application of CNE. Loads from the same address are not detected as congruent due to different state edge origins. Again, this has a cascading effect on other instructions, and an alias analysis pass would resolve this problem.
seidel-2d: Similarly to adi, jlm fails to eliminate load instructions from the innermost loop. If the load elimination is performed by hand, then jlm achieves the same performance as O1.
syrk: Similarly to nussinov, jlm fails to satisfactorily apply CNE due to an overly strict sequentialization of load and store instructions.
Figure 8 shows that it is feasible to produce competitive code using the RVSDG, but also that more optimizations and analyses are required in order to reliably do so. The differences in performance are not due to inherent characteristics of the RVSDG, but can be attributed to missing analyses, optimizations, and heuristics for their application. Specifically, jlm requires more complex analyses as well as more optimizations exploiting the results of these analyses in order to compete with mature compilers on more complex benchmarks. In particular, an alias analysis pass is required, as the results above and the number of LLVM pass invocations from Table 1 indicate.
Code Size
Figure 9 shows the code size for O3, O3-no-vec, Os, and for jlm with and without loop unrolling. The amean column shows the arithmetic mean of all benchmarks.
Optimization level O3 produces on average text sections that are 11% bigger than O3-no-vec. Vectorization often requires loop transformations, such as loop unrolling, to make loops amenable to the vectorizer, and the insertion of pre- and post-loop code. This affects code size negatively, but can result in better performance. The results also show that Os consistently produces smaller text sections than O3-no-vec. This is due to more conservative optimization heuristics and the omission of other optimizations, e.g., aggressive instruction combination (-aggressive-instcombine) or the promotion of by-reference arguments to scalars (-argpromotion).
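The code-size cost of loop unrolling and post-loop code can be illustrated with a minimal Python sketch. The function names, the unroll factor of four, and the use of CPython bytecode length as a rough stand-in for text-section size are assumptions made here for illustration; they are not taken from the evaluated compilers.

```python
def sum_rolled(values):
    """Reference loop: one copy of the loop body."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_four(values):
    """Same computation, manually unrolled by a factor of four (illustrative choice)."""
    total = 0
    n = len(values)
    main_end = n - (n % 4)  # largest multiple of four not exceeding n
    # Main loop: four copies of the original body per iteration.
    for i in range(0, main_end, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    # Post-loop code: handles the up-to-three leftover elements.
    for i in range(main_end, n):
        total += values[i]
    return total


def bytecode_size(func):
    """Length of the compiled function body in bytes.

    Assumption: used only as a rough proxy for machine-code size.
    """
    return len(func.__code__.co_code)
```

Both functions compute the same sum, but the unrolled variant carries four copies of the loop body plus a remainder loop, so `bytecode_size(sum_unrolled_by_four)` is larger than `bytecode_size(sum_rolled)`. A size-aware heuristic would weigh this growth against the expected reduction in loop overhead.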
In comparison to Os, jlm produces ca. 39% bigger text sections. The experiments without loop unrolling show that this can be attributed to the naive heuristic used for this optimization. Jlm does not take code size into account and unrolls every inner loop unconditionally four times, leading to excessive code expansion. Avoiding unrolling completely results in text sections that are on average between O3-no-vec and Os. This indicates that the excessive code size is due to naive heuristics and shortcomings in the implementation, but not to inherent characteristics of the RVSDG.
Compilation Overhead
Figure 10 shows the overhead in terms of IR size and time for the RVSDG. Figure 10a shows the representational overhead by relating the number of instructions in the LLVM module to the number of RVSDG nodes after construction, whereas Figure 10b relates the number of instructions in the LLVM module to the time spent on RVSDG construction and optimizations. Figure 10a shows a clear linear relationship for all cases, confirming the observations by Bahmann et al. [3] that the RVSDG is feasible in terms of space requirements. Figure 10b also indicates a linear dependency, but with larger variations for similar input sizes. This variation can be attributed to the fact that construction and optimizations are also compounded by input structure. Structural differences in the inter-procedure and control flow graphs lead to runtime variations in RVSDG construction and different runtimes for optimizations. For example, the presence of loops in a translation unit determines whether loop unrolling is performed, while their absence avoids the runtime overhead for this optimization completely. Overall, Figure 10 suggests that the RVSDG is feasible as an IR for optimizing compilers in terms of compilation overhead.
Conclusion
This paper presents a complete specification for representing entire programs in the RVSDG IR for an optimizing compiler. We provide construction and destruction algorithms, and show the RVSDG's efficacy as an IR for analyses and optimizations by presenting Dead Node and Common Node Elimination. We implemented jlm, a publicly available [32] compiler that uses the RVSDG for optimizations, and evaluate it in terms of performance, code size, compilation time, and representational overhead. The results suggest that the RVSDG combines the abstractions of data centric IRs with the CFG's advantages to optimize and generate efficient control flow. This makes the RVSDG an appealing IR for optimizing compilers. A natural direction for future work is to explore how features such as exceptions can be efficiently mapped to the RVSDG. Another research direction would be to extend the number of optimizations and their heuristics in jlm to a competitive level with CFG-based compilers. This would provide further information about the number of necessary optimizations, their complexity, and consequently the required engineering effort.
