In an optimizing compiler, register allocation remains a crucial phase because it reduces the spill code that degrades performance.
Introduction
With the introduction of instruction-level parallelism (ILP), the classical register allocation techniques designed for sequential code semantics are no longer adequate. The old graph coloring techniques must therefore be reconsidered if they are to be efficient in optimizing compilers for modern architectures. In [5], the authors showed that there is a phase ordering problem between the old register allocation techniques and ILP instruction scheduling: if a classical register allocation is done early, the false dependences it introduces inhibit subsequent ILP extraction. However, this conclusion does not prevent a compiler from effectively performing an early register allocation, provided that the allocator is sensitive to the scheduler, as done in [10, 6].
Other studies [1, 9, 5] claim that it is better to combine instruction scheduling and register allocation in a single complex pass, arguing that applying each method separately has a negative influence on the efficiency of the other. However, this phase ordering problem arises only if the pass applied first (ILP scheduler or register allocator) is "selfish". Indeed, we can still effectively decouple register constraints from instruction scheduling if enough care is taken. In this paper, we show how register constraints can be treated before scheduling, and we explain why they should be.
The principal reason for handling register constraints before instruction scheduling is our belief that register allocation is a more important optimization issue than code scheduling. This is because code performance is usually far more sensitive to memory accesses than to fine-grain scheduling (the memory gap): a cache miss may prevent the processor from achieving a high dynamic ILP, even if the scheduler has extracted it at compile time. Although one might expect spill code to exhibit high locality, and hence to produce mostly cache hits, we cannot assert this at compile time, since memory access latencies are unpredictable at compile time (we cannot guarantee where the data will be located). The authors of [4] reported that about 66% of application execution time is spent satisfying memory requests. Furthermore, memory requests, even if they are data independent, exhibit a high potential for conflicts because of micro-architectural restrictions and simplifications in the memory disambiguation mechanisms (load/store queues) and a possible banking structure in the cache levels [8]. These possible conflicts may cause severe performance degradation even if enough ILP exists, and even if the data is located in the cache. Of course, our claim that spill code is more damaging holds for architectures where the memory access delay is very long compared to the computation delay. This is the case for almost all high-performance processors.
Another reason for handling register constraints prior to ILP scheduling is that register constraints are much more complex than resource constraints. Scheduling under resource constraints is a performance issue: given a data dependence graph (DDG), we are sure to find at least one valid schedule for any underlying hardware properties (a sequential schedule in the extreme case, i.e., no ILP). However, scheduling a DDG with a limited number of registers is more complex, because we cannot guarantee the existence of at least one schedule. In some cases, we must introduce spill code, and hence we change the problem (the input DDG). Moreover, a combined pass of scheduling and register allocation has an important drawback when not enough registers are available. During scheduling, we may need to insert load-store operations if not enough free registers exist. We cannot guarantee the existence of a valid issue time for these introduced memory accesses in already scheduled code: resource or data dependence constraints may prevent us from finding a valid issue slot. This forces the compiler to iteratively apply scheduling followed by spilling until a solution is reached. Even if the backtracking can be reduced experimentally, as in [15], this iterative aspect adds a high algorithmic complexity factor to a pass that integrates both register allocation and scheduling.
All the above arguments lead us to rethink how register pressure is handled before the scheduling process starts, so that the scheduler is free from register constraints and does not suffer from excessive serialization. For this reason, we presented in [13] our register saturation (RS) concept, which prevents a DAG from producing an excessive number of simultaneously alive values in any valid schedule. Our pre-pass analyzes a DAG (with respect to control flow) to deduce the maximal register need over all schedules. We call this limit the register saturation (RS) because the register need can reach this limit but never exceed it. If RS exceeds the number of available registers, we introduce new arcs to reduce it (see Figure 1). In this paper, we provide exact (optimal) methods for both problems: computing RS and reducing it. After our RS analysis pass, the DAG is free from register constraints and can be sent to the scheduler and the register allocator. Section 2 presents our DAG and processor model, which can be used for most existing ILP architectures (superscalar, VLIW, EPIC/IA64). Computing the optimal RS by integer linear programming (intLP) is described in Section 3. Our intLP formulation uses the linear writing of logical formulas (⟹, ⟺, ∨) and of the max operator (max(x, y)) by introducing extra binary variables, as previously described in [14]. The optimal solution for reducing RS is provided in Section 4. Finally, we conclude with a brief discussion.
DAG and Processor Model
A DAG G = (V, E, δ) in our study represents the data dependences between the operations and any other serial constraints. The DAG is defined by its set of operations V , its set of edges E = {(u, v)/ u, v ∈ V }, and δ such that δ(e) is the latency of the edge e in terms of processor clock cycles. Let n be the number of nodes and m the number of arcs.
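A schedule σ of G is a function that gives an integer execution (issue) time for each operation. It is valid if it respects the latency of every edge:
$$\sigma \text{ is valid} \iff \forall e = (u, v) \in E:\ \sigma(v) - \sigma(u) \ge \delta(e)$$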
The set of all valid acyclic schedules of G is denoted by Σ(G).
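As an illustration, the following Python sketch tests whether a candidate schedule belongs to Σ(G), assuming the usual precedence condition σ(v) − σ(u) ≥ δ(e) for every edge e = (u, v). The function name `is_valid_schedule` and the three-node graph are hypothetical and serve only as an example.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def is_valid_schedule(sigma: Dict[str, int], delta: Dict[Edge, int]) -> bool:
    """Return True iff sigma(v) - sigma(u) >= delta(u, v) for every edge (u, v)."""
    return all(sigma[v] - sigma[u] >= latency for (u, v), latency in delta.items())


# Hypothetical DAG: a -> b (latency 2), a -> c (latency 1), b -> c (latency 1).
delta = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1}

ok = is_valid_schedule({"a": 0, "b": 2, "c": 3}, delta)   # all latencies respected
bad = is_valid_schedule({"a": 0, "b": 1, "c": 3}, delta)  # b issued too early after a
```

Only data dependences are checked here; resource constraints are deliberately left out, as in the model above.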
We consider a target RISC-style architecture with multiple register types, where T denotes the set of register types (for instance, T = {int, float}). Some operations and some precedence constraints of a DAG have more attributes than others, depending on whether or not they refer to values to be stored in registers. $V_{R,t}$ is the set of values to be stored in registers of type $t \in T$. We consider that each operation $u \in V$ writes into at most one register of a type $t \in T$. Operations that define multiple values of different types are accepted in our model, provided that they do not define more than one value of any given type. $E_{R,t}$ is the set of flow dependence edges through a value of type $t \in T$. The set of consumers (readers) of a value $u^t$ is then the set:
$$Cons(u^t) = \{ v \in V \mid (u, v) \in E_{R,t} \}$$
Some values may not be consumed in the considered DAG. In order to model such exit values, we assume that the considered DAG contains a virtual bottom node ⊥ that is the sink of the flow dependences of these exit values. Also, there is a serial arc from every other node of the DAG to this bottom node. The latency of such a virtual arc is equal to the latency of the source operation. The bottom node ⊥ is always the last scheduled node of the DAG.
In order to consider static-issue VLIW and EPIC/IA64 processors, in which the hardware pipeline steps are visible to compilers (we consider dynamically scheduled superscalar processors too), we assume that reading from and writing into a register may be delayed from the beginning of the schedule time, and that these delays are visible to the compiler (architecturally visible). We define two delay (offset) functions $\delta_r$ and $\delta_w$ such that the read cycle of $u^t$ from a register of type t is $\sigma(u) + \delta_r(u)$, and the write cycle of $u^t$ into a register of type t is $\sigma(u) + \delta_w(u)$. For instance, in superscalar and EPIC/IA64 processors, $\delta_r$ and $\delta_w$ are equal to zero.
When a schedule is fixed, we can easily compute how many registers of each type t are needed in order to build a valid register allocation. This is the standard concept of the maximal number of values of type t simultaneously alive, which is also equal to the size of the maximum clique in the interference graph. Recall that two variables are said to be simultaneously alive iff their lifetime intervals interfere, and thus they cannot share the same register. The register requirement (or register need) of type t for a DAG G given a fixed schedule σ is denoted $RN^t_\sigma(G)$.
Computing the Optimal Register Saturation
First of all, if $|V_{R,t}|$, the total number of values of type t, is less than or equal to $R_t$, the number of available registers of type t, then we are sure that no schedule can require more than $|V_{R,t}| \le R_t$ registers. Otherwise, we must analyze the register saturation (RS).
The RS of a register type t for a DAG G is the maximal register need over all valid schedules of this DAG:
$$RS_t(G) = \max_{\sigma \in \Sigma(G)} RN^t_\sigma(G)$$
We proved in [13] that computing this parameter is an NP-complete problem, and we provided a heuristic. Below, we give the set of variables and constraints of an exact intLP for computing $RS_t(G)$. Our intLP formulation uses the linear writing of logical formulas (⟹, ∨, ⟺) and of the max operator (max(x, y)) by introducing extra binary variables, as previously described in [14]. However, this linear writing of logical and max operators requires bounding the domain of the integer variables.
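To make the definition concrete, the following Python sketch computes RS by exhaustive enumeration on a tiny DAG. It is not the intLP formulation of this paper: it is a brute-force stand-in, usable only for a handful of nodes. It assumes a single register type, zero read and write offsets, a bounded schedule horizon, and that every value has at least one consumer; the function names and the example graph are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def register_need(sigma: Dict[str, int], cons: Dict[str, List[str]]) -> int:
    """Maximal number of values simultaneously alive under the schedule sigma.

    A value u is alive on the left-open interval ]sigma[u], kill(u)], where
    kill(u) is the latest read by one of its consumers (offsets assumed zero).
    """
    intervals = [(sigma[u], max(sigma[v] for v in readers))
                 for u, readers in cons.items()]
    if not intervals:
        return 0
    last = max(end for _, end in intervals)
    return max(sum(1 for start, end in intervals if start < c <= end)
               for c in range(last + 1))


def register_saturation(nodes: List[str], delta: Dict[Edge, int],
                        cons: Dict[str, List[str]], horizon: int) -> int:
    """Maximal register need over all valid schedules with issue times <= horizon."""
    best = 0
    for times in product(range(horizon + 1), repeat=len(nodes)):
        sigma = dict(zip(nodes, times))
        if all(sigma[v] - sigma[u] >= lat for (u, v), lat in delta.items()):
            best = max(best, register_need(sigma, cons))
    return best


# Hypothetical DAG with two independent chains: a -> b and c -> d (latency 1 each).
nodes = ["a", "b", "c", "d"]
delta = {("a", "b"): 1, ("c", "d"): 1}
cons = {"a": ["b"], "c": ["d"]}  # value a is read by b, value c is read by d

need_sequential = register_need({"a": 0, "b": 1, "c": 1, "d": 2}, cons)
rs = register_saturation(nodes, delta, cons, horizon=3)
```

In this example a sequential schedule needs a single register, while a schedule that overlaps both chains needs two, so the saturation is two. The enumeration is exponential in the number of nodes, which is consistent with the need for an exact intLP or a heuristic on realistic DAGs.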
Scheduling Variables
For all operations $u \in V$, we define the integer variable $\sigma_u \ge 0$ that holds the schedule time. Note that these schedule variables do not represent the final schedule under resource constraints (which will be computed after our RS pass); they are only intermediate variables of our intLP formulation. The first linear constraints are those that describe precedence relations, so we write into the intLP system:
$$\forall e = (u, v) \in E:\quad \sigma_v - \sigma_u \ge \delta(e)$$
There are $O(n)$ scheduling variables and $O(m)$ linear scheduling constraints. In order to bound the domain of our variables, we define T as a worst possible schedule time. We choose T sufficiently large; for instance, $T = \sum_{e \in E} \delta(e)$ is a suitable worst total schedule time (case of no ILP). Let $\underline{\sigma_u}$ be the length of the longest path from the start node to u, and let $\overline{\sigma_u}$ be deduced from the longest path from u to the sink node in the DAG. We deduce that $\forall u \in V$:
• $\sigma_u \ge \underline{\sigma_u} = LongestPathTo(u)$ is the "as soon as possible" schedule time;
• $\sigma_u \le \overline{\sigma_u} = T - LongestPathFrom(u)$ is the "as late as possible" schedule time according to the worst total schedule time T.
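These two bounds can be obtained with two longest-path computations. The Python sketch below is one possible way to do it, assuming T is the sum of all edge latencies as suggested above; the function name `schedule_bounds` and the example graph are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def schedule_bounds(nodes: List[str],
                    delta: Dict[Edge, int]) -> Dict[str, Tuple[int, int]]:
    """Return the (ASAP, ALAP) issue-time bounds of every node of a DAG."""
    total = sum(delta.values())  # T: worst total schedule time (no ILP)
    preds: Dict[str, List[Tuple[str, int]]] = {u: [] for u in nodes}
    succs: Dict[str, List[Tuple[str, int]]] = {u: [] for u in nodes}
    for (u, v), latency in delta.items():
        succs[u].append((v, latency))
        preds[v].append((u, latency))

    memo_to: Dict[str, int] = {}
    memo_from: Dict[str, int] = {}

    def longest_to(u: str) -> int:
        # Longest path from any source node to u (0 for a source).
        if u not in memo_to:
            memo_to[u] = max((longest_to(p) + lat for p, lat in preds[u]), default=0)
        return memo_to[u]

    def longest_from(u: str) -> int:
        # Longest path from u to any sink node (0 for a sink).
        if u not in memo_from:
            memo_from[u] = max((longest_from(s) + lat for s, lat in succs[u]), default=0)
        return memo_from[u]

    return {u: (longest_to(u), total - longest_from(u)) for u in nodes}


# Hypothetical DAG: a -> b (latency 2), a -> c (latency 1), b -> c (latency 1).
bounds = schedule_bounds(["a", "b", "c"],
                         {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1})
```

Here T = 4, so node a may be issued between cycles 0 and 1, node b between 2 and 3, and node c between 3 and 4. Tighter values of T shrink these domains and therefore the big constants needed when linearizing logical operators.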
Register Need Constraints
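Interference Graph
The lifetime interval of a value $u^t$ of type t is (given a schedule σ):
$$LT_\sigma(u^t) = \left]\, \sigma(u) + \delta_w(u),\ \max_{v \in Cons(u^t)} \big(\sigma(v) + \delta_r(v)\big) \right]$$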
That is, we assume that a value written at instant c in a register is available one step later (the lifetime interval is left open). Thus, if an operation u reads from a register at instant c while another operation v is writing into it at the same time, u does not get v's result but gets the value previously stored in this register. Note that this is a choice and not a limitation of the model. We define for each value $u^t$ the variable $k_{u^t} \ge 0$ that holds its killing date (the last time that this value is read). The number of such variables is $O(n)$. Since our variable domains are bounded (assuming a finite T), we know that $k_{u^t}$ is bounded by the two following finite schedule times:
• $k_{u^t} \ge \underline{k_{u^t}} = \underline{\sigma_u} + \delta_w(u)$ is the first possible definition date of $u^t$;
• $k_{u^t} \le \overline{k_{u^t}} = \max_{v \in Cons(u^t)} \big(\overline{\sigma_v} + \delta_r(v)\big)$ is the latest possible killing date of $u^t$.
We use the linear constraints of the max operator to compute $k_{u^t}$, as explained in [14]. We write into the intLP system:
$$k_{u^t} = \max_{v \in Cons(u^t)} \big(\sigma_v + \delta_r(v)\big)$$
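The total complexity of defining all killing dates for all register types is bounded by $O(n^2)$ variables and $O(n^2)$ constraints. Now, we can consider $H_t$, the undirected interference graph of G for the register type t. For any pair of distinct values $u^t, v^t \in V_{R,t}$, we define a binary variable $s^t_{u,v} \in \{0, 1\}$ that is set to 1 iff the two lifetime intervals of type t interfere: $\forall t \in T$, $\forall$ pair $u^t, v^t \in V_{R,t}$:
$$s^t_{u,v} = \begin{cases} 1 & \text{if } \neg\big(LT_\sigma(u^t) \prec LT_\sigma(v^t) \ \vee\ LT_\sigma(v^t) \prec LT_\sigma(u^t)\big) \\ 0 & \text{otherwise} \end{cases}$$
where ≺ denotes the usual "before" relation between two intervals. The number of variables $s^t_{u,v}$ is the number of combinations of 2 values among $|V_{R,t}|$, i.e., $|V_{R,t}| \times (|V_{R,t}| - 1) / 2$.
Since the lifetime intervals are left open and bounded by the killing dates, these variables are constrained as follows:
$$s^t_{u,v} = 1 \iff \big(k_{u^t} - \delta_w(v) - \sigma_v \ge 1\big) \wedge \big(k_{v^t} - \delta_w(u) - \sigma_u \ge 1\big)$$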
Given three logical expressions (P, Q, S), (P ⟺ (Q ∧ S)) is equivalent to the expression (P ∧ Q ∧ S) ∨ (¬P ∧ ¬Q) ∨ (¬P ∧ ¬S). We write these two disjunctions with linear constraints by introducing binary variables (see [14]) and by computing the finite lower bounds of the linear functions. The complexity of computing all the $s^t_{u,v}$ variables is bounded by $O(n^2)$ binary variables and constraints.
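The logical equivalence used above can be verified mechanically over all eight truth assignments. The short Python check below does only that; it does not reproduce the linear encoding of [14], and the helper names are hypothetical.

```python
from itertools import product


def iff_and(p: bool, q: bool, s: bool) -> bool:
    """P <=> (Q and S)."""
    return p == (q and s)


def disjunctive_form(p: bool, q: bool, s: bool) -> bool:
    """(P and Q and S) or (not P and not Q) or (not P and not S)."""
    return (p and q and s) or (not p and not q) or (not p and not s)


# Compare both forms on every assignment of (P, Q, S).
equivalent = all(iff_and(p, q, s) == disjunctive_form(p, q, s)
                 for p, q, s in product([False, True], repeat=3))
```

The disjunctive form is convenient because each disjunct is a conjunction of literals, and a disjunction of such conjunctions can be selected with extra binary variables once the linear functions involved have finite bounds.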
Maximal Clique in the Interference Graph
The maximum number of values of type t simultaneously alive corresponds to a maximum clique in $H_t = (V_{R,t}, E_t)$, where $(u^t, v^t) \in E_t$ iff their lifetime intervals interfere ($s^t_{u,v} = 1$). For simplicity, rather than considering the interference graph itself, we consider its complementary graph $\overline{H_t} = (V_{R,t}, \overline{E_t})$, where $(u^t, v^t) \in \overline{E_t}$ iff their lifetime intervals do not interfere ($s^t_{u,v} = 0$). Then, the maximum number of values of type t simultaneously alive corresponds to a maximum independent set in $\overline{H_t}$.
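To write the constraints that describe independent sets (IS), we define a binary variable $x_{u^t} \in \{0, 1\}$ for each value $u^t \in V_{R,t}$ such that $x_{u^t} = 1$ iff $u^t$ belongs to the selected IS of $\overline{H_t}$. We express in the model the following linear constraints: $\forall t \in T$, $\forall$ pair $u^t, v^t \in V_{R,t}$:
$$s^t_{u,v} = 0 \implies x_{u^t} + x_{v^t} \le 1$$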
This constraint means that if two nodes u and v are connected in the complementary graph $\overline{H_t}$, then at most one of them may belong to an IS. The number of variables $x_{u^t}$ is $O(n)$. The number of binary variables introduced to express all the implications is bounded by $O(n^2)$. The number of linear constraints that define the IS is bounded by $O(n^2)$.
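The correspondence between a maximum clique of the interference graph and a maximum independent set of its complement can be illustrated on a small fixed graph. The Python sketch below uses plain enumeration instead of the intLP constraints, and it assumes the interference relation is already known, whereas in the formulation above it depends on the schedule variables. The graph and function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Pair = Tuple[str, str]


def complement(vertices: List[str], edges: List[Pair]) -> List[Pair]:
    """Edges of the complementary graph: pairs of distinct vertices not in edges."""
    edge_set = {frozenset(e) for e in edges}
    return [pair for pair in combinations(vertices, 2)
            if frozenset(pair) not in edge_set]


def max_independent_set_size(vertices: List[str], edges: List[Pair]) -> int:
    """Size of a largest vertex subset containing no edge (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return size
    return 0


# Hypothetical interference graph: a, b, c interfere pairwise, and c interferes with d.
values = ["a", "b", "c", "d"]
interference = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

non_interference = complement(values, interference)
register_need = max_independent_set_size(values, non_interference)
```

The complement contains only the pairs (a, d) and (b, d), so {a, b, c} is independent in it and the register need of this interference graph is three.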
Linear Function of Register Need
The register requirement of type t is the size of a maximum IS in $\overline{H_t}$, i.e., the maximal value of $\sum_{u^t \in V_{R,t}} x_{u^t}$. Thus, the register saturation of type t is computed by:
$$RS_t(G) = \text{Maximize} \sum_{u^t \in V_{R,t}} x_{u^t}$$
The total number of integer variables in our whole intLP is bounded by $O(|V|^2)$, and the total number of constraints is at most $O(m + n^2)$. Note that our intLP formulation may be optimized by considering that:
• an edge e = (u, v) in the initial DAG is redundant for the scheduling constraints and can be safely ignored if lp(u, v) > δ(e) where lp(u, v) denotes the longest path from u to v (with the condition that this arc doesn't belong to this longest path);
• two values $u^t, v^t \in V_{R,t}$ can never be simultaneously alive iff, for all possible schedules, one value is always defined after the killing date of the other. This is the case if either of the two following conditions is satisfied:
Once the optimal RS is computed, the next section shows how to reduce it if it exceeds a limit.
Optimal Register Saturation Reduction
When the register saturation $RS_t(G)$ exceeds the number of available registers $R_t$ of type t, we must add extra serial arcs to the DAG G to bring $RS_t(G)$ below this limit. The added arcs must preserve as much ILP as possible by taking the critical path into account. We denote by $\overline{E}$ the set of extra edges that we add to G to build a new extended DAG, namely $\overline{G} = G \backslash \overline{E}$, such that $RS_t(\overline{G}) \le R_t$. We want to solve the formal problem stated below.
Definition 4.1 (ReduceRS Problem)
Let G = (V, E, δ) be a DAG. Let $R_t$ and P be two positive integers. Does there exist an extended DDG $\overline{G} = G \backslash \overline{E}$ of G such that:
$$RS_t(\overline{G}) \le R_t \ \wedge\ CriticalPath(\overline{G}) \le P$$
Theorem 4.1 ReduceRS problem is NP-hard.

Proof :
We prove that ReduceRS problem reduces from the problem of scheduling under register constraints. Let us start by defining the latter problem. For the sake of clarity of this proof, we assume that the considered register type t is implicit (we do not include t in our notations inside this proof).
Definition 4.2 (SRC problem)
Let G = (V, E, δ) be a DAG, R be a positive integer, and P be a length. Does there exist a valid schedule σ ∈ Σ(G) such that:
$$RN_\sigma(G) \le R \ \text{ and total schedule time} \le P$$
The SRC problem has been proved NP-hard in [3]. Now we prove that the ReduceRS and SRC problems are equivalent in terms of computational complexity.
1. ReduceRS ⟹ SRC
Let $\overline{G}$ be a solution for the ReduceRS problem. Then trivially, any as-soon-as-possible schedule $\sigma \in \Sigma(\overline{G})$ is a solution for SRC.
2. SRC ⟹ ReduceRS
Let σ be a solution for SRC, i.e., $RN_\sigma(G) \le R$ and the total schedule time is ≤ P. We build an extended DDG $\overline{G}$ by adding serial arcs that force the value lifetimes of any schedule of $\overline{G}$ to have the same precedence relations as those defined by σ. For all $u, v \in V_R$ such that $LT_\sigma(u) \prec LT_\sigma(v)$, we add the following arcs:
• if v ∈ Cons(u), then add serial arcs from the other readers of u (except v) to v; the set of added arcs is $\{(u', v) \mid u' \in Cons(u) - \{v\}\}$;
• else, add serial arcs from all readers of u to v; the set of added arcs is $\{(u', v) \mid u' \in Cons(u)\}$.
The latency of these added arcs has to be chosen depending on the target codes. We have two cases:
1. in the case of superscalar codes, the semantics is sequential, so the latency of each added arc is set to 1;
2. in the case of VLIW or EPIC/IA64, there exist reading and writing offsets. Thus, for each added arc e = (u′, v), the latency is set to $\delta(e) = \delta_r(u') - \delta_w(v)$.
Indeed, the added arcs and the chosen latencies force the following assertion:
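Then, for all values that are not simultaneously alive according to σ, there is no schedule σ′ of $\overline{G}$ that makes them simultaneously alive. Formally, it is written:
$$\forall \sigma' \in \Sigma(\overline{G}),\ \forall u, v \in V_R:\quad LT_\sigma(u) \prec LT_\sigma(v) \implies LT_{\sigma'}(u) \prec LT_{\sigma'}(v)$$
In other words, we ensure that any schedule of $\overline{G}$ preserves the precedence relations between the lifetime intervals defined by σ. Consequently, no schedule σ′ of $\overline{G}$ can need more registers than σ does, and therefore $RS(\overline{G}) \le RN_\sigma(G) \le R$.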
A solution for the SRC problem may create a circuit in the solution of ReduceRS. We are sure that if any circuit is introduced in $\overline{G}$, then it must be non-positive, because there exists at least the valid schedule $\sigma \in \Sigma(\overline{G})$. Consequently, a solution of the ReduceRS problem may produce a cyclic DDG. We will see later how to eliminate these solutions.
With regard to the critical path of $\overline{G}$, the introduced serial arcs ensure that at least $\sigma \in \Sigma(\overline{G})$. Since there exists such a schedule with a total time ≤ P, the critical path of $\overline{G}$ cannot be longer than P.
The proof of Theorem 4.1 gives the intuition for an optimal solution of the ReduceRS problem using integer programming. It is computed in two steps:
1. we first compute a valid schedule σ such that the register need of type t is maximized without exceeding $R_t$, while the total schedule time is bounded. Again, this schedule is different from the final one, which is computed later under resource constraints;
2. then, we add serial arcs as described in the proof of Theorem 4.1. This results in an extended DDG that has a bounded register saturation with a minimized critical path.
To compute such a schedule, we use the intLP formulation defined in Section 3, which maximizes the register need. We keep all the constraints and variables of Section 3, except those that compute a maximal independent set. The intLP system tries to build a coloring of the interference graph with exactly $R_t$ colors (the maximal number of available registers). Now, we use a binary variable $x^i_{u^t}$ that is set to 1 if the value $u^t$ is assigned the color i. Since there are $R_t$ available registers (colors), we have at most $|V| \times R_t$ such variables. Since $R_t$ is a constant in our problem (the number of registers in the target machine), the number of these variables is $O(|V|)$.
If no solution can be found with $R_t$ colors, then we solve another intLP after decrementing $R_t$ (down to 1). If no solution can be found even with a single color, then the register saturation cannot be reduced and spilling is unavoidable. The variables $x^i_{u^t}$ are computed using the following constraints.
• a value $u^t$ is assigned to only one register (color) of type t: $\forall t \in T$, $\forall u^t \in V_{R,t}$: $\sum_{i=1}^{R_t} x^i_{u^t} = 1$;
• if two values interfere, then they cannot share the same color: $\forall t \in T$, $\forall$ pair $u^t, v^t \in V_{R,t}$, $\forall i \in \{1, \dots, R_t\}$: $s^t_{u,v} = 1 \implies x^i_{u^t} + x^i_{v^t} \le 1$;
• The objective function minimizes the total schedule time: Minimize $\sigma_\perp$. Alternatively, we can remove the objective function and bound the total schedule time by writing: $\sigma_\perp \le P$.
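The two coloring constraints can be illustrated with a small backtracking search on a fixed interference graph. This Python sketch is not the intLP itself: in the formulation above the interference relation depends on the schedule variables, whereas here it is assumed to be given. The function name `find_coloring` and the example graph are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]


def find_coloring(values: List[str], interference: List[Pair],
                  num_colors: int) -> Optional[Dict[str, int]]:
    """Assign one color per value so that interfering values get different colors.

    Returns a value -> color mapping, or None if num_colors is not enough.
    """
    neighbors: Dict[str, Set[str]] = {v: set() for v in values}
    for u, v in interference:
        neighbors[u].add(v)
        neighbors[v].add(u)

    coloring: Dict[str, int] = {}

    def assign(index: int) -> bool:
        if index == len(values):
            return True
        value = values[index]
        for color in range(num_colors):
            # A color is allowed only if no interfering value already holds it.
            if all(coloring.get(n) != color for n in neighbors[value]):
                coloring[value] = color
                if assign(index + 1):
                    return True
                del coloring[value]
        return False

    return dict(coloring) if assign(0) else None


# Hypothetical interference graph: a, b, c interfere pairwise, and c interferes with d.
values = ["a", "b", "c", "d"]
interference = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

with_three = find_coloring(values, interference, 3)
with_two = find_coloring(values, interference, 2)
```

With three registers a coloring exists, while two registers are not enough because of the clique {a, b, c}. In the intLP, the solver would instead change the schedule so that fewer lifetimes interfere.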
As explained before, our DAG and processor model includes writing and reading offsets. Consequently, in some cases, the optimal RS reduction may need to introduce non-positive circuits into the original DAG. Even if such non-positive circuits do not prevent the graph from being scheduled, they violate the DAG property and impose hard scheduling constraints that may not be satisfiable under resource constraints in the subsequent instruction scheduling pass. We must eliminate such optimal solutions, as explained in the following section.
Eliminating Circuits with Non-positive Latencies
Recall that the purpose of the register saturation analysis is to ensure, in the first steps of compilation, that no schedule of a given DAG requires more registers than those available. The scheduling phase is mainly constrained by the resources (functional units) of the target architecture. If the extended DDG produced by the register saturation reduction contains a non-positive circuit, we cannot guarantee the existence of a schedule under resource constraints. This is because non-positive circuits introduce scheduling constraints of the type "not later than", which may not be satisfiable in the presence of resource constraints.
For instance, let us assume a zero-weighted circuit between two operations u and v. Theoretically, any schedule such that σ(u) = σ(v) satisfies this zero-weighted circuit. However, if a resource constraint makes the two operations conflict when they are scheduled at the same issue time, then there is no valid schedule that meets these constraints. When we reduce the register saturation, we must ensure that there is always a schedule, whatever the resource constraints.
Note that this problem does not arise for superscalar (sequential) codes, because all the introduced edges have a positive latency equal to 1. It may therefore only arise for VLIW and EPIC/IA64 codes. To eliminate this problem, we must restrict the extended graph $\overline{G}$ to remain a DAG. This is done by guaranteeing the existence of a topological sort for this graph. For this purpose, we add some variables and constraints to the optimal intLP system.
• We define integer variables that hold a topological sort of the graph. For each $u \in V$, we associate an integer variable $d_u$.
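The role of the variables $d_u$ can be illustrated outside the intLP: a graph is a DAG exactly when its nodes can be numbered so that every arc goes from a smaller number to a larger one. The Python sketch below computes such a numbering with Kahn's algorithm, which is a standard technique swapped in here for illustration and not the constraint system of this paper. The function name and the example arcs are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Arc = Tuple[str, str]


def topological_numbers(nodes: List[str], arcs: List[Arc]) -> Optional[Dict[str, int]]:
    """Return d with d[u] < d[v] for every arc (u, v), or None if a circuit exists."""
    indegree = {u: 0 for u in nodes}
    succs: Dict[str, List[str]] = {u: [] for u in nodes}
    for u, v in arcs:
        succs[u].append(v)
        indegree[v] += 1

    ready = deque(u for u in nodes if indegree[u] == 0)
    d: Dict[str, int] = {}
    while ready:
        u = ready.popleft()
        d[u] = len(d)  # next free position in the topological order
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # Nodes lying on a circuit never reach indegree 0 and stay unnumbered.
    return d if len(d) == len(nodes) else None


nodes = ["a", "b", "c"]
original_arcs = [("a", "b"), ("b", "c")]

dag_order = topological_numbers(nodes, original_arcs)
# Hypothetical extra serial arc that closes a circuit, regardless of its latency.
cyclic_order = topological_numbers(nodes, original_arcs + [("c", "a")])
```

The check ignores latencies on purpose: a circuit is rejected even when it is non-positive and therefore schedulable without resource constraints.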
[13] enables us to give nearly optimal heuristics. In the presence of branches, the global RS of an acyclic CFG is brought back to RS in DAGs (basic blocks) by inserting entry and exit values with the corresponding flow arcs (see [14]). If RS exceeds the number of available registers, we must reduce it while minimizing the increase of the critical path. This is an NP-hard problem. An optimal exact RS reduction method based on integer programming is presented. If we assume writing offsets (VLIW and EPIC codes), some optimal solutions may require inserting non-positive circuits into the original DAG. These circuits may prevent the extended DDG from being scheduled in the presence of resource constraints. A necessary and sufficient condition to overcome this problem is to guarantee the existence of a topological sort for the extended graph. This is done by adding new constraints to the intLP formulation. We have also demonstrated that our initial algorithmic heuristics for RS reduction are very efficient compared to the optimal solutions.
The literature contains many techniques for minimizing the register requirement in superscalar (sequential) codes that are sensitive to ILP scheduling [11, 10, 6]. Others prefer to combine ILP scheduling with register allocation [12, 1, 2, 5, 7]. All these techniques try to minimize the register requirement, whereas the basic problem is not to reduce the register need but to avoid exceeding a limit. Consequently, minimizing the register requirement is inherently a worse technique than saturating it, for several reasons. First, if the register saturation is lower than the number of available registers, we do not add any arc; this is not the case for the existing techniques, which may introduce unnecessary arcs even if enough registers exist. Second, if the register requirement may exceed the limit, we introduce a minimized number of arcs by reducing the register saturation below that limit, instead of minimizing the register need to the lowest possible level. Third and last, our framework is more general, since it admits multiple register types and takes into account VLIW, EPIC and superscalar codes.
