Abstract. Register constraints can be taken into account during the scheduling phase of an acyclic data dependence graph (DAG): any schedule must minimize the register requirement. In this work, we mathematically study and extend the approach which consists of computing the exact upper bound of the register need for all the valid schedules, independently of the functional unit constraints. A previous work (URSA) was presented in [5, 4]. Its aim was to add some serial arcs to the original DAG such that the worst register need does not exceed the number of available registers. We write an appropriate mathematical formalism for this problem and extend the DAG model to take into account delayed reads from and writes into registers with multiple register types. This formulation permits us to provide in this paper better heuristics and strategies (nearly optimal), and we prove that the URSA technique is not sufficient to compute the maximal register requirement, even if its solution is optimal.
Introduction and Motivation
In Instruction Level Parallelism (ILP) compilers, code scheduling and register allocation are two major tasks for code optimization. Code scheduling consists of maximizing the exploitation of the ILP offered by the code. One factor that inhibits such use is the register constraints. A limited number of registers prohibits an unbounded number of values from being simultaneously alive. If register allocation is carried out before scheduling, false dependencies are introduced because of register reuse, producing a negative impact on the available ILP exposed to the scheduler. If scheduling is carried out first, spill code might be introduced because of an insufficient number of registers. A better approach is to make code scheduling and register allocation interact with each other in a complex combined pass, making the register allocation and the scheduling heuristics strongly correlated.
In this article, we present our contribution to avoiding an excessive number of values simultaneously alive for all the valid schedules of a DAG, previously studied in [4, 5]. Our pre-pass analyzes a DAG (with respect to control flow) to deduce the maximal register need for all schedules. We call this limit the register saturation (RS) because the register need can reach this limit but never exceed it. We provide better heuristics to compute it and, if it exceeds the number of available registers, to reduce it by introducing new arcs. Experimental results show that in most cases our strategies are nearly optimal.
This article is organized as follows. Section 2 presents our DAG model, which can be used for both superscalar and VLIW processors (UAL and NUAL semantics [15]). The RS problem is theoretically studied in Sect. 3. If it exceeds the number of available registers, a heuristic for reducing it is given in Sect. 4. We have implemented software tools, and experimental results are described in Sect. 5. Some related work in this field is given in Sect. 6. We conclude with our remarks and perspectives in Sect. 7.
DAG Model
A DAG $G = (V, E, \delta)$ in our study represents the data dependences between the operations and any other serial constraints. Each operation $u$ has a strictly positive latency $lat(u)$. The DAG is defined by its set of operations $V$, its set of arcs $E = \{(u, v) \mid u, v \in V\}$, and a function $\delta$ such that $\delta(e)$ is the latency of the arc $e$ in terms of processor clock cycles.
A schedule $\sigma$ of $G$ is a positive function which gives an integer execution (issue) time for each operation:

$$\sigma \text{ is valid} \iff \forall e = (u, v) \in E: \ \sigma(v) - \sigma(u) \geq \delta(e)$$

We denote by $\Sigma(G)$ the set of all the valid schedules of $G$. Since writing to and reading from registers could be delayed from the beginning of the operation schedule time (VLIW case), we define the two delay functions $\delta_r$ and $\delta_w$ such that $\delta_w(u)$ is the write cycle of the operation $u$, and $\delta_r(u)$ is the read cycle of $u$. In other words, $u$ reads from the register file at instant $\sigma(u) + \delta_r(u)$, and writes in it at instant $\sigma(u) + \delta_w(u)$.
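As an illustration of the validity condition and of the delayed register accesses, the following is a minimal Python sketch. The dictionary-based encoding of the schedule and of the arc latencies, the function names `is_valid_schedule` and `access_times`, and the three-operation DAG are assumptions made for this example only.

```python
from typing import Dict, Hashable, Tuple

Arc = Tuple[Hashable, Hashable]


def is_valid_schedule(sigma: Dict[Hashable, int], delta: Dict[Arc, int]) -> bool:
    """Return True iff sigma(v) - sigma(u) >= delta(e) for every arc e = (u, v)."""
    return all(sigma[v] - sigma[u] >= latency for (u, v), latency in delta.items())


def access_times(sigma, delta_r, delta_w, u):
    """Return the cycles at which u reads from and writes into the register file."""
    return sigma[u] + delta_r[u], sigma[u] + delta_w[u]


# Hypothetical DAG: a -> b with latency 2, a -> c with latency 1.
delta = {("a", "b"): 2, ("a", "c"): 1}
good = {"a": 0, "b": 2, "c": 1}   # every arc latency is respected
bad = {"a": 0, "b": 1, "c": 1}    # b is issued one cycle too early

print(is_valid_schedule(good, delta))
print(is_valid_schedule(bad, delta))
```

With $\delta_r(b) = 0$ and $\delta_w(b) = 1$, `access_times(good, {"b": 0}, {"b": 1}, "b")` gives the read and write cycles of `b` under the schedule `good`.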
To simplify the writing of some mathematical formulas, we assume that the DAG has one source ($\top$) and one sink ($\bot$). If not, we introduce two fictitious nodes ($\top$, $\bot$) representing nops (evicted at the end of the RS analysis). We add a virtual serial arc $e_1 = (\top, s)$ to each source $s$ with $\delta(e_1) = 0$, and an arc $e_2 = (t, \bot)$ from each sink $t$ with the latency of the sink operation, $\delta(e_2) = lat(t)$.
The total schedule time of a schedule $\sigma$ is then $\sigma(\bot)$. The null latency of an added arc $e_1$ is not inconsistent with our assumption that latencies must be strictly positive because the added virtual serial arcs no longer represent data dependencies. Furthermore, we can avoid introducing these virtual nodes without any consequence on our theoretical study since their purpose is only to simplify some mathematical expressions.
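The addition of the two fictitious nodes can be sketched as follows. This is only an illustrative Python sketch: the node names `"TOP"` and `"BOTTOM"`, the function name `add_source_and_sink`, and the small DAG are hypothetical, and the sketch always adds both nodes rather than first testing whether the DAG already has a single source and a single sink.

```python
from typing import Dict, Hashable, Iterable, Tuple

Arc = Tuple[Hashable, Hashable]


def add_source_and_sink(nodes: Iterable[Hashable], delta: Dict[Arc, int],
                        lat: Dict[Hashable, int],
                        top: Hashable = "TOP",
                        bottom: Hashable = "BOTTOM") -> Dict[Arc, int]:
    """Return a copy of delta extended with virtual serial arcs from top and to bottom."""
    nodes = list(nodes)
    has_pred = {v for (_, v) in delta}
    has_succ = {u for (u, _) in delta}
    extended = dict(delta)
    for s in nodes:
        if s not in has_pred:               # s is a source of the original DAG
            extended[(top, s)] = 0          # null latency: serial arc, not a data dependence
    for t in nodes:
        if t not in has_succ:               # t is a sink of the original DAG
            extended[(t, bottom)] = lat[t]  # latency of the sink operation
    return extended


# Hypothetical DAG with one source (a) and two sinks (b, c).
nodes = ["a", "b", "c"]
delta = {("a", "b"): 2, ("a", "c"): 2}
lat = {"a": 2, "b": 3, "c": 1}
extended = add_source_and_sink(nodes, delta, lat)
```

In any valid schedule of the extended DAG, the issue time of `"BOTTOM"` is then at least the completion time of every sink, which is why it can serve as the total schedule time.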
When studying the register need in a DAG, we distinguish between the nodes, depending on whether they define a value to be stored in a register or not, and also depending on which register type we are focusing on (int, float, etc.). We also distinguish between edges, depending on whether they are flow dependencies through the registers of the type considered:
- $V_R \subseteq V$ is the subset of operations which define a value of the type under consideration (int, float, etc.); we simply call them values. We assume that at most one value of the type considered can be defined by an operation. The operations which define multiple values are taken into account if they define at most one value of the type considered.
- $E_R \subseteq E$ is the subset of arcs representing true dependencies through a value of the type considered. We call them flow arcs.
- $E_S = E - E_R$ are called serial arcs.

Figure 1.b gives the DAG that we use in this paper, constructed from the code of part (a). In this example, we focus on the floating point registers: the values and flow arcs are shown by bold lines. We assume for instance that each read occurs exactly at the schedule time and each write at the final execution step ($\delta_r(u) = 0$, $\delta_w(u) = lat(u) - 1$).
Register Saturation Problem
The RS is the maximal register need for all the valid schedules of the DAG:

$$RS(G) = \max_{\sigma \in \Sigma(G)} RN_\sigma(G)$$
We call $\sigma$ a saturating schedule iff $RN_\sigma(G) = RS(G)$. In this section, we study how to compute $RS(G)$. We will see that this problem comes down to answering the question "which operation must kill this value?" When looking for saturating schedules, we do not worry about the total schedule time. Our aim is only to prove that the register need can reach the RS but cannot exceed it. Minimizing the total schedule time is considered in Sect. 4 when we reduce the RS. Furthermore, for the purpose of building saturating schedules, we have proven in [16] that to maximize the register need, looking for only one suitable killer of a value is sufficient, rather than looking for a group of killers: for any schedule that assigns more than one killer to a value, we can build another schedule with at least the same register need such that this value is killed by only one consumer. So, the purpose of this section is to select a suitable killer for each value to saturate the register requirement.
Since we do not assume any schedule, the life intervals are not defined, so we cannot know at which date a value is killed. However, we can deduce which consumers in $Cons(u)$ are impossible killers for the value $u$. If $v_1, v_2 \in Cons(u)$ and there exists a path $v_1 \leadsto v_2$, then $v_1$ is always scheduled before $v_2$ by at least $lat(v_1)$ processor cycles. Then $v_1$ can never be the last read of $u$ (remember that we assume strictly positive latencies). We can consequently deduce which consumers can "potentially" kill a value (possible killers). We denote by $pkill_G(u)$ the set of the operations which can kill a value $u \in V_R$:
$$pkill_G(u) = \{ v \in Cons(u) \mid \downarrow v \cap Cons(u) = \{v\} \}$$

One can check that all operations in $pkill_G(u)$ are parallel in $G$. Any operation which does not belong to $pkill_G(u)$ can never kill the value $u$.
Proof. A complete proof is given in [17], page 13.
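The set of potential killers can be computed directly from its definition. The following minimal Python sketch assumes a successor-set encoding of the DAG and takes the consumer set $Cons(u)$ as an explicit argument; the function names `descendants` and `pkill` and the small DAG are hypothetical and serve only as an illustration.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of direct successors


def descendants(succ: Graph, v: Hashable) -> Set[Hashable]:
    """Return the nodes reachable from v, with v itself included."""
    seen = {v}
    stack = [v]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def pkill(succ: Graph, consumers: Set[Hashable]) -> Set[Hashable]:
    """Keep the consumers v whose only consumer among their descendants is v itself."""
    return {v for v in consumers if descendants(succ, v) & consumers == {v}}


# Hypothetical DAG: u is read by v1, v2 and v3, and there is an arc v1 -> v2.
succ = {"u": {"v1", "v2", "v3"}, "v1": {"v2"}, "v2": set(), "v3": set()}
cons_u = {"v1", "v2", "v3"}

# v1 is excluded because v2, another consumer of u, is one of its descendants.
print(sorted(pkill(succ, cons_u)))  # ['v2', 'v3']
```

The two remaining candidates, `v2` and `v3`, are not connected by any path, which is consistent with the remark that all potential killers of a value are parallel in $G$.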
A potential killing DAG of $G$, denoted $PK(G) = (V, E_{PK})$, is built to model the potential killing relations between operations (see Fig. 1.c), where:
There may be more than one candidate operation for killing a value. Let us begin by assuming a killing function $k$ which enforces an operation $v \in pkill_G(u)$ to be the killer of $u \in V_R$. If we assume that $k(u)$ is the unique killer of $u \in V_R$, we must always verify the following assertion:
There is a family of schedules which ensures this assertion. To define them, we extend $G$ by new serial arcs that enforce all the potential killing operations of each value $u$ to be scheduled before $k(u)$. This leads us to define an extended DAG associated to $k$, denoted $G_{\to k} = G \setminus E_k$, where:

Provided a valid killing function $k$, we can deduce the values which can never be simultaneously alive for any $\sigma \in \Sigma(G_{\to k})$. Let $\downarrow_R(u) = \downarrow u \cap V_R$ be the set of the descendant values of $u \in V$.

Lemma 2. Given a DAG $G = (V, E, \delta)$ and a valid killing function $k$, then:
1. the descendant values of $k(u)$ cannot be simultaneously alive with $u$:
2. there exists a valid schedule which makes the other values, non-descendants of $k(u)$, simultaneously alive with $u$, i.e. $\forall u \in V_R \ \exists \sigma \in \Sigma(G_{\to k})$:
A Heuristic for Computing the RS
This section presents our heuristics to approximate an optimal $k$ by another valid killing function $k^*$. We have to choose a killing operation for each value such that we maximize the parallel values in $DV_k(G)$. Our heuristics focus on the potential killing DAG $PK(G)$, starting from source nodes to sinks. Our aim is to select a group of killing operations for a group of parents to keep as many descendant values alive as possible. The main steps of our heuristics are:
1. decompose the potential killing DAG $PK(G)$ into connected bipartite components;
2. for each bipartite component, search for the best saturating killing set (defined below);
3. choose a killing operation within the saturating killing set (defined below).

We decompose the potential killing DAG into connected bipartite components (CBC) in order to choose a common saturating killing set for a group of parents.
- $E_{cb} \subseteq E_{PK}$ is a subset of the potential killing relations;
- $S_{cb} \subseteq V_R$ is the set of the parent values, such that each parent is killed by at least one operation in $T_{cb}$;
- $T_{cb} \subseteq V$ is the set of the children, such that any operation in $T_{cb}$ can potentially kill at least one value in $S_{cb}$.

A bipartite decomposition of the potential killing graph $PK(G)$ is the set (see Fig. 2.d

1. Greedy-k always produces a valid killing function $k^*$;
2. $PK(G)$ is an inverted tree $\Longrightarrow$ Greedy-k is optimal.

Proof. Complete proofs for both (1) and (2) are given in [17], pages 31 and 44 respectively.
Since the approximated killing function $k^*$ is valid, Theorem 1 ensures that we can always find a valid schedule which requires exactly $|AM_{k^*}|$ registers. As a consequence, our heuristic does not compute an upper bound of the optimal register saturation, and the optimal RS can therefore be greater than the one computed by Greedy-k. A conservative heuristic which computes a solution exceeding the optimal RS cannot ensure the existence of a valid schedule which reaches the computed limit, and hence it would imply an unnecessary RS reduction process and a waste of registers. The validity of a killing function is a key condition because it ensures that there exists a register allocation with exactly $|AM_{k^*}|$ registers.
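The quantity $|AM_{k^*}|$ is the size of a maximal antichain of the disjoint value DAG (step 3 of the summary that follows). As a minimal sketch of how such a size can be obtained for any DAG, the Python code below uses the classical consequence of Dilworth's theorem: the maximal antichain size equals the number of nodes minus the size of a maximum matching in the bipartite graph built from the transitive closure. The graph encoding, the function names and the example are assumptions for illustration; this is not the LEDA-based implementation used in our tools.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of direct successors


def reachable(succ: Graph, src: Hashable) -> Set[Hashable]:
    """Return the strict descendants of src (src itself excluded in a DAG)."""
    seen: Set[Hashable] = set()
    stack = list(succ.get(src, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return seen


def max_antichain_size(nodes: List[Hashable], succ: Graph) -> int:
    """Size of a maximum antichain: |nodes| minus a maximum bipartite matching."""
    closure = {u: reachable(succ, u) for u in nodes}
    match: Dict[Hashable, Hashable] = {}  # matched descendant -> its ancestor

    def augment(u: Hashable, visited: Set[Hashable]) -> bool:
        # Standard augmenting-path search (Kuhn's algorithm).
        for v in closure[u]:
            if v in visited:
                continue
            visited.add(v)
            if v not in match or augment(match[v], visited):
                match[v] = u
                return True
        return False

    matching = sum(1 for u in nodes if augment(u, set()))
    return len(nodes) - matching


# Hypothetical DAG: a -> c, b -> c, c -> d. The largest antichain is {a, b}.
nodes = ["a", "b", "c", "d"]
succ = {"a": {"c"}, "b": {"c"}, "c": {"d"}, "d": set()}
print(max_antichain_size(nodes, succ))  # 2
```

Applied to a disjoint value DAG, the returned size would be the number of values that can be simultaneously alive under the chosen killing function.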
In summary, here are our steps to compute the RS:
1. apply Greedy-k on $G$. The result is a valid killing function $k^*$;
2. construct the disjoint value DAG $DV_{k^*}(G)$;
3. find a maximal antichain $AM_{k^*}$ of $DV_{k^*}(G)$ using Dilworth decomposition [10].

Saturating values are then $AM_{k^*}$ and $RS^*(G) = |AM_{k^*}| \leq RS(G)$.

If this path is greater than the critical path in $G_i$, then $\omega_2$ is the difference between them, 0 otherwise. At the end of the algorithm, we apply a general verification step to ensure the potential killing property proven in Lemma 1 for the original DAG. We have proven in Lemma 1 that the operations which do not belong to $pkill_G(u)$ cannot kill the value $u$. After adding the serial arcs to build G, we might violate this assertion because we introduce some arcs with negative latencies. To overcome this problem, we must guarantee the following assertion: $\forall u \in V_R, \forall v'$
Algorithm 2 
Experimentation
We have implemented the RS analysis using the LEDA framework. We carried out our experiments on various floating point numerical loops taken from various benchmarks (livermore, whetstone, spec-fp, etc.). We focus in these codes on the floating point registers. The first experiment is devoted to checking the efficiency of Greedy-k. For this purpose, we have defined and implemented in [16] an integer linear programming model to compute the optimal RS of a DAG. We use CPLEX to solve these linear programming models. The total number of DAGs in this experiment is 180, where the number of nodes goes up to 120 and the number of values goes up to 114. Experimental results show that our heuristics give quasi-optimal solutions. The worst experimental error is 1, which means that the optimal RS is in the worst case greater by one register than the one computed by Greedy-k. The second experiment is devoted to checking the efficiency of the value serialization heuristics in order to reduce the RS. We have also defined and implemented in [16] an integer linear programming model to compute the optimal reduction of the RS with a minimum critical path increase (an NP-hard problem). The total number of DAGs in this experiment is 144, where the number of nodes goes up to 80 and the number of values goes up to 76. In almost all cases, our heuristics manage to reach the optimal solutions. The optimal reduced RS was in the worst cases less by one register than the result of our heuristics. Since the RS computation in the value serialization heuristics is done by Greedy-k, we add its worst experimental error (1 register), which leads to a total maximal error of two registers. All optimal vs. approximated results are fully detailed in [16].
Since our strategies are efficient, we use them to study the RS behavior in DAGs. Experiments on loop bodies alone show that the RS is low, ranging from 1 to 8. We have unrolled these loops with different unrolling factors going up to 20 times. The aim of such unrolling is to get large DAGs, increase the register pressure and expose more ILP to hide memory latencies. We carried out a wide range of experiments to study the RS and its reduction with various limits of available registers (going from 1 up to 64). We experimented on 720 DAGs where the number of nodes goes up to 400 and the number of values goes up to 380. Full results are detailed in [17, 16].
The first remark deduced from our full experiments is that the RS is lower than the number of available registers in many cases. The RS analysis makes it possible to avoid the register constraints in code scheduling: most of these codes can be scheduled without any interaction with the register allocation, which decreases the compile-time complexity. Second, in most cases our heuristics succeed in reducing the RS until the targeted limit is reached. In a few cases we lose some ILP because of the intrinsic register pressure of the DAGs; but since spill code decreases the performance dramatically because of the memory access latencies, a tradeoff between spilling and increasing the overall schedule time can be made in a few critical cases. Finally, in the cases where the RS is lower than the number of available registers, we can use the extra unused registers by assigning to them some global variables and array elements, with the guarantee that no spill code can be introduced afterwards by the scheduler and the register allocator.
Related Work and Discussion
Combining code scheduling and register allocation in DAGs has been studied in many works. All the techniques described in [11, 6, 13, 7, 12] used their heuristics to build an optimized schedule without exceeding a certain limit of values simultaneously alive. The dual notion of the RS, called the register sufficiency, was studied in [1]. Given a DAG, the authors gave a heuristic which found the minimum register need; the result was within an $O(\log^2 |V|)$ factor of the optimal. Note that we can easily use the RS reduction to compute the register sufficiency. This is done in practice by setting $R = 1$ as the targeted limit for the RS reduction.
Our work is an extension of URSA [4, 5]. The minimum killing set technique tried to saturate the register requirement in a DAG by keeping the values alive as late as possible: the authors proceeded by keeping as many children alive as possible in a bipartite component by computing the minimum set which killed all the parents' values. First, since the authors did not formalize the RS problem, we can easily give examples to show that a minimum killing set does not saturate the register need, even if the solution is optimal [17]. Figure 6 shows an example where the RS computed by our heuristics (part b) is 6, whereas the optimal solution for URSA yields an RS of 5 (part c). This is because URSA did not take into account the descendant values while computing the killing sets. Second, the validity of the killing functions is an important condition to compute the RS and unfortunately was not included in URSA. We have proven in [17] that non-valid killing functions can exist if no care is taken. Finally, the URSA DAG model did not differentiate between the types of the values and did not take into account delays in reads from and writes into the register file.

In our work, we mathematically study and define the RS notion to manage the register pressure and avoid spill code before the scheduling and register allocation passes. We extend URSA by taking into account the operations in both Unit and Non Unit Assumed Latencies (UAL and NUAL [15]) semantics with different types of operations (values and non-values) and of values (float, integer, etc.). The formal mathematical modeling and theoretical study permit us to give nearly optimal strategies and prove that the minimum killing set is insufficient to compute the RS. Experiments show that the register constraints are unnecessary in many codes, and may therefore be ignored in order to simplify the scheduling process. The heuristics we use manage to reduce the RS in most cases, while some ILP is lost in a few DAGs.
We think that reducing the RS is better than minimizing the register need: minimizing the register need increases the register reuse, and the ILP loss must increase as a consequence. Our DAG model is sufficiently general to meet all current architecture properties (RISC or CISC), except for some architectures which support issuing dependent instructions at the same clock cycle, which would require a representation using null latencies. Strictly positive latencies are assumed in order to prove the pkill operation property (Lemma 1), which is important to build our heuristics. We think that this restriction should be neither a major drawback nor an important factor in performance degradation, since null latency operations do not generally contribute to the critical execution paths. In the future, we will extend our work to loops. We will study how to compute and reduce the RS in the case of cyclic schedules like software pipelining (SWP), where the life intervals become circular.
