Abstract-This paper presents several theoretical and fundamental results on the register need in periodic schedules, also known as MAXLIVE. Our first contribution is a novel formula for computing the exact number of registers needed by a scheduled loop. This formula has two advantages: Its computation can be done by using a polynomial algorithm with $O(n \lg n)$ complexity ($n$ is the number of instructions in the loop) and it allows the generalization of a previous result [13]. Second, during software pipelining, we show that the minimal number of registers needed may increase when incrementing the initiation interval (II), which is contrary to intuition. For the case of zero architectural delays in accessing registers, we provide a sufficient condition for keeping the minimal number of registers from increasing when incrementing the II. Third, we prove an interesting property that enables us to optimally compute the minimal periodic register sufficiency of a loop for all its valid periodic schedules, irrespective of II. Fourth and last, we prove that the problem of optimal stage scheduling under register constraints is polynomially solvable for a subclass of data dependence graphs, whereas this problem is known to be NP-complete for arbitrary dependence graphs [7]. Our latter result generalizes a previous achievement [13] which addressed data dependence trees and forests of trees. In this study, we consider cyclic data dependence graphs without taking into account any resource constraints. The aim of our theoretical results on the periodic register need is to help current and future software pipeliners achieve significant performance improvements by making better (if not the best) use of the available resources.
INTRODUCTION
Software pipelining (SWP) is a common way to schedule innermost loops in order to extract a large amount of instruction-level parallelism (ILP). In addition to the inherent data dependence constraints, computing a periodic schedule of a loop must obey two main families of constraints. The first one is related to resource constraints that must be satisfied in order to avoid oversaturating the functional units of the underlying processor. The second family consists of register constraints: The computed periodic schedule must not require more registers than the ones available. The SWP must not only obey these two families of constraints but also maximize the execution rate (minimize the initiation interval (II)) of the loop. In this paper, we focus only on the register constraints and we do not consider any functional unit limitation nor any resource model.
Ideally, one should prefer to bring effective methods to minimize the II under a fixed number of available registers. Unfortunately, the literature focuses on the dual method: Given a fixed integral II, how can we minimize the register need (RN)? This is because, as far as we know, the RN is hard to minimize if an integral II is not fixed. Thus, many authors provide heuristics to reduce the RN for a fixed integral II [9], [10], [15], [17], [24]. If II is assumed to be a rational period, paper [14] provides a method for minimizing II with a limited number of registers. Assuming rational periods is another method of periodic scheduling which is distinct from the common SWP. In this paper, we consider integral periods. This paper investigates several fundamental aspects of minimizing the periodic RN in SWP. The ancestor problem of minimal RN in the case of basic blocks (acyclic schedules) profits from plenty of studies, resulting in a rich theoretical literature. Unfortunately, the periodic (cyclic) problem has fewer fundamental results. Our present fundamental results in this topic allow us to better understand the register constraints in periodic instruction scheduling and, hence, to help the community to provide better SWP heuristics and techniques in the future.
In order to be "optimal," SWP techniques should schedule the instructions of a loop in harmony with many constraints, such as data dependence constraints and target processor constraints. The usual processor limitations that are commonly taken into account are registers, functional units, instruction selection, and coding constraints. However, there are other hardware characteristics that are not currently considered in optimal SWP techniques (as far as we know): cache effects (variable load latencies), memory disambiguation mechanisms, memory banking and interleaving, load-store queues, dynamic speculation, dynamic register renaming, and so forth. We think that if all of these hardware constraints are modeled inside the same complex SWP, it will be hard to come up with a mathematical intelligibility of the SWP problem. Thus, studying SWP under data dependences and register constraints separately from the other resource constraints serves the purpose of separating complex problems to deduce some mathematical characteristics that are useful for writing better general SWP heuristics.
Our paper is organized as follows: Section 2 recalls formal notations and definitions about SWP and periodic RN. Section 3 finds a new formula for computing the RN of a scheduled loop which can be computed by a polynomial algorithm. Section 4 provides a sufficient condition so that the minimal RN does not increase when incrementing II; however, such a condition is proven in the case of architectures with zero delays in accessing registers, which reflects the most common architectures. This condition is used to show how the periodic register sufficiency (PRS) of a loop can be computed independently from any periodic schedule. Before the conclusion, Section 5 examines the problem of stage scheduling with register minimization in the special case of expression trees and in more general cases, which are data dependence graphs (DDGs) that assign a unique possible killer per variable.
BACKGROUND
We consider a simple innermost loop (without branches but with possible recurrences). It is represented by a DDG $G = (V, E, \delta, \lambda)$ such that

• $V$ is the set of the statements in the loop. The instance of statement $u$ (an operation) of iteration $i$ is denoted by $u(i)$ and, when referring to an arbitrary iteration of a statement $u$, we simply write $u$.
• $E$ is the set of precedence constraints (flow dependences or other serial constraints). Any edge $e$ has the form $e = (u, v)$, where $\delta(e)$ is the latency of the edge $e$ in terms of processor clock cycles and $\lambda(e)$ is the distance of the edge $e$ in terms of the number of iterations.
• A valid schedule $\sigma$ must satisfy $\forall i, \forall e = (u, v) \in E: \sigma(u(i)) + \delta(e) \le \sigma(v(i + \lambda(e)))$.
We consider a target RISC-style architecture and we distinguish between statements and precedence constraints, depending upon whether they refer to values to be stored in registers or not:
1. $V_R \subseteq V$ is the set of statements that produce values to be stored in registers.
2. $E_R \subseteq E$ is the set of flow dependence edges through a register. The set of consumers (readers) of a value $u \in V_R$ is therefore the set $Cons(u) = \{v \in V \mid (u, v) \in E_R\}$.
In order to consider static issue VLIW processors in which the hardware pipeline steps are visible to compilers (we consider superscalar processors too), we assume that reading from and writing into a register may be delayed from the beginning of the schedule time and these delays are visible to the compiler (architecturally visible). We define two delay (offset) functions $\delta_r$ and $\delta_w$, in which

• $\delta_w$ is the write delay: the write cycle of $u$ into a register is $\sigma(u) + \delta_w(u)$;
• $\delta_r: V \to \mathbb{N}$, $u \mapsto \delta_r(u)$ with $0 \le \delta_r(u)$, is the read delay: the read cycle of $u$ from a register is $\sigma(u) + \delta_r(u)$.
According to the semantics of superscalar processors (sequential semantics) and EPIC/IA64, $\delta_r$ and $\delta_w$ are equal to 0. Also, most of the VLIW processors have zero reading/writing delays. However, a few VLIW processors, such as Trimedia, have nonzero reading/writing delays.
The next section recalls the basic notations and definitions in SWP.
Software Pipelining
SWP is basically a scheduling method. It can be modeled by a function $\sigma$ that assigns to each statement $u$ a scheduling date (in terms of clock cycle) that satisfies the precedence constraints. The most common form, modulo scheduling, is defined by an II and the scheduling date $\sigma_u \in \mathbb{N}$ for each operation $u(0)$ of the first iteration. Operation $u$ of iteration $i$ is scheduled at time $\sigma_u + i \times II$. The total schedule time of one iteration of the original loop body is noted $L$, with $L \ge \max_{u \in V} \sigma_u$, and $II \le L$ is the total schedule time of the new loop kernel. We call $L$ the duration (sometimes called the iteration length or time horizon). Fig. 1b is an example of an SWP schedule of the DDG shown in Fig. 1a in which the values and flow edges are drawn with bold lines. A pair of the labels $(\delta(e), \lambda(e))$ is associated with each edge $e$.
Any valid periodic schedule must satisfy $\forall e = (u, v) \in E: \sigma_u + \delta(e) \le \sigma_v + \lambda(e) \times II$.
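The periodic precedence constraint can be checked mechanically, edge by edge. The following Python sketch is illustrative only: the edge encoding `(u, v, latency, distance)`, the function name `is_valid_swp`, and the two-statement toy loop are assumptions made for this example and do not come from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical edge encoding: (u, v, latency delta(e), distance lambda(e)).
Edge = Tuple[str, str, int, int]


def is_valid_swp(sigma: Dict[str, int], edges: List[Edge], ii: int) -> bool:
    """Check sigma_u + delta(e) <= sigma_v + lambda(e) * II for every edge e = (u, v)."""
    return all(sigma[u] + lat <= sigma[v] + dist * ii for u, v, lat, dist in edges)


# Toy loop: a -> b (latency 2, same iteration), b -> a (latency 1, next iteration).
edges = [("a", "b", 2, 0), ("b", "a", 1, 1)]
sigma = {"a": 0, "b": 2}

print(is_valid_swp(sigma, edges, ii=3))  # the recurrence needs II >= 3
print(is_valid_swp(sigma, edges, ii=2))  # II = 2 violates the edge b -> a
```

The check is linear in the number of edges; it says nothing about resources, in line with the resource-free model used here.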
Classically, by adding all such inequalities (precedence constraints) along any cycle $C$ of $G$, we find that II must be greater than or equal to

$\max_{C} \left\lceil \frac{\sum_{e \in C} \delta(e)}{\sum_{e \in C} \lambda(e)} \right\rceil$,

which we will denote in the sequel as the minimal initiation interval (MII). In this paper, we ignore ResMII since we do not assume any resource model. Thus, MII in our text is equivalent to RecMII.

Wang et al. [23] modeled the kernel of an SWP schedule as a 2D matrix by defining a column number $cn$ and row number $rn$ for each statement (see Fig. 1c). This brings a new definition for SWP, which becomes a triple $(rn, cn, II)$. The row number $rn$ of a statement $u$ is its issue date inside the kernel. The column number $cn$ of a statement $u$ inside the kernel, sometimes called kernel cycle, is its stage number. The last parameter II is the kernel length. This triple formally defines the SWP schedule $\sigma$ as
$\sigma(u(i)) = rn(u) + II \times (cn(u) + i)$,
where $cn(u) = \lfloor \sigma_u / II \rfloor$ and $rn(u) = \sigma_u \bmod II$. For the rest of this paper, we will write $\sigma = (rn, cn, II)$ to reflect the equivalence (equality) between the SWP scheduling function $\sigma$, which is defined from the set of statements to clock cycles, and the SWP scheduling function defined by the triple $(rn, cn, II)$.
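As a small illustration of this equivalence, the sketch below converts a scheduling date into its row and column numbers and back. The function names `to_kernel` and `issue_date` are hypothetical and chosen only for this example.

```python
from typing import Tuple


def to_kernel(sigma_u: int, ii: int) -> Tuple[int, int]:
    """Return (rn, cn) with rn = sigma_u mod II and cn = floor(sigma_u / II)."""
    cn, rn = divmod(sigma_u, ii)
    return rn, cn


def issue_date(rn: int, cn: int, ii: int, i: int) -> int:
    """sigma(u(i)) = rn(u) + II * (cn(u) + i)."""
    return rn + ii * (cn + i)


rn, cn = to_kernel(7, ii=3)  # sigma_u = 7 with II = 3
print(rn, cn)                # row number and stage (column) number
print(issue_date(rn, cn, ii=3, i=0), issue_date(rn, cn, ii=3, i=2))
```

For iteration 0, `issue_date` returns the original scheduling date, and each later iteration is shifted by one II.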
Let $\Sigma(G)$ be the set of all valid SWP schedules of a loop $G$. We denote by $\Sigma_L(G)$ the set of all valid SWP schedules whose durations (the total schedule time of one original iteration) do not exceed $L$:

$\Sigma_L(G) = \{\sigma \in \Sigma(G) \mid \forall u \in V: \sigma_u \le L\}$.
$\Sigma(G)$ is an infinite set of schedules, whereas $\Sigma_L(G) \subset \Sigma(G)$ is finite. Bounding the duration $L$ in the SWP scheduling allows, for instance, looking for periodic schedules with finite prologue/epilogue codes since the size of the prologue/epilogue codes is $L - II$ and $0 \le II \le L$. The next section recalls the notion of RN in periodic schedules.
Periodic Register Need
The value produced by the operation $u(0)$ is written into a register at $\sigma_u + \delta_w(u)$ clock cycles starting from the execution date of the whole loop, which defines its birth date. The killers of this value are all the last scheduled consumers (readers). The value $u(0)$ is dead after its last use(s) at a cycle, which we denote as

$d_\sigma(u) = \max_{v \in Cons(u)} \left( \sigma_v + \delta_r(v) + \lambda(e) \times II \right)$, with $e = (u, v) \in E_R$.
To generalize, the value $u$ of the $i$th iteration ($u(i)$) is defined at the absolute time $\sigma_u + \delta_w(u) + i \times II$ (starting from the execution date of the whole loop) and killed at the absolute time $d_\sigma(u) + i \times II$. Thus, the end points of the lifetime intervals of the distinct operations of any statement $u$ are all separated by a constant time that is equal to II. Given such fixed II, we can model the periodic lifetime intervals during the steady state by considering the lifetime interval of only one instance $u(i)$ per statement, say, $u(0)$, which we will simply abbreviate by $u$.
In our model, we assume that a value written at instant c is alive one step later. This is not a limitation of the model but a choice. It does not alter the mathematical results of our study.
The acyclic lifetime interval (range) of the value $u \in V_R$ is then equal to $LT_\sigma(u) = \; ]\sigma_u + \delta_w(u), d_\sigma(u)]$.
As can be seen, this interval is left open because we assume that the value is alive one step after its writing. Fig. 2 shows, for instance, the acyclic lifetime intervals of v1, v2, and v3. The lifetime of a value $u \in V_R$ is the total number of clock cycles during which this value is alive according to the schedule $\sigma$, that is, the length of its acyclic lifetime interval: $lifetime_\sigma(u) = d_\sigma(u) - \sigma_u - \delta_w(u)$. For instance, the lifetimes of v1, v2, and v3 in Fig. 2 are (respectively) two, three, and six clock cycles.
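The birth and death dates can be derived from a schedule and the flow edges. The sketch below assumes that the death date of a value is the latest read cycle among its consumers, where a consumer reached through a flow edge of distance $\lambda(e)$ reads $\lambda(e) \times II$ cycles later. The flow-edge encoding, the function name `lifetime_intervals`, and the toy data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical flow-edge encoding: (producer u, consumer v, distance lambda(e)).
FlowEdge = Tuple[str, str, int]


def lifetime_intervals(
    sigma: Dict[str, int],
    flow_edges: List[FlowEdge],
    ii: int,
    delta_r: Optional[Dict[str, int]] = None,
    delta_w: Optional[Dict[str, int]] = None,
) -> Dict[str, Tuple[int, int]]:
    """Return the left-open interval ]birth, death] of each value that has a consumer."""
    delta_r = delta_r or {}  # zero read delays unless stated otherwise
    delta_w = delta_w or {}  # zero write delays unless stated otherwise
    death: Dict[str, int] = {}
    for u, v, dist in flow_edges:
        read = sigma[v] + delta_r.get(v, 0) + dist * ii
        death[u] = max(death.get(u, read), read)
    return {u: (sigma[u] + delta_w.get(u, 0), d) for u, d in death.items()}


sigma = {"a": 0, "b": 2, "c": 5}
flow = [("a", "b", 0), ("a", "c", 0), ("b", "a", 1)]
intervals = lifetime_intervals(sigma, flow, ii=3)
for u, (birth, dead) in sorted(intervals.items()):
    print(u, (birth, dead), "lifetime =", dead - birth)
```

Values without any consumer are omitted from the result in this sketch; how such values should be handled is left open here.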
The periodic RN (also known in the literature as the register requirement or MAXLIVE) is the maximal number of values which are simultaneously alive in the SWP kernel. In order to clarify potential confusion, the reader must distinguish between two concepts about RN:
1. As we define in this paper, the RN is the exact maximal number of values that are simultaneously alive (also called MAXLIVE).
2. Sometimes, the RN refers to the number of registers used for the final SWP schedule (at the register allocation and code generation step). For this case, MAXLIVE constitutes a lower bound of the RN. However, this lower bound can always be reached, as proven in [2], if we unroll the loop sufficiently or if we insert move instructions. If neither loop unrolling nor inserting move instructions is allowed, we may require MAXLIVE+1 registers to generate the code (the experimental evidence is established in [16] and proven later in [12]).

In the case of a periodic schedule, some values may be alive during several consecutive kernel iterations and different instances of the same variable may interfere. Previous important results by Hendren et al. [8] show that the lifetime intervals during the steady state describe a circular lifetime interval graph around the kernel: We "wrap" (roll up) the acyclic lifetime intervals of the values around a circle of circumference II and, therefore, the lifetime intervals become cyclic. We give here a formal definition of such circular intervals.
Definition 1 (Circular Lifetime Interval). A circular lifetime interval produced by wrapping a circle of circumference II by an acyclic interval $I = \; ]a, b]$ is defined by a triplet of integers $(l, r, p)$ such that

• $l = a \bmod II$ is the left end of the circular interval;
• $r = b \bmod II$ is the right end of the circular interval;
• $p = \lfloor (b - a) / II \rfloor$ is the number of complete turns around the circle.

The set of all of the circular lifetime intervals around the kernel defines a circular interval graph, which we denote by $C(G)$. In this paper, we use the short term circular interval to indicate a circular lifetime interval and the term circular graph for indicating a circular lifetime intervals graph. Fig. 3a gives an example of a circular graph. The maximal number of simultaneously alive values is the width of this circular graph, that is, the maximal number of circular intervals that interfere at a certain point of the circle. For instance, the width of the circular graph in Fig. 3a is 4. Fig. 2b is another representation of the circular graph. We denote by $RN_\sigma(G)$ the periodic RN of the DDG $G$ according to the schedule $\sigma$, which is equal to the width of the circular graph.
COMPUTING PERIODIC REGISTER NEED
Computing the width of a circular graph is straightforward. We can compute the number of simultaneously alive values at each clock cycle in the SWP kernel. This method is commonly used in the literature [9], [10], [15], [17], [24]. Unfortunately, it leads to a method whose complexity depends on the II. This factor is pseudopolynomial because it does not strictly depend on the size of the input DDG, but, rather, depends on the specified latencies in the DDG and on its structure (critical cycle). We want to provide a better method whose complexity depends only on the size of the DDG, that is, depends only on $n$, which is the number of statements (number of DDG vertices). For this purpose, we find a relationship between the width of a circular interval graph and the size of a maximal clique in the interference graph. We are aware that other good polynomial methods for computing MAXLIVE may exist. However, our method brings a new formula that has two main advantages: First, it is formal and provably correct and, second, it is important for improving and generalizing the previous results [3], [13] in Section 5.
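For reference, the kernel-traversal method mentioned above can be sketched as follows. This is an illustrative Python version, not code from the cited works: the function name `register_need_by_traversal` and the sample intervals are assumptions. Each value is given by its left-open acyclic lifetime interval $]a, b]$, and the instances alive at kernel cycle $t$ are counted in closed form.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # left-open acyclic lifetime interval ]a, b]


def register_need_by_traversal(lifetimes: List[Interval], ii: int) -> int:
    """MAXLIVE by visiting every kernel cycle: O(II * n), hence pseudopolynomial."""
    best = 0
    for t in range(1, ii + 1):
        alive = 0
        for a, b in lifetimes:
            # Number of integers k with a + k*II < t <= b + k*II, that is, the
            # number of instances of this value alive at steady-state cycle t.
            alive += (b - t) // ii - (a - t) // ii
        best = max(best, alive)
    return best


lifetimes = [(0, 3), (2, 5), (4, 7)]
print(register_need_by_traversal(lifetimes, ii=6))
print(register_need_by_traversal(lifetimes, ii=2))
```

The outer loop runs II times, so the running time grows with the value of II rather than only with the number of statements, which is exactly the dependence the text sets out to remove.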
In general, the width of a circular interval graph is not equal to the size of a maximal clique in the interference graph [22]. This is contrary to the case of acyclic interval graphs, where the size of a maximal clique in the interference graph is equal to the width of the interval graph. In order to effectively compute this width (which is equal to the RN), we decompose the circular graph $C(G)$ into two parts:
1. The first part is the integral part. It corresponds to the number of complete turns around the circle, that is, the total number of value instances that are simultaneously alive during the whole steady state of the SWP schedule: $\sum_{(l, r, p) \text{ a circular interval}} p$.
2. The second part is the fractional (residual) part. It is composed of the remainder of the lifetime intervals after removing all of the complete turns (see Figs. 3b and 3c). The size of each remaining interval is strictly less than II, which is the size of the SWP kernel. Note that, if the left end of a circular interval is equal to its right end, $l = r$, then the remaining interval after ignoring the complete turns around the circle is empty, $]l, r] = \; ]l, l] = \emptyset$. These empty intervals are then ignored from this second part. Two classes of intervals that remain are listed as follows:
   a. Intervals that do not cross the kernel barrier, that is, when the left end is less than the right end, $l < r$. In Figs. 3b and 3c, $v_1$ belongs to this class.
   b. Intervals that cross the kernel barrier, that is, when the left end is greater than the right end, $l > r$. In Figs. 3b and 3c, $v_2$ and $v_3$ belong to this class. These intervals can be seen as two fractional intervals ($]l, II]$ and $]0, r]$) which represent the left and the right parts of the lifetime intervals. If we merge these two acyclic fractional intervals of two successive SWP kernels, we create a new contiguous circular interval.

These two classes of intervals define a new circular graph. We call it a fractional [1] circular graph because the size of its lifetime intervals is less than II. This circular graph contains the circular intervals of the first class and those of the second class after merging the left part of each interval with its right part (see Fig. 3b).
Definition 2 (Fractional Circular Graph). Let $C(G)$ be a circular graph of a DDG $G = (V, E, \delta, \lambda)$. The fractional circular graph, denoted by $\overline{C}(G)$, is the circular graph after ignoring the complete turns around the circle:

$\overline{C}(G) = \{(l, r) \mid \exists (l, r, p) \in C(G) \wedge r \ne l\}$.

We call the circular interval $(l, r)$ a circular fractional interval. The length of each fractional interval $(l, r) \in \overline{C}(G)$ is less than II clock cycles. Therefore, the periodic RN becomes equal to

$RN_\sigma(G) = \left( \sum_{(l, r, p) \in C(G)} p \right) + w(\overline{C}(G))$, (1)

where $w$ denotes the width of the fractional circular graph (the maximal number of values simultaneously alive).
Computing the first term of this formula (complete turns around the circle) is easy and can be done in linear time (provided lifetime intervals) by iterating over the $n$ lifetime intervals and adding the integral part $\lfloor lifetime_\sigma(u) / II \rfloor$. However, the second term of the formula is more difficult to compute in polynomial time. This is because, as stated before, the size of a maximal clique (in the case of an arbitrary circular graph) in the interference graph is not equal to the width of the circular interval graph [22]. In order to find an effective algorithmic solution, we use the fact that the fractional circular graph $\overline{C}(G)$ has circular intervals that do not make complete turns around the circle. Then, if we unroll the kernel exactly once to consider the values produced during two successive kernel iterations, some circular interference patterns become visible inside the unrolled kernel. For instance, the circular graph in Fig. 4a has a width equal to 2. Its interference graph in Fig. 4b has a maximal clique of size 3. Since the size of these intervals does not exceed the period II, we unroll the circular graph once, as shown in Fig. 4c. The interference graph of the circular intervals in Fig. 4d has a size of a maximal clique equal to the width, which is 2. Note that v2 does not interfere with v3' because, as said before, we assume that all lifetime intervals are left open.
When unrolling the kernel once, each fractional interval $(l, r) \in \overline{C}(G)$ becomes associated with two acyclic intervals $I$ and $I'$ constructed by merging the left and the right parts of the fractional interval of two successive kernels. $I$ and $I'$ are then defined as follows:

• If $r \ge l$, then $I = \; ]l, r]$ and $I' = \; ]l + II, r + II]$.
• If $r < l$, then $I = \; ]l, r + II]$ and $I' = \; ]l + II, r + 2 \times II]$.
Theorem 1. Let $\overline{C}(G)$ be a circular fractional graph (no complete turns around the circle exist). For each circular fractional interval $(l, r) \in \overline{C}(G)$, we associate the two corresponding acyclic intervals $I$ and $I'$. The cardinality of any maximal clique in the interference graph of all of these acyclic intervals is equal to the width of $\overline{C}(G)$.

Proof. Please see Appendix A, which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm.

Theorem 1 proves an important property that allows us to compute (1) in polynomial time. The second term of that formula, which is the width of the fractional circular graph, can now be computed after unrolling the kernel once (linear time complexity) and then by computing the width of the acyclic fractional intervals graph. This can be done with a complexity of $O(2 \times n \lg n) = O(n \lg n)$ [6]. The first part of the formula, as stated before, can be computed in a linear time complexity (assuming that circular intervals are provided).
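A minimal sketch of how formula (1) and Theorem 1 could be turned into code is given below. It is one possible reading of the method, not the author's implementation: the names `acyclic_width` and `periodic_register_need` are hypothetical, and the width of the unrolled acyclic intervals is obtained with a sort-based sweep, which is one standard $O(n \lg n)$ way to compute the width of an acyclic interval graph.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # left-open acyclic lifetime interval ]a, b]


def acyclic_width(intervals: Iterable[Interval]) -> int:
    """Maximal number of left-open intervals ]a, b] that share a common point."""
    events = []
    for a, b in intervals:
        events.append((a, 1))   # the value becomes alive just after a
        events.append((b, -1))  # the value is still alive at b, dead after it
    # At equal coordinates, ends (-1) sort before starts (+1), so ]x, 5] and
    # ]5, y] are never counted as interfering.
    events.sort()
    alive = best = 0
    for _, step in events:
        alive += step
        best = max(best, alive)
    return best


def periodic_register_need(lifetimes: List[Interval], ii: int) -> int:
    """Formula (1): complete turns plus the width of the fractional circular graph."""
    turns = 0
    unrolled: List[Interval] = []
    for a, b in lifetimes:
        l, r, p = a % ii, b % ii, (b - a) // ii
        turns += p
        if l == r:
            continue  # empty fractional interval
        if r > l:
            unrolled += [(l, r), (l + ii, r + ii)]
        else:
            unrolled += [(l, r + ii), (l + ii, r + 2 * ii)]
    return turns + acyclic_width(unrolled)


# Three values that interfere pairwise around a kernel of II = 6 but never all
# at the same cycle, plus one value that makes a complete turn.
print(periodic_register_need([(0, 3), (2, 5), (4, 7)], ii=6))
print(periodic_register_need([(0, 3), (2, 5), (4, 7), (0, 6)], ii=6))
```

In the first call, the three fractional intervals are pairwise interfering (a clique of size 3 in the circular interference graph), yet at most two of them overlap at any point after unrolling once. The cost is dominated by sorting the $2n$ unrolled intervals and does not depend on the value of II.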
The result presented in this section shows an interesting formula, that is, (1), that allows us to compute the exact RN of a scheduled loop by using a polynomial algorithm. This is a new aspect about SWP, where the usual methods of computing RN are pseudopolynomial. Also, this new method of RN computation will be used in Section 5 to generalize previous results [3], [13]. Someone could argue that computing the periodic RN by traversing the II, even if it is pseudopolynomial, is very fast and simple in practice. Such an argument is valid from the computer engineering point of view. However, this is not an acceptable claim from the computer science point of view for two main reasons:

1. If a method computing the periodic RN by traversing the II is fast in practice, we should try to formally prove that it would be fast for any input DDG. Usually, experiments are done on a finite set of nonrepresentative benchmarks and on typical machines and software setup. Thus, such experiments do not provide a general guarantee for the efficiency of a method.
2. According to algorithmic theory, a pseudopolynomial algorithm may run in time that is exponential in the size of its input. The algorithmic theory says that, unless we do not have a choice, polynomial time algorithms are to be preferred to pseudopolynomial methods.

The next section investigates the problem of minimal RN in periodic schedules.
COMPUTING THE PERIODIC REGISTER SUFFICIENCY
The literature contains many techniques about reducing the periodic RN for a given fixed II. In this section, we want to compute the minimal RN for any valid SWP independently of II. We call it the PRS to distinguish it from the classical register sufficiency in basic blocks. We define it as

$PRS(G) = \min_{\sigma \in \Sigma(G)} RN_\sigma(G)$. (2)
Computing the PRS allows us, for instance, to determine if spill code cannot be avoided for a given loop: If $R$ is the number of available registers and if $PRS(G) > R$, then there are not enough registers to allocate to any loop schedule. Spill code necessarily has to be introduced, independently of II.
The complexity of computing the register sufficiency in ILP codes (regardless of whether they are basic blocks or loops) remains an open problem. It was proven that computing an instruction ordering that minimizes the number of required registers is NP-complete in the case of sequential codes [19], that is, when we compute a strict sequential execution order. If we do not restrict the schedule to being sequential, the problem is different. It was proven in [4] that the problem of (parallel) scheduling under register constraints is NP-complete under the condition that the total schedule time is bounded. As far as we know, there are no known results about the problem of scheduling parallel operations to minimize the number of registers (without spill and without resource constraints) without bounding the total schedule time.

This section tries to give a formula that allows us to compute PRS for any SWP schedule. Since we are not able to compute the RN independently of II, an obvious method would be to compute the minimal RN under a fixed II and then iterate over all possible values of II until reaching a limit or when $II = L$ (the maximal allowed II). First, such a method is complex because computing the minimal periodic RN under a fixed II is NP-complete [4]. Second, it requires solving many optimization problems (one for each considered II).
The first step toward our goal is to give a sufficient condition such that the minimal RN under a fixed II would be greater than or equal to the one computed with $II + 1$. The next section investigates this aspect.
Minimal Register Need versus Initiation Interval
It is intuitive that the lower the II, the higher the register pressure since more parallelism requires more memory. If we succeed in finding an SWP schedule that needs $R$ registers and without assuming any resource conflicts, then it is possible to get another SWP schedule that needs not more than $R$ registers with a higher II. Until now, such an assertion has not been proven. We show here that increasing the maximal duration $L$ is a sufficient condition. Theorem 2 gives a sufficient condition so that the minimal RN does not increase when II increases. The usual intuition suggests that increasing II would decrease the register pressure [9], [10], [15], [17], [24]. The extant algorithms often attempt to deal with the excess register pressure by increasing II since, intuitively, less parallelism would require fewer registers. Some algorithms implicitly allow $L$ to increase as well (although, sometimes, $L$ is kept bounded to avoid hurting performance for short trip counts or to avoid long prologue/epilogue code). However, we have a counterexample demonstrating that increasing II does not necessarily reduce the RN if $L$ is not also allowed to increase: It may even increase.

Fig. 5 shows a counterexample. The first part presents the DDG of a loop extracted from the spice benchmark (SPEC95). The label of each edge $e$ is the pair $(\delta(e), \lambda(e))$. The second part of Fig. 5 plots the minimal RN for different fixed IIs. It has been computed by using an exact optimal integer linear programming approach, as presented in [21] (without any resource constraints). The maximal duration $L$ has been fixed to the sum of all the latencies, that is, $L = 20$ in this example. As can be seen, if $L$ is fixed for all II, then the minimal RN can increase when incrementing II. This example shows that the minimal RN may not be a decreasing function of II, as is commonly believed. We have to allow $L$ to increase when incrementing II.
For instance, when $II = 12$, we computed that the minimal RN is four registers with $L = 20$. If we want to guarantee that the minimal RN will not increase when $II = 12 + 1 = 13$, we have to set (according to Theorem 2) a new maximal duration $L' = 22$, that is, we have to increase $L$ by two clock cycles to give more freedom to the SWP scheduler so that it can decrease (or keep constant) the RN. Otherwise, we would require at least five registers as plotted. This section provided a relationship between the minimal RN and II which we will use in the next section to compute the register sufficiency.
Computing the Periodic Register Sufficiency
The PRS defined by (2) is called the absolute register sufficiency because it is defined for all valid SWP schedules belonging to $\Sigma(G)$ (an infinite set). In this section, we will compute PRS for a finite subset $\Sigma_L(G) \subset \Sigma(G)$, that is, for the set of SWP schedules such that the duration does not exceed $L$. This is because many practical SWP schedulers assume a bounded duration $L$ in order to limit the prologue/epilogue size. As we will show later, one can choose a value for $L$ such that

$PRS(G) = \min_{\sigma \in \Sigma_L(G)} RN_\sigma(G)$.
Many techniques show how we can determine the minimal RN, given a fixed II [1], [5], [17], [21]. If we use such methods to compute the PRS, we have to solve many combinatorial problems, one for each II, starting from MII to a maximal duration $L$. Fortunately, the following corollary states that it is sufficient to compute the PRS by solving a unique optimization problem, with $II = L$, if we increase the maximal duration (the new maximal duration is denoted $L'$ to distinguish it from $L$). Let us start by the following, which is a direct consequence of Theorem 2:

Lemma 1. Let $G = (V, E, \delta, \lambda)$ be a DDG with zero delays in accessing registers. The minimal RN of all the SWP schedules with an II, assuming a duration of at most $L$, is greater than or equal to the minimal RN of all the SWP schedules with an II where
In other words, Corollary 1 proves the following implication:
where the value of $L'$ is given by Corollary 1. Corollary 1 enables us to solve a unique problem of minimal RN under a fixed $II = L$: The maximal duration must be increased with the proven recurrent sequence. If the initial $L$ is sufficiently large, the computed PRS with $II = L$ is equal to the absolute PRS, that is, the minimal RN of any valid SWP of the loop. If $L$ is not sufficiently large, then we compute the PRS of the subset $\Sigma_L(G)$, which may be greater than the absolute PRS. Fig. 6 draws the theoretical asymptotic curves to explain the meaning of Corollary 1. If we fix $L$ as a maximal duration for all values of II, the minimal RN is not always a decreasing function of II. At a certain value of II, the minimal RN may increase if the duration $L$ is not relaxed (the evidence is shown in Fig. 5).
We prove in Theorem 2 that the curve is a nonincreasing function if the maximal duration $L'$ is increased when we increment II (Lemma 1 and Corollary 1). If $L$ is sufficiently large, the minimal RN at the point $II = L$ is exactly the absolute PRS. Such an appropriate $L$ is necessarily finite because PRS is a finite integer and, hence, there necessarily exists an SWP schedule that requires PRS registers. Formally computing a suitable finite large $L$ remains an open problem. We think that $L = \sum_{u \in V} latency(u)$ would be convenient. It corresponds to the case when lifetime intervals may constitute a sequence of chains inside the SWP kernel. However, in practical cases, $L$ should be bounded by the compiler (or by the user) in order to bound the prologue/epilogue code size. Thus, Corollary 1 gives us the way to compute the PRS for the class of schedules that belong to $\Sigma_L(G)$.
STAGE SCHEDULING UNDER REGISTER CONSTRAINTS
Stage scheduling, as studied in [3], is an approach that periodically schedules loop operations, given a fixed II and a fixed reservation table (that is, after satisfying the resource constraints). In other terms, the problem is to compute the minimal RN, given a fixed II and fixed row numbers $rn$, whereas column numbers $cn$ are left free (that is, variables to optimize). This problem has been proved NP-complete by Huard in [7]. A careful study of his proof allows us to deduce that the complexity of this problem comes from the fact that the last users of the values are not known before scheduling the loop. Mangione-Smith et al. [13] proved that stage scheduling under register constraints has a polynomial time complexity in the case of data dependence trees and a forest of trees. This section proves a more general case than [13] by showing that, if the killer is known before scheduling, as in the case of expression trees, then stage scheduling under register constraints is a polynomial problem. We will see that we can deduce it by using the formula of the RN given in (1). Before proving this general case, we first start by proving it for the case of trees (for clarity). Let us begin by writing the formal problem of SWP with RN minimization:
$$\text{Minimize } RN_\sigma(G) \quad \text{subject to: } \sigma_v - \sigma_u \ge \delta(e) - II \times \lambda(e), \quad \forall e = (u, v) \in E. \tag{3}$$
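This standard problem has been proven to be NP-complete in [4], even for trees and chains. Eichenberger et al. [3] studied a modified problem by considering a fixed reservation table. By considering the row and column numbers $\sigma_u = rn(u) + II \times cn(u)$, fixing the reservation table amounts to fixing the row numbers while leaving the column numbers as free integral variables. Thus, by considering the given row numbers as conditions, (3) becomes

$$\text{Minimize } RN_\sigma(G) \quad \text{subject to: } cn(v) - cn(u) \ge \left\lceil \frac{\delta(e) - II \times \lambda(e) + rn(u) - rn(v)}{II} \right\rceil, \quad \forall e = (u, v) \in E. \tag{5}$$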
It is clear that the constraints matrix of (5) constitutes an incidence matrix of the graph $G$. If we succeed in proving that the objective function $RN_\sigma(G)$ is a linear function of the cn variables, then (5) becomes an integer linear programming system with a totally unimodular constraints matrix and, consequently, it can be solved with polynomial time algorithms [18]. Since the problem of stage scheduling defined by (5) has been proven to be NP-complete, $RN_\sigma(G)$ cannot be expressed as a linear function of cn for an arbitrary DDG (unless the complexity classes P and NP coincide). In this section, we restrict ourselves to the DDGs where each value $u \in V_R$ has a unique possible killer $k(u)$, such as expression trees. In an expression tree, each value $u \in V_R$ has a unique killer $k_u$ that belongs to the same original iteration, that is, $\lambda((u, k_u)) = 0$. With this latter assumption, we will prove in the remainder of this section that $RN_\sigma(G)$ is a linear function of the column numbers. Let us begin by recalling the formula of $RN_\sigma(G)$:
$$RN_\sigma(G) = \sum_{(l, r, p) \in C(G)} p + w(C(G))$$
The first term corresponds to the total number of turns around the circle, whereas the second term corresponds to the maximal number of fractional intervals that are simultaneously alive (the width of the circular fractional graph). We set $P = \sum_{(l, r, p) \in C(G)} p$ and $W = w(C(G))$. For every circular interval $(l, r, p) \in C(G)$ of a value $u \in V_R$, the number of turns around the circle, $p$, is the integral part of the lifetime length of $u$ divided by II. Since each value $u$ has a unique possible killer $k_u$ belonging to the same original iteration (the case of expression trees), this lifetime ends when $k_u$ reads $u$, and substituting $\sigma_u = rn(u) + II \times cn(u)$ turns each $p$ into $cn(k_u) - cn(u)$ plus a term that depends only on the row numbers and II.
Here, we succeed in writing $P = \sum p$ as a linear function of the column numbers cn since rn and II are constants in (5). Now, let us explore $W$. The fractional graph contains the fractional intervals $\{(l, r) \mid (l, r, p) \in C(G)\}$. Each fractional interval $(l, r)$ of a value $u \in V_R$ depends only on the row numbers and II as follows:
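• Left end: $l$ is the date at which $u$ becomes alive, taken modulo II; it depends only on $rn(u)$ and II.
• Right end: $r$ is the date at which $k_u$ kills $u$, taken modulo II; it depends only on $rn(k_u)$ and II.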
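As can be seen, the fractional intervals depend only on row numbers and II, which are constants in (5). Hence, $W$, which is the width of the circular fractional graph, is a constant too. From all of the previous formulas, we deduce that $RN_\sigma(G) = P + W$ is the sum of a linear function of the column numbers and a constant.
Equation (7) thus lets us rewrite (5) as problem (8), in which the objective function is linear in the cn variables.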
The constraints matrix of (8) is an incidence matrix, so it is totally unimodular, and (8) can be solved with a polynomial time algorithm. This proves that stage scheduling of expression trees is a polynomial problem. Now, we can consider the larger class of DDGs that assign a unique possible killer $k_u$ to each value $u$. Such a killer can belong to a different iteration, at distance $\lambda_k = \lambda((u, k_u))$. The problem of stage scheduling in this class of loops also remains polynomial, as follows:
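1. If the DDG is acyclic, then we can apply a loop retiming [11] to bring all of the killers to the same iteration. Thus, we come back to a case similar to the expression trees studied in this section.
2. If the DDG contains cycles, it is not always possible to shift all of the killers to the same iteration. Thus, by including the constants $\lambda_k$ in the formula, $P$ becomes equal to its previous expression plus the constant $\sum_{u \in V_R} \lambda_k$: a killer that belongs to an iteration $\lambda_k$ iterations later lengthens the lifetime of $u$ by $\lambda_k \times II$, that is, by exactly $\lambda_k$ turns around the circle.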
Since II and the row numbers are constants, $W$ remains a constant: lengthening a lifetime by a multiple of II does not move the ends of its fractional interval, which still depend only on the row numbers and II.
Consequently, $RN_\sigma(G)$ remains a linear function of the column numbers, which means that (8) can still be solved via polynomial time algorithms (usually with network flow algorithms).
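The linearity argument of this section can be checked numerically. The following Python sketch is an illustration under stated assumptions, not the method of the paper: it assumes zero delays in accessing registers, takes the lifetime of a value $u$ to run from the issue date of $u$ to the issue date of its unique killer, and uses a hypothetical three-node expression tree. All function names are ours, and the exhaustive search only stands in for the network flow algorithms that a real implementation would use.

```python
from itertools import product


def column_bounds(edges, rn, ii):
    """Bounds b such that cn[v] - cn[u] >= b, one per edge (u, v, delta, lam).

    Obtained from sigma_v - sigma_u >= delta - ii * lam with
    sigma = rn + ii * cn; -(-a // ii) is the integer ceiling of a / ii.
    """
    return [(u, v, -(-(delta - ii * lam + rn[u] - rn[v]) // ii))
            for u, v, delta, lam in edges]


def turns_from_dates(cn, rn, killer, ii):
    """P computed from issue dates (assumption: zero access delays)."""
    total = 0
    for u, (k, lam_k) in killer.items():
        birth = rn[u] + ii * cn[u]
        death = rn[k] + ii * (cn[k] + lam_k)
        total += (death - birth) // ii
    return total


def turns_linear(cn, rn, killer, ii):
    """The same quantity written as a linear function of the column numbers."""
    return sum(cn[k] - cn[u] + lam_k + (rn[k] - rn[u]) // ii
               for u, (k, lam_k) in killer.items())


def best_columns(nodes, edges, rn, killer, ii, max_cn=3):
    """Exhaustive search over cn values in [0, max_cn]; for illustration only."""
    bounds = column_bounds(edges, rn, ii)
    best = None
    for values in product(range(max_cn + 1), repeat=len(nodes)):
        cn = dict(zip(nodes, values))
        if all(cn[v] - cn[u] >= b for u, v, b in bounds):
            cost = turns_linear(cn, rn, killer, ii)
            if best is None or cost < best[0]:
                best = (cost, cn)
    return best


# Hypothetical expression tree: c consumes a and b, so k(a) = k(b) = c.
nodes = ["a", "b", "c"]
edges = [("a", "c", 1, 0), ("b", "c", 4, 0)]   # (u, v, latency, distance)
killer = {"a": ("c", 0), "b": ("c", 0)}        # u -> (k_u, lambda_k)
rn = {"a": 0, "b": 1, "c": 1}                  # fixed row numbers, II = 2
cost, cn = best_columns(nodes, edges, rn, killer, ii=2)
```

Because $W$ does not depend on the column numbers, minimizing $P$ over the feasible column numbers also minimizes the RN. In this small instance, the search delays `a` so that its lifetime no longer spans a complete turn.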
Our result in this section applies to more than expression trees. We extend the previous result [13] in two ways. Fig. 7 shows some examples, where all edges are flow dependences labeled by the pairs $(\delta(e), \lambda(e))$.
1. Cyclic DDGs. Our result takes into account cyclic DDGs with a unique possible killer per value. As an example, Fig. 7a is a cyclic DDG with a unique possible killer per value. Such a DDG is not considered in [13] because it is cyclic: it is neither a tree nor an acyclic DDG.
2. Acyclic DDGs. Our result also takes into account acyclic DDGs with a unique possible killer per value, which are not necessarily trees or forests of trees. For instance, Figs. 7b and 7c are examples of acyclic DDGs where every node has a unique possible killer (because of the transitive relationship between nodes). These DDGs are not trees. Analyzing such a unique killer relationship in general acyclic DDGs can be done by using the so-called potential killing relation, which has been formally defined in [20]. In Fig. 7b, we have the following unique killers: $k(a) = e$, $k(b) = c$, $k(c) = d$, and $k(d) = e$. In Fig. 7c, we have the following unique killers: $k(a) = e$, $k(b) = c$, $k(c) = e$, and $k(d) = e$. All of these killing relationships can be deduced by analyzing the potential killing relation of the DAG [20].
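The unique killer analysis mentioned for acyclic DDGs can be sketched as follows. This Python fragment is a simplified reading of the potential killing relation, not the formal definition of [20]: it assumes that all edges are flow dependences and that a path from a consumer $v$ to another consumer $v'$ of the same value forces $v$ to be scheduled before $v'$, so that $v$ cannot be the last reader. The DAG below is hypothetical. It is chosen so that its killers coincide with those listed for Fig. 7b, but it does not claim to reproduce the edges of that figure.

```python
def descendants(succ, v):
    """Nodes reachable from v (v itself excluded) in the DAG described by succ."""
    seen, stack = set(), list(succ.get(v, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succ.get(n, ()))
    return seen


def potential_killers(succ, u):
    """Consumers of u from which no other consumer of u is reachable."""
    consumers = set(succ.get(u, ()))
    return {v for v in consumers if not (descendants(succ, v) & consumers)}


def unique_killers(succ):
    """Map each consumed value to its killer, or None if some value has several."""
    result = {}
    for u in succ:
        candidates = potential_killers(succ, u)
        if len(candidates) > 1:
            return None
        if candidates:
            result[u] = next(iter(candidates))
    return result


# Hypothetical DAG: a is read by b and e, and e is reachable from b.
dag = {"a": ["b", "e"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
killers = unique_killers(dag)

# A value read by two independent consumers has two potential killers.
fork = {"a": ["b", "c"], "b": [], "c": []}
no_unique_killer = unique_killers(fork)
```

In the first graph, `b` cannot kill `a` because `e` is one of its descendants, which is the transitive relationship invoked in the text. The second graph falls outside the class of DDGs handled in this section.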
CONCLUSION
The work presented in this paper uses formal methods and reasoning to prove new results on the problem of minimizing the periodic RN in periodic scheduling. The first contribution is a novel polynomial method for computing the exact RN of an already scheduled loop ($O(n \lg n)$, where $n$ is the number of statements). The complexity of the existing methods depends on II, which is a pseudopolynomial factor. Our second contribution provides a sufficient condition so that the minimal RN under a fixed II does not increase when incrementing II. We give an example to show that it is sometimes possible for the minimal RN to increase when II is incremented. Such a situation may occur when the maximal duration $L$ is not relaxed (increased). This fact contradicts the common belief that incrementing II would require fewer registers (unless the constraint on $L$ is loosened).
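To make the first contribution concrete, the following Python sketch computes the register need of a scheduled loop as the total number of turns plus the width of the circular fractional intervals. It is a minimal illustration under our own assumptions, not the algorithm of the paper: lifetimes are given as integer pairs read as half-open intervals `[start, end)`, and the function names are hypothetical. The sort dominates the cost, which gives the $O(n \lg n)$ behavior, and a brute-force count over one period serves as a reference.

```python
def register_need(lifetimes, ii):
    """Register need as (sum of turns) + (width of the circular intervals).

    lifetimes: one (start, end) pair per value, with start <= end, read as
    the half-open interval [start, end) (an assumption of this sketch).
    """
    turns = 0
    events = []  # (time, +1 or -1) for the fractional parts, laid out on [0, ii]
    for start, end in lifetimes:
        p, frac = divmod(end - start, ii)  # p complete turns, frac < ii remains
        turns += p
        if frac == 0:
            continue
        left = start % ii
        right = left + frac
        if right <= ii:
            events += [(left, +1), (right, -1)]
        else:  # the fractional interval wraps around the circle
            events += [(left, +1), (ii, -1), (0, +1), (right - ii, -1)]
    events.sort()  # at equal times, -1 sorts before +1 (half-open intervals)
    width = live = 0
    for _, delta in events:
        live += delta
        width = max(width, live)
    return turns + width


def register_need_bruteforce(lifetimes, ii):
    """Reference value: count the live copies at each time step of one period."""
    return max(
        sum((t - start) // ii - (t - end) // ii for start, end in lifetimes)
        for t in range(ii)
    )


# Hypothetical lifetimes of three values in a loop scheduled with II = 3.
example = [(0, 5), (1, 3), (2, 9)]
fast = register_need(example, 3)
slow = register_need_bruteforce(example, 3)
```

The brute-force version visits every time step of the period and therefore depends on II, whereas the sweep depends only on the number of values.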
Guaranteeing that RN is a nonincreasing function of II when relaxing the maximal duration now allows us to easily state the formal problem of scheduling under register constraints instead of scheduling with register minimization, as is usually done in the literature. Indeed, according to our results, we can finally apply a binary search on II. If we have a fixed number R of available registers, and since we know how to increase L so that the curve of RN versus II becomes nonincreasing, we can use successive binary search steps on II until we reach an RN that does not exceed R. The number of such binary search steps is at most $\log_2(L)$.
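The binary search described above can be sketched in a few lines of Python. The callable `min_rn` is a hypothetical oracle that returns the minimal RN for a given II once the maximal duration has been relaxed as required. Computing it is a separate problem that this sketch does not address. The search is only valid under the assumption that `min_rn` is nonincreasing in II.

```python
def smallest_ii_within_budget(min_rn, registers, ii_low, ii_high):
    """Smallest II in [ii_low, ii_high] with min_rn(II) <= registers, or None.

    Assumes that min_rn is a nonincreasing function of II.
    """
    if min_rn(ii_high) > registers:
        return None  # even the largest II needs too many registers
    low, high = ii_low, ii_high
    while low < high:
        mid = (low + high) // 2
        if min_rn(mid) <= registers:
            high = mid       # mid fits the budget: look for a smaller II
        else:
            low = mid + 1    # mid needs too many registers: increase II
    return low


# Hypothetical nonincreasing curve of the minimal RN versus II.
curve = {1: 9, 2: 7, 3: 7, 4: 5, 5: 4, 6: 4}
best_ii = smallest_ii_within_budget(curve.__getitem__, registers=5, ii_low=1, ii_high=6)
```

Each iteration halves the candidate range, so the number of oracle calls grows logarithmically with the size of the II range.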
Our third contribution in this paper proves that the minimal RN computed with a fixed $II = L$ is exactly equal to the PRS if $L$ is sufficiently large, the PRS being the minimal RN over all valid SWP schedules. Computing the PRS allows us to check, for instance, whether introducing spill code is unavoidable, which is the case when the PRS is greater than the number of available registers.
Although stage scheduling under register constraints for arbitrary loops is an NP-complete problem, our fourth and last contribution proves that stage scheduling with register minimization is a polynomial problem in the special case of expression trees and, more generally, in the case of DDGs that provide a unique possible killer per value. This generalization is made possible by our new polynomial method of RN computation. This paper also raises new open problems. First, an interesting open question would be to provide a necessary condition so that the RN would be a nonincreasing function of II. Second, in the presence of architectures with nonzero delays in accessing registers, is Theorem 2 still valid? In other words, can we provide any guarantee that the minimal RN in such architectures does not increase when incrementing II? Third, we have shown that there exists a finite value of L such that the PRS assuming a maximal duration L is equal to the absolute PRS without assuming any bound on the duration. The open question is how to compute such an appropriate value of the maximal duration. Fourth and last, we require a DDG analysis algorithm to check whether each value has only one possible killer. We have already published such an algorithm for the case of directed acyclic graphs [20], but the problem here is to extend it to cyclic graphs.
ACKNOWLEDGMENTS
Most of the research results of the current paper (except in Section 5) were found thanks to the valuable support of Christine Eisenbeis from INRIA. The result in Section 5 was found thanks to the support of Professor William Jalby from the University of Versailles Saint-Quentin (UVSQ). The author wishes to thank Alain Darte from the Ecole Normale Supérieure de Lyon (ENS-Lyon) for his scientific influence. Finally, valuable advice and corrections were provided by colleagues and anonymous reviewers. This work was partially supported by the French National Research Agency (ANR MOPUCE project).
