To sustain the increases in processor performance, embedded and real-time systems need to find the best total schedule time when compiling their applications. The optimal acyclic scheduling problem is a classical challenge which has been formulated using integer programming in many works. In this paper, we give a new formulation of the acyclic instruction scheduling problem under registers and resources constraints in multiple instruction issue processors with cache effects. Given a directed acyclic graph $G = (V, E)$, the complexity of our integer linear programming model is bounded by $O(|V|^2)$ variables and $O(|E| + |V|^2)$ constraints.
INTRODUCTION
Current compilers try to take advantage of the instruction level parallelism (ILP) present in today's processors. Multiple operations are issued in the same clock cycle to increase the throughput of the executed operations. Completing a computation in the shortest time is a scheduling problem constrained by many factors. The most important ones are the data dependencies, the availability of the hardware features and the memory hierarchy constraints. The latter include the registers constraints and the cache effects. While the registers constraints impose that the number of values simultaneously alive must not exceed the number of available registers, the cache effects are different: cache misses are only a source of performance bottlenecks, because a miss penalty may stall the processor. Furthermore, the cache behavior is difficult to predict statically, making the verification and optimization of real time applications harder. In our formulation, we give a first approach to handle compulsory (cold start) cache misses where the memory access operations exhibit some spatial or temporal locality [10].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES 01 Copenhagen Denmark Copyright ACM 2001 1-58113-364-2/01/04...$5.00
Theoretical studies on scheduling reveal that integrating the resources constraints [2] or the registers constraints [4] leads to two NP-complete problems. Scheduling under both the registers and resources constraints becomes a complex task, for which general compilers use heuristics to get an optimized schedule in polynomial time. However, embedded and real time systems can require the optimal (best) schedule. We then have to write a "good" formulation of the problem in order to reduce the resolution time. Many works have been done using integer linear programming (intLP) models [15, 7, 8, 9, 4, 1, 5, 3]. In our work, we present a new formulation of acyclic scheduling such that the complexity of the generated model is lower than that of these existing techniques, while we include some cache optimizations, as we explain at the end of this paper. Our formulation should reduce the resolution time since we considerably reduce the number of variables and constraints in the generated intLP model. This paper is organized as follows. We first present the model of the targeted processors in Sect. 2 and the directed acyclic graph (DAG) to be scheduled in Sect. 3: in our study, we assume heterogeneous FUs, more than one register type, and delayed latencies of writing into and reading from registers. The problem of acyclic scheduling is briefly recalled in Sect. 4. Afterwards, we define some intLP modeling techniques in Sect. 5. We use these techniques to write our intLP formulation in Sect. 6. We present some related work in this field in Sect. 7 and conclude with our remarks and perspectives in Sect. 8.
PROCESSOR DESCRIPTION
An ILP processor [12] takes advantage of the inherent parallelism in the instruction flow and issues multiple operations per clock cycle thanks to pipelined execution and the presence of multiple functional units (FUs). An operation can be executed on one or more FUs. We model the complex behavior of the execution of the operations on the FUs by reservation tables. We attach to each instruction a reservation table (RT) which describes at which clock cycles a FU is busy due to the execution of this instruction on it. A RT is a two-dimensional table where the number of lines is the latency of the operation and the columns are the FUs. Given the RT of an instruction $u$, $RT_u[c, q] = 1$ means that $u$ executes on the FU $q$ during the clock cycle $c$ after its issue.
The target processor $\mathcal{P}$ is described by $\mathcal{T}$, the set of its register types (float, int, etc.), its hardware resources, and the set of instructions which execute on these resources. The hardware resources are the set of the FUs $Q = \{q_1, \dots, q_M\}$ such that $N_q$ is the number of copies of the FU $q \in Q$. We associate to each instruction $u$ its reservation table $RT_u$.
DAG MODEL
A DAG $G = (V, E, \delta)$ consists of a set of operations $V$ and a set of arcs $E$ which contains the data dependences between the operations together with any other precedence constraints. Each operation $u$ has a latency $\delta(u)$. We assume one sink operation $\bot$ in $G$ which reflects the total schedule time: if there is more than one sink node, we add the virtual node $\bot$ with an arc $e$ from each sink $s$ to $\bot$ with $\delta(e) = lat(s)$. A valid schedule of $G$ is a positive integer function $\sigma$ which associates to each operation $u$ an issue time $\sigma(u)$. Any acyclic schedule $\sigma$ of $G$ must ensure that:

$$\forall e = (u, v) \in E: \quad \sigma(v) - \sigma(u) \ge \delta(e)$$
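The precedence requirement on a schedule can be checked mechanically. The following is a minimal sketch, assuming the usual condition $\sigma(v) - \sigma(u) \ge \delta(e)$ for every arc $e = (u, v)$; the function name `is_valid_schedule`, the toy DAG and its latencies are hypothetical and only serve as an illustration.

```python
def is_valid_schedule(arcs, sigma):
    """Check sigma(v) - sigma(u) >= delta(e) for every arc e = (u, v).

    arcs:  dict mapping an arc (u, v) to its latency delta(e)
    sigma: dict mapping each operation to its issue time
    """
    return all(sigma[v] - sigma[u] >= latency
               for (u, v), latency in arcs.items())


# Hypothetical four-node DAG; "bot" plays the role of the sink node.
arcs = {("a", "b"): 2, ("a", "c"): 1, ("b", "bot"): 3, ("c", "bot"): 1}

good = {"a": 0, "b": 2, "c": 1, "bot": 5}
bad = {"a": 0, "b": 1, "c": 1, "bot": 5}   # b is issued too early after a

print(is_valid_schedule(arcs, good))
print(is_valid_schedule(arcs, bad))
```

In this toy case the issue time of the sink, `sigma["bot"]`, plays the role of the total schedule time.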
In this paper, we consider that each operation $u \in V$ writes into at most one register of a type $t \in \mathcal{T}$. The operations which define multiple values with different types are accepted in our model iff they do not define more than one value of a certain type. We denote by $u^t$ the value of type $t$ defined by the operation $u$. We also consider the following sets:
1. $V_{R,t}$ is the set of the values of type $t \in \mathcal{T}$;
2. $E_{R,t}$ is the set of the flow dependency arcs through the values of type $t \in \mathcal{T}$. If some values are not read in the DAG, or are still read after leaving this DAG, these values have to be kept in registers. We then consider that there is a flow arc from these values to $\bot$.
Finally, we consider that reading from and writing into a register can be delayed from the beginning of the schedule time (VLIW case). We define the two delay functions $\delta_{r,t}$ and $\delta_{w,t}$ such that an operation $u$ reads from a register of type $t$ at the clock cycle $\sigma(u) + \delta_{r,t}(u)$ and writes into a register of type $t$ at the clock cycle $\sigma(u) + \delta_{w,t}(u)$.
ACYCLIC SCHEDULING PROBLEM
A valid schedule $\sigma$ of $G$ is first constrained by the inherent data dependency relations between the operations and by any other serial constraints. The limitations of the target architecture impose other constraints, namely the limited numbers of resources and registers.
Resources Constraints
The resources constraints are simply the fact that the total number of operations which execute on a FU $q$ during a clock cycle $c$ must not exceed $N_q$, the number of copies of this FU. By using the reservation tables, an operation $u$ executes on a FU $q$ during a clock cycle $c$ iff $RT_u[c - \sigma(u), q] = 1$. Formally, the resources constraints are:

$$\forall c \in \mathbb{N}, \ \forall q \in Q: \quad \sum_{u \in V} RT_u[c - \sigma(u), q] \le N_q$$

where $RT_u[c', q]$ is taken as 0 whenever $c'$ falls outside the reservation table of $u$.
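The resources constraints can be illustrated by counting, for every clock cycle and every FU, how many operations occupy that FU. This is a minimal sketch under stated assumptions: each reservation table is encoded as a set of pairs `(c, q)` meaning $RT_u[c, q] = 1$, and the operation names, FU names and numbers of copies are hypothetical.

```python
from collections import Counter


def fu_usage(rt, sigma):
    """Count how many operations occupy each FU at each absolute cycle.

    rt[u] is a set of (c, q) pairs meaning RT_u[c, q] = 1, with c relative
    to the issue time sigma[u].
    """
    usage = Counter()
    for u, cells in rt.items():
        for c, q in cells:
            usage[(sigma[u] + c, q)] += 1
    return usage


def respects_resources(rt, sigma, copies):
    """True iff no FU q is used by more than copies[q] operations in a cycle."""
    return all(count <= copies[q]
               for (_, q), count in fu_usage(rt, sigma).items())


# Hypothetical reservation tables: each multiply holds the MUL unit 2 cycles.
rt = {
    "mul1": {(0, "MUL"), (1, "MUL")},
    "mul2": {(0, "MUL"), (1, "MUL")},
    "add": {(0, "ALU")},
}
overlapping = {"mul1": 0, "mul2": 1, "add": 0}
separated = {"mul1": 0, "mul2": 2, "add": 0}

print(respects_resources(rt, overlapping, {"MUL": 1, "ALU": 1}))
print(respects_resources(rt, separated, {"MUL": 1, "ALU": 1}))
```

With a second copy of the hypothetical `MUL` unit, the overlapping schedule becomes acceptable as well.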
Registers Constraints
A value $u^t \in V_{R,t}$ is alive from the first step after the writing of $u^t$ until its last reading (consumption). The set of the consumers of a value $u^t \in V_{R,t}$ is the set of the operations which read it:

$$Cons(u^t) = \{v \in V / (u, v) \in E_{R,t}\}$$

The last consumption of a value is called the killing date and is noted:

$$\forall u^t \in V_{R,t}: \quad kill(u^t) = \max_{v \in Cons(u^t)} \big(\sigma(v) + \delta_{r,t}(v)\big)$$
We assume that a value written at a clock cycle $c$ in a register is available one step later. That is to say, if an operation $u$ reads from a register at a clock cycle $c$ while an operation $v$ is writing into it at the same clock cycle, $u$ does not get $v$'s result but gets the value that was previously stored in that register. Then, the lifetime interval $LT_{u^t}$ of the value $u^t$ is $]\sigma(u) + \delta_{w,t}(u), kill(u^t)]$. Given the lifetime intervals of all the values, the number of registers of type $t$ needed to store all the defined values is the maximum number of values of type $t$ that are simultaneously alive. We call this number the register need (requirement) of the schedule $\sigma$, and we note it $RN^t_\sigma(G)$. This register need is computed by building the undirected interference graph $H_t = (V_{R,t}, \mathcal{E}_t)$, such that $u^t$ and $v^t$ are adjacent iff they are simultaneously alive, i.e. their lifetime intervals interfere. Then, the maximal number of values simultaneously alive is the cardinality of the maximal clique (complete subgraph) of $H_t$.
Since the number $\mathcal{R}_t$ of available registers of type $t$ is limited in the target processor, we need to find a schedule which does not need more than $\mathcal{R}_t$ registers:

$$\forall t \in \mathcal{T}: \quad RN^t_\sigma(G) \le \mathcal{R}_t$$
If such a schedule does not exist, spill code has to be generated, i.e. we must store some values in memory rather than in registers. Spilling increases the total schedule time because it inserts new operations and because the spilled data may cause cache misses. We do not handle spill code in this paper.
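The register need of a given schedule can be computed directly from the lifetime intervals. The following is a minimal sketch, assuming integer clock cycles and intervals given as pairs `(start, kill)` read as $]start, kill]$; the function name and the example intervals are hypothetical.

```python
def register_need(intervals):
    """Maximum number of values simultaneously alive.

    Each interval (start, kill) is read as ]start, kill]: the value is
    alive at the integer steps start + 1, ..., kill.
    """
    best = 0
    for start, kill in intervals:
        for step in range(start + 1, kill + 1):
            alive = sum(1 for s, k in intervals if s < step <= k)
            best = max(best, alive)
    return best


# ]0, 3] and ]3, 6] do not interfere: the second value only becomes
# available one step after the first one is killed.
three_values = [(0, 3), (1, 4), (3, 6)]
four_values = three_values + [(2, 5)]

print(register_need(three_values))
print(register_need(four_values))
```

A schedule producing `four_values` would be rejected on a hypothetical processor offering only two registers of this type, while one producing `three_values` would be accepted.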
Cache Effects
In the area of fine grain scheduling, the cache effects are rarely taken into account because their behavior differs from one platform to another. Furthermore, reducing the cache effects may require more registers in order to issue more operations during the miss stall cycles, and may sometimes require extensive code size expansion due to the loop unrolling needed to exhibit more ILP. To exploit this ILP, the memory load which causes a cache miss must be issued well ahead of the operation which requires the loaded data, in order to reduce the cache miss stall cycles to a minimum. The scheduling method used in this paper is based on this technique: we try to cover the compulsory misses when a subset of memory loads access the same cache line.
Given some memory load operations accessing the same cache line, the first issued load causes a compulsory cache miss and brings the entire line into the cache, while the subsequent accesses to the loaded cache line are hits. To fix ideas, we assume the following scenario. We call leading cache effect [14] the penalty for a miss reference, and we note it $lce$ (footnote 1). A subsequent reference to the same cache line suffers a trailing cache effect $tce$ due to the latency of fully servicing the miss: the requested data which causes the miss bypasses the cache and goes directly from the memory bus to the CPU, while the subsequent hits must wait $tce$ cycles for the whole cache line to be loaded into the cache.
As described above, the cache effects make the latencies of the memory operations vary with the schedule: there is an inter-dependence between the schedule and the cache effects. For instance, suppose that three memory loads $a$, $b$, $c$ access the same cache line. If $a$ is scheduled before $b$ and $c$, then this load is an essential (compulsory) miss which cannot be eliminated. The latency of $a$ must be set to $\delta(a) = lce$ if we want to avoid stalling the processor. To eliminate the trailing cache effects of $b$ and $c$, we must issue them at least $(lce + tce)$ clock cycles after the schedule time of $a$.
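The scenario with the loads $a$, $b$, $c$ can be sketched as follows. This is only an illustration under stated assumptions: the values chosen for `lce`, `tce` and the hit latency are hypothetical, and ties between loads issued at the same cycle are broken arbitrarily.

```python
def load_latencies(sigma, line_loads, lce, hit_latency):
    """Latency of each load of one cache line under the schedule sigma.

    The earliest issued load is the compulsory miss and gets latency lce;
    the other loads of the line are hits.
    """
    leader = min(line_loads, key=lambda u: sigma[u])
    return {u: (lce if u == leader else hit_latency) for u in line_loads}


def trailing_effects_covered(sigma, line_loads, lce, tce):
    """True iff every non-leading load is issued at least lce + tce
    cycles after the leading load of its cache line."""
    leader = min(line_loads, key=lambda u: sigma[u])
    return all(sigma[u] - sigma[leader] >= lce + tce
               for u in line_loads if u != leader)


LCE, TCE, HIT = 10, 4, 2          # hypothetical cycle counts
loads = ["a", "b", "c"]
covered = {"a": 0, "b": 14, "c": 20}
too_close = {"a": 0, "b": 5, "c": 20}

print(load_latencies(covered, loads, LCE, HIT))
print(trailing_effects_covered(covered, loads, LCE, TCE))
print(trailing_effects_covered(too_close, loads, LCE, TCE))
```

The sketch only evaluates a given schedule; in the intLP model the same dependence between issue times and latencies has to be expressed with linear constraints.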
INTEGER LINEAR PROGRAMMING
An integer linear programming problem (intLP) [11] is to solve: maximize (or minimize) $cx$ subject to $Ax = b$, with $c, x \in \mathbb{N}^n$, $x \ge 0$, and $A$ an $(m \times n)$ constraints matrix. This is the standard formulation. In fact, we can use other linear constraints ($\le$, $\ge$, $<$, $>$, $=$).
Logical Operators
Intrinsically, an intLP model defines the conjunctive operator $\wedge$: all the constraints written in the model, for instance those of two constraints matrices $A$ and $A'$, must hold together. A disjunction between two constraints, one on a function $g$ and one on a function $h$, is written by introducing a binary variable $\alpha \in \{0, 1\}$, where $\underline{g}$ and $\underline{h}$ are two known non-null finite lower bounds for $g$ and $h$ respectively. We generalize to an arbitrary number of constraints in an $n$-disjunctive formula $\vee_n$. Since the dichotomy operator $\vee$ is associative, we group the constraints two by two from left to right. There are $(n - 1)$ internal $\vee$ operators in such a grouping.

Footnote 1: This latency depends on the memory access latency and the memory bus bandwidth.
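One common way to linearize a disjunction with a binary variable and finite lower bounds can be checked by brute force. The sketch below assumes that the disjunction has the form $g(x) \ge 0 \vee h(x) \ge 0$ and that it is rewritten as $g(x) \ge \alpha \underline{g} \wedge h(x) \ge (1 - \alpha) \underline{h}$; this is a standard technique and not necessarily the exact system used in this paper. The bounds `G_LB` and `H_LB` are hypothetical.

```python
def disjunction_holds(g, h):
    """The logical formula g >= 0 or h >= 0."""
    return g >= 0 or h >= 0


def linearized_holds(g, h, g_lb, h_lb):
    """True iff some binary alpha satisfies both linear constraints
    g >= alpha * g_lb and h >= (1 - alpha) * h_lb."""
    return any(g >= alpha * g_lb and h >= (1 - alpha) * h_lb
               for alpha in (0, 1))


G_LB, H_LB = -5, -7   # hypothetical negative lower bounds of g and h

# Enumerate every value that g and h can take above their lower bounds.
equivalent = all(
    disjunction_holds(g, h) == linearized_holds(g, h, G_LB, H_LB)
    for g in range(G_LB, 6)
    for h in range(H_LB, 8)
)
print(equivalent)
```

Setting $\alpha = 0$ enforces the constraint on $g$ and relaxes the one on $h$ to its lower bound, which always holds; $\alpha = 1$ does the opposite.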
EQUIMINMAX FORMULATION
In this section, we define a new formulation of the scheduling problem using integer linear programming (intLP). We named it EquiMinMax because it uses the linear constraints which express the equivalence relation ($\Longleftrightarrow$) and the functions $\min_n$ and $\max_n$.
Basic Variables and Objective Function
For any operation $u \in V$, we define an integer variable $\sigma_u$ which computes its schedule time. The objective function of our model is to minimize the total schedule time, i.e.
Minimize $\sigma_\bot$
The first linear constraints describe the precedence relations. For any operation excluding the memory accesses, the latencies $\delta(u)$ are known statically. Let $V_l$ be the set of memory (load) operations in $G$. The latency of these operations depends on the schedule time since the latter determines whether a load is a compulsory miss or not. So, we need to define an integer variable $\delta_u$ for each load operation, representing its latency, which is set to the miss penalty $lce$ iff $u$ is a cache miss.
• $\sigma_u \le \overline{\sigma_u} = T - LongestPathFrom(u)$ is the "as late as possible" schedule time according to the worst total schedule time $T$;
Registers Constraints

Interference Graph
The lifetime interval of a value $u^t$ of type $t$ is

$$LT_{u^t} = \Big]\sigma_u + \delta_{w,t}(u), \ \max_{v \in Cons(u^t)} \big(\sigma_v + \delta_{r,t}(v)\big)\Big]$$
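We define for each value $u^t$ the variable $k_{u^t}$ which computes its killing date. The number of $k_{u^t}$ variables is $O(|V_{R,t}|)$. Since the domain of our variables is bounded, we know that $k_{u^t}$ is bounded by the two following finite schedule times:

$$\underline{k_{u^t}} \le k_{u^t} \le \overline{k_{u^t}}$$

where $\underline{\sigma_u}$ and $\overline{\sigma_v}$ denote the earliest and latest possible schedule times of $u$ and $v$, and

• $\underline{k_{u^t}} = \underline{\sigma_u} + \delta_{w,t}(u)$ is the first possible definition date of $u^t$;
• $\overline{k_{u^t}} = \max_{v \in Cons(u^t)} \big(\overline{\sigma_v} + \delta_{r,t}(v)\big)$ is the latest possible killing date of $u^t$.

Footnote 2: The case where no ILP is exploited.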
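We use the $\max_n$ linear constraints to compute $k_{u^t}$, as explained in Sect. 5. The interference between the lifetime intervals is considered for each pair of values of type $t$; there are $(|V_{R,t}| \times (|V_{R,t}| - 1))/2$ such pairs. $LT_{u^t} \cap LT_{v^t} = \emptyset$ means that one of the two lifetime intervals is "before" the other, i.e.

$$LT_{u^t} \prec LT_{v^t} \ \vee \ LT_{v^t} \prec LT_{u^t}$$

where $\prec$ denotes the precedence operator ("before") in the interval algebra. Then, we have to express in the model a binary variable $s^t_{u,v}$ which is set to 1 iff the two lifetime intervals interfere.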
Maximal Clique in the Interference Graph
The maximum number of values of type $t$ simultaneously alive corresponds to a maximal clique in $H_t = (V_{R,t}, \mathcal{E}_t)$, where $(u^t, v^t) \in \mathcal{E}_t$ iff their lifetime intervals interfere ($s^t_{u,v} = 1$). For simplicity, rather than handling the interference graph itself, we prefer considering its complementary graph $H'_t = (V_{R,t}, \mathcal{E}'_t)$ where $(u^t, v^t) \in \mathcal{E}'_t$ iff their lifetime intervals do not interfere ($s^t_{u,v} = 0$). Then, a maximal clique in $H_t$ corresponds to a maximal independent set (footnote 3) in $H'_t$.

To write the constraints which describe the independent sets (IS), we define a binary variable $x_{u^t} \in \{0, 1\}$ for each value $u^t \in V_{R,t}$ such that $x_{u^t} = 1$ iff $u^t$ belongs to an IS of $H'_t$. We must express in the model the linear constraints stating that two values whose lifetime intervals do not interfere cannot both belong to the same IS.

Footnote 3: It is a subgraph such that there are no two adjacent nodes.

The registers constraints are the fact that any set of values simultaneously alive must not exceed the number of available registers $\mathcal{R}_t$. Thereby, we write in the model:

$$\forall t \in \mathcal{T}: \quad \sum_{u^t \in V_{R,t}} x_{u^t} \le \mathcal{R}_t$$

There are $O(|\mathcal{T}|) = O(1)$ such constraints.
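The link between the independent sets of the complementary graph and the register need can be checked on a small instance. This is a brute-force sketch, not the intLP model itself: it enumerates every 0/1 assignment of hypothetical membership variables `x`, keeps those in which no two non-interfering values are selected together, and returns the largest selection. The intervals are read as $]start, kill]$ and are made up for the example.

```python
from itertools import combinations, product


def interfere(a, b):
    """Half-open intervals ]s, k] interfere iff neither is before the other."""
    return a[0] < b[1] and b[0] < a[1]


def max_independent_set_size(intervals):
    """Largest set of values that pairwise interfere, found by enumeration.

    A pair (i, j) of non-interfering values is an edge of the complementary
    graph, so at most one of x[i], x[j] may be set to 1.
    """
    n = len(intervals)
    non_interfering = [(i, j) for i, j in combinations(range(n), 2)
                       if not interfere(intervals[i], intervals[j])]
    best = 0
    for x in product((0, 1), repeat=n):
        if all(x[i] + x[j] <= 1 for i, j in non_interfering):
            best = max(best, sum(x))
    return best


three_lifetimes = [(0, 3), (1, 4), (3, 6)]
four_lifetimes = three_lifetimes + [(2, 5)]

print(max_independent_set_size(three_lifetimes))
print(max_independent_set_size(four_lifetimes))
```

The enumeration is exponential and only meant for tiny examples; in the model, the bound on the sum of the membership variables replaces the explicit search.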
Cache Effects
In this section, we show how to model the compulsory cache misses and how they influence the schedule. We start by grouping the memory access operations into subsets $V_{l_i} \subseteq V_l$, such that all the operations belonging to the same subset $V_{l_i}$ access the same cache line $i$ (according to the cache line boundaries [10]). For the three loads $a$, $b$, $c$ of Sect. 4.3, which access the same cache line, we have $V_{l_0} = \{a, b, c\}$. The first issued load in a subset $V_{l_i}$ causes a cache miss. Its latency must be changed to $lce$. The remaining operations within that subset have a hit latency, while their issue time must be delayed by at least $(lce + tce)$, as explained in Sect. 4.3.

To identify which load operation is scheduled first and causes a miss, we define a variable $m_i$ for each subset $V_{l_i}$, which holds the first (minimal) issue time:

$$m_i = \min_{u \in V_{l_i}} \sigma_u$$

We use the linear expression of $\min_n$ explained in Sect. 5.
Resources Constraints
Conflicting Graph
The resources constraints are handled by considering for each FU an undirected graph $F_q = (V, \mathcal{E}_q)$ which represents the conflicts between the instructions on a FU $q \in Q$. For any couple of operations, $(u, v) \in \mathcal{E}_q$ iff $u$ and $v$ are in conflict on $q$. Any clique in $F_q$ represents a set of operations which conflict on $q$ at the same time. So, the size of any clique must not exceed $N_q$, the number of copies of the FU $q$. We define a binary variable $f^q_{u,v} \in \{0, 1\}$ such that $f^q_{u,v} = 1$ iff there is a conflict between $u$ and $v$ on the FU $q$. Given the RTs of two operations $u$ and $v$, we can deduce when a structural hazard occurs on the FU $q$. The general formulation of the conflicting variables is the disjunction of all the cases where a conflict on the FU occurs.

Let $U_{u,q}$ be the set of clock cycles in the reservation table of $u$ where the FU $q$ is used by $u$:

$$U_{u,q} = \{c \in \mathbb{N} / RT_u[c, q] = 1\}$$

The set of all the cases where two operations $u$ and $v$ conflict on a FU $q$ is described by the cartesian product $U_{u,q} \times U_{v,q}$.
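The enumeration of the conflict cases over the cartesian product can be sketched as follows. It assumes that two operations conflict on a FU when some pair of their usage cycles coincides in absolute time; reservation tables are encoded as sets of pairs `(c, q)`, and the operation and FU names are hypothetical.

```python
def usage_cycles(rt_u, q):
    """Cycles, relative to the issue time, at which an operation uses FU q."""
    return {c for (c, fu) in rt_u if fu == q}


def in_conflict(rt_u, rt_v, sigma_u, sigma_v, q):
    """True iff some pair (cu, cv) of the cartesian product of the two
    usage sets maps to the same absolute clock cycle."""
    return any(sigma_u + cu == sigma_v + cv
               for cu in usage_cycles(rt_u, q)
               for cv in usage_cycles(rt_v, q))


# Hypothetical reservation table: two cycles on DIV, then one cycle on ALU.
rt_div = {(0, "DIV"), (1, "DIV"), (2, "ALU")}

print(in_conflict(rt_div, rt_div, 0, 1, "DIV"))
print(in_conflict(rt_div, rt_div, 0, 1, "ALU"))
print(in_conflict(rt_div, rt_div, 0, 2, "DIV"))
```

Each pair of the product gives one equality between shifted issue times, and the conflict variable is the disjunction of these equalities.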
Maximal Clique in the Conflicting Graph
For simplicity, rather than considering the conflict graph $F_q$ itself, we consider its complementary graph $F'_q$, in which two operations are adjacent iff they are not in conflict on $q$ ($f^q_{u,v} = 0$). Then, a clique in $F_q$ becomes an independent set in $F'_q$.
We define a binary variable $y^q_u \in \{0, 1\}$ for each operation $u$ such that $y^q_u = 1$ iff $u$ belongs to an IS of $F'_q$. We write in the intLP model the linear constraints of IS:

$$\forall q \in Q, \ \forall \text{ couple } u, v \in V: \quad y^q_u + y^q_v \le 1 \Longleftrightarrow f^q_{u,v} = 0$$

We use the linear constraints of the equivalence (Sect. 5.1) by introducing a binary variable $h \in \{0, 1\}$. There are $O(\frac{1}{2} \times |V| \times (|V| - 1))$ binary variables $h$ for each FU (one for each couple of operations) and $O(2 \times |V| \times (|V| - 1))$ linear constraints to describe the IS. The resources constraints are the fact that the cardinality of any independent set in $F'_q$ must not exceed $N_q$. We write in the model:

$$\forall q \in Q: \quad \sum_{u \in V} y^q_u \le N_q$$

There are $O(|Q|) = O(1)$ such linear constraints.
RELATED WORK AND DISCUSSION
Acyclic scheduling under registers and resources constraints is a classical problem on which many works have been done. An intLP formulation (SILP) was defined in [15] to compute an optimal schedule with register allocation under resources constraints. The complexity of this model is bounded by $O(|V|^2)$ variables and $O(|V|^2)$ constraints. However, this formulation does not introduce registers constraints, i.e. it does not limit the number of values simultaneously alive. Other formulations [7, 9] introduced registers constraints. The number of variables was $O(|V|^2)$ but the number of linear constraints grew exponentially due to the registers constraints.
A polynomial formulation for the registers constraints was defined in [4] with a complexity of $O(T \times |V|)$ variables and $O(|E| + T \times |V|)$ constraints. Similar approaches minimized the register requirement for the exact cyclic scheduling problem (software pipelining) under registers and resources constraints [1, 5, 3]. It is easy to rewrite these intLP models to solve the acyclic scheduling problem. All these formulations have a complexity which depends on the worst total schedule time $T$. Indeed, they define a binary variable $\sigma_{u,c}$ for each operation $u$ and for each execution step $c$ during the whole execution interval $[0, T]$: $\sigma_{u,c}$ is set to 1 iff the operation $u$ is scheduled at the clock cycle $c$. The complexity of their models is clearly bounded by $O(T \times |V|)$ variables and $O(|E| + T \times |V|)$ constraints. In fact, the factor $T$ can be very large in real codes since it depends on the input data itself (critical paths and specified operation latencies). We think that the complexity should depend only on the amount of input data and not on the data itself. Otherwise, the resolution time would not scale well. For instance, if a memory operation is always a cache miss, then we change its statically specified latency to the memory access latency ($\approx 100$) in order to better exploit free slots during scheduling. The number of variables and constraints generated with all these techniques is then multiplied by a factor of a hundred, while the size of our model does not change. The coefficients introduced by our formulation in the final constraints matrix are all bounded by $T$ and $-T$, which is also the case for the coefficients in the models defined in [1, 4, 5, 4]. If $T$ is very large, the resolution process can be difficult because of computational overflows [11]. Since EquiMinMax reduces the size of the model, resolving an EquiMinMax model is less critical than resolving a model built with any one of the cited techniques.
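The dependence of the model size on $T$ can be made concrete with a small count. This is a rough sketch under stated assumptions: $T$ is taken as the sum of the latencies (a fully sequential schedule), a time-indexed model is assumed to use one binary variable per operation and per cycle, and $|V|^2$ is used only as a proxy for a model whose size is quadratic in the number of operations. The latencies are hypothetical.

```python
def time_indexed_size(latencies):
    """Number of (operation, cycle) binary variables of a time-indexed
    model, with the horizon T assumed to be the sum of the latencies."""
    horizon = sum(latencies)
    return horizon * len(latencies)


def quadratic_size(latencies):
    """|V|^2, a proxy for a model size that ignores the latency values."""
    return len(latencies) ** 2


short_load = [1, 1, 2, 1]      # the third operation is a load that hits
long_load = [1, 1, 100, 1]     # same DAG, the load is now a cache miss

print(time_indexed_size(short_load), time_indexed_size(long_load))
print(quadratic_size(short_load), quadratic_size(long_load))
```

Only the first count grows when the load latency is raised; the second one depends on the number of operations alone.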
CONCLUSION
In this work, we give an intLP formulation of optimal scheduling under resources and registers constraints with cache effects. The FUs can have a complex and heterogeneous usage pattern and are modeled by reservation tables. We handle multiple register types and delayed reads from and writes into the registers. We also reduce the cache effects caused by the compulsory misses. The complexity of our model is polynomial in the size of the input DAG only. Theoretically, our formulation should considerably reduce the exact resolution time. In the future, we will try to model the capacity and conflict misses and to extend our formulation to cyclic scheduling (software pipelining), where the lifetime intervals of the values and the resources usage patterns become cyclic.
