Abstract
Introduction
ILP (Instruction Level Parallelism) processors are increasingly used in embedded systems. Examples include Texas Instruments' TI C6, StarCore's SC140, Philips' Trimedia, and HP and STMicroelectronics' Lx [6].
A typical ILP processor consists of multiple parallel functional units of different types. An instruction can be executed only on a functional unit of the same type. Typically, the execution of an instruction takes one processor cycle. However, there are often delays of one or more processor cycles between instructions. These delays, called latencies, arise primarily because of off-chip communication and pipelining. For example, if instruction v_i precedes instruction v_j and the latency between v_i and v_j is k cycles, then v_j can be executed only after k cycles have elapsed since the completion of v_i. Instruction scheduling is a key problem in an optimising compiler for ILP processors. In non-real-time applications, the objective of instruction scheduling is to find a shortest schedule for a set of instructions. This problem is NP-complete even if the target processor has only one functional unit and latencies can be arbitrarily large [12, 13].
In real-time systems, instructions are subject to timing constraints. Typical timing constraints include release times and deadlines. For example, in CNC systems [7], the output to motors must be sent at particular times to maintain high positioning accuracy of the machine tool. A number of researchers have studied the problem of scheduling time-constrained instructions [1, 2, 3, 4, 5]. Palem and Simon [4] studied the problem of scheduling instructions with individual deadlines on an ILP processor with multiple identical functional units. Their algorithm is guaranteed to find a feasible schedule in several special cases. Wu and Jaffar [2] proposed an efficient algorithm for scheduling instructions with individual deadlines on an ILP processor with multiple functional units of different types. Their algorithm is also guaranteed to find a feasible schedule in several special cases. Leung et al. [1] proposed a polynomial-time algorithm for scheduling instructions with individual release times and deadlines on an ILP processor with multiple identical functional units.
In this paper, we propose a fast algorithm for scheduling instructions with individual release times and deadlines on an ILP processor with multiple functional units of different types. Our algorithm is guaranteed to find a feasible schedule whenever one exists in the following special cases: 1) one functional unit, arbitrary precedence constraints, latencies in {0, 1}, integer release times and deadlines; 2) two identical functional units, arbitrary precedence constraints, latencies of 0, integer release times and deadlines; 3) multiple identical functional units or multiple functional units of different types, monotone interval-ordered graph, integer release times and deadlines; 4) multiple identical functional units, in-forest, equal latencies, integer release times and deadlines. In case 1), our algorithm improves on the fastest existing algorithm [3], reducing the running time from O(n^2 log n) + min{O(ne), O(n^2.376)} to min{O(ne), O(n^2.376)}, where n is the number of instructions and e is the number of edges in the precedence graph. In case 2), our algorithm improves on the fastest existing algorithm [1], reducing the running time from O(ne + n^2 log n) to min{O(ne), O(n^2.376)}. In case 3), no polynomial time algorithm for multiple functional units of different types was known before.
The main idea of our algorithm is computing a tighter deadline, called the l_max(v_i)-successor-tree-consistent deadline, for each instruction v_i, where l_max(v_i) is the maximum latency between v_i and all its immediate successors. Given a problem instance P, the l_max(v_i)-successor-tree-consistent deadline of an instruction v_i is the upper bound on its latest completion time in any feasible schedule for the relaxed problem P(v_i), where the precedence-latency constraints are represented by the l_max(v_i)-successor tree, which is a subset of the original precedence-latency constraints. To make it faster to compute the l_max(v_i)-successor-tree-consistent deadline for each instruction v_i, we use a number of techniques, namely, forward scheduling, backward scheduling, disjoint set union-find and binary search.
Model and Definitions
The target ILP processor M has m functional units of w types R_1, R_2, ..., R_w. The number of functional units of type R_i is m_i. An instruction of type R_i can be executed only on a functional unit of the same type. The execution of each instruction takes one processor cycle. A latency exists between two instructions with a direct dependency. The precedence-latency constraints are represented by a weighted DAG G = (V, E, W), where V denotes the set of all instructions, E the set of precedence constraints and W the set of all latencies. In addition, each instruction may have a pre-assigned release time and a pre-assigned deadline. If an instruction has no pre-assigned release time, its release time is set to 0. If an instruction has no pre-assigned deadline, its deadline is set to the largest pre-assigned deadline.
The problem of scheduling instructions with individual release times and deadlines on an ILP processor is described as follows. Given a problem instance P: a set V = {v_1, v_2, ..., v_n} of n instructions, where each instruction v_i has a type R(v_i) ∈ {R_1, R_2, ..., R_w}, a set of precedence-latency constraints in the form of a weighted DAG G = (V, E, W), where
1. Precedence-latency constraints:
2. Release time and deadline constraints:
3. Resource constraints: For each type R_i, 1) an instruction v_j of type R_i can be executed only on a functional unit of type R_i
Given a problem instance P , the edge-consistent release times and the edge-consistent deadlines of all instructions can be computed in O(e) time by using breadth-first search, where e is the number of edges in the precedence graph.
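The propagation behind this step can be sketched in a few lines of Python. The sketch rests on an assumption that should be checked against the formal definition: a release time is taken to be edge-consistent when r_j ≥ r_i + 1 + l_ij holds for every edge (v_i, v_j) with latency l_ij, and a deadline when d_i ≤ d_j − 1 − l_ij holds, with release times read as earliest start cycles and deadlines as latest completion cycles. The function name `edge_consistent` and the edge-list encoding are illustrative only. A breadth-first topological order (Kahn's algorithm) is used, so each edge is visited a constant number of times.

```python
from collections import deque

def edge_consistent(n, edges, release, deadline):
    """Tighten release times and deadlines along the edges of a DAG.

    n        : number of unit-time instructions, numbered 0..n-1
    edges    : list of (i, j, latency), meaning j must wait `latency`
               cycles after i completes (assumed encoding)
    release  : pre-assigned earliest start cycles
    deadline : pre-assigned latest completion cycles
    """
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for i, j, lat in edges:
        succ[i].append((j, lat))
        indeg[j] += 1

    # Breadth-first topological order (Kahn's algorithm).
    order = []
    queue = deque(v for v in range(n) if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    r = list(release)
    d = list(deadline)
    # Forward pass: a successor cannot start before its predecessor
    # has run for one cycle and the latency has elapsed.
    for u in order:
        for v, lat in succ[u]:
            r[v] = max(r[v], r[u] + 1 + lat)
    # Backward pass: a predecessor must complete early enough for
    # every successor to still meet its own deadline.
    for u in reversed(order):
        for v, lat in succ[u]:
            d[u] = min(d[u], d[v] - 1 - lat)
    return r, d
```

For a chain v_0 → v_1 → v_2 with latencies 1 and 0, all release times 0 and all deadlines 10, the call `edge_consistent(3, [(0, 1, 1), (1, 2, 0)], [0, 0, 0], [10, 10, 10])` returns release times `[0, 2, 3]` and deadlines `[7, 9, 10]`.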
Definition 2.2. Given a non-negative integer
In this paper, all time points and the two endpoints of any time interval are non-negative integers. Intuitively, all instructions of type R_i in a forbidden interval with respect to R_i fully occupy the forbidden interval and cannot be scheduled outside it in any feasible schedule. As a result, no other instruction can be scheduled in this forbidden interval. Forbidden intervals are used to make it faster to compute the l_max(v_i)-successor-tree-consistent deadline for each non-sink instruction v_i. All maximum forbidden intervals can be computed in O(n) time if we keep two lists of all instructions, sorted in non-decreasing order of their release times and in non-decreasing order of their deadlines, respectively.
An interval-ordered graph [14] is a DAG G = (V, E), where V is a set of intervals on the real line,
The l_max(v_i)-successor-tree-consistent deadline of an instruction v_i, denoted by d_i, is typically tighter than its pre-assigned deadline. Specifically, given a problem instance P and an instruction v_i, if v_i is a sink instruction, then d_i is equal to its pre-assigned deadline; otherwise, d_i is the upper bound on its latest completion time in any feasible schedule for the relaxed problem instance P(v_i), which has the same set of instructions as P with the following constraints:
• Precedence-latency constraints:
• Release time constraints: RT = {r(v_j): the release time of v_j is its edge-consistent release time r(v_j)}.
• Deadline constraints: D = {d_j: if v_j is a successor of v_i or the edge-consistent release time of v_j is greater than that of v_i, then the deadline
• Resource constraints: the same ILP processor as in P.
To compute the l_max(v_i)-successor-tree-consistent deadline for a non-sink instruction v_i, our algorithm first computes its successor-tree-consistent deadline. The successor-tree-consistent deadline of v_i is the upper bound on its latest completion time in any feasible schedule for the relaxed problem instance P'(v_i). The only difference between P(v_i) and P'(v_i) is that there is no latency constraint in P'(v_i).
Forward Scheduling and Backward Scheduling
In our algorithm, both forward scheduling and backward scheduling are used to compute the l_max(v_i)-successor-tree-consistent deadline of each non-sink instruction v_i. Forward scheduling solves the following special instruction scheduling problem: given a set A of n independent UET (unit execution time) instructions with integer release times and deadlines, find a feasible schedule σ_f on the ILP processor M such that the maximum completion time of all instructions is minimised. Forward scheduling is a greedy scheduling technique where each instruction is scheduled as early as possible. In forward scheduling, an instruction is ready at time t if t is not less than its release time. Forward scheduling works as follows. For each time point 0, 1, ..., choose a ready instruction v_k with the smallest deadline to run on an idle functional unit of type R(v_k). Ties are broken arbitrarily. A schedule generated by forward scheduling is called a forward schedule. A forward schedule can be constructed in O(n) time by using Frederickson's linear time algorithm [9, 15] for scheduling a set of UET tasks with individual integer release times and deadlines on multiple identical processors.
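The greedy rule of forward scheduling can be illustrated with a short Python sketch. It is not Frederickson's linear time algorithm: for readability it uses a binary heap keyed on deadlines and therefore runs in O(n log n) time. The function name `forward_schedule`, the tuple encoding of instructions and the `units` dictionary are assumptions made for this illustration. Because the instructions are independent, each functional-unit type is scheduled separately.

```python
import heapq

def forward_schedule(instrs, units):
    """Greedy earliest-deadline-first placement of independent
    unit-time instructions.

    instrs : list of (release, deadline, type) triples (assumed encoding)
    units  : dict mapping each type to its number of functional units
    Returns a list of start cycles, or None if some instruction would
    complete after its deadline.
    """
    by_type = {}
    for idx, (r, d, ty) in enumerate(instrs):
        by_type.setdefault(ty, []).append((r, d, idx))

    start = [None] * len(instrs)
    for ty, items in by_type.items():
        items.sort()          # non-decreasing release time
        ready = []            # min-heap of (deadline, index)
        k, t = 0, 0
        while k < len(items) or ready:
            if not ready and items[k][0] > t:
                t = items[k][0]          # skip cycles with nothing ready
            while k < len(items) and items[k][0] <= t:
                heapq.heappush(ready, (items[k][1], items[k][2]))
                k += 1
            # Fill the idle units of this type, smallest deadline first.
            for _ in range(min(units[ty], len(ready))):
                d, idx = heapq.heappop(ready)
                if t + 1 > d:            # would complete after its deadline
                    return None
                start[idx] = t
            t += 1
    return start
```

With one unit of each of two hypothetical types "A" and "B", `forward_schedule([(0, 2, "A"), (0, 1, "A"), (1, 3, "B"), (0, 3, "A")], {"A": 1, "B": 1})` returns `[1, 0, 1, 2]`: the instruction with deadline 1 runs first on the "A" unit, and the "B" instruction starts at its release time.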
Scheduling Algorithm
In this section, we describe a fast algorithm for scheduling instructions with precedence-latency constraints, individual release times and deadlines on the ILP processor M .
Our algorithm consists of three main steps. The first step is preprocessing, which includes computing edge-consistent release times and deadlines for all instructions and sorting four arrays that are used in forward scheduling, backward scheduling and computing the l_max(v_i)-successor-tree-consistent deadline for each non-sink instruction v_i. The second step is computing the l_max(v_i)-successor-tree-consistent deadline d_i for each non-sink instruction v_i. The last step is constructing a schedule for P by using list scheduling.
Note that by the definition of the l_max(v_j)-successor-tree-consistent deadline, if an instruction v_j is a successor of v_i or the edge-consistent release time of v_j is greater than that of v_i, then the l_max(v_j)-successor-tree-consistent deadline of v_j must be computed before that of v_i. To satisfy this requirement, our algorithm uses an array L of all non-sink instructions, sorted in non-ascending order of release times. The framework of our algorithm is shown in pseudo code as follows.
... array of all instructions in P;
var L: array of all non-sink instructions in P;
begin
  /****** Preprocessing ******/
  compute the edge-consistent release times and deadlines for all instructions;
  for each instruction v_i do
  begin
    set its release time to its edge-consistent release time;
    set its deadline to its edge-consistent deadline;
    ...
    ... in non-decreasing order of deadlines;
  end
  end
  /****** Compute a feasible schedule ******/
  compute a schedule σ for P by using list scheduling;
end
In list scheduling, the priority of each instruction v_i is its l_max(v_i)-successor-tree-consistent deadline, and a smaller number implies a higher priority. List scheduling works as follows. At any time, among all ready instructions, an instruction with the highest priority is chosen and scheduled as early as possible on an idle functional unit of the same type as the instruction. Ties are broken arbitrarily. An instruction v_i is ready at time t if 1) each immediate predecessor v_j of v_i has finished by time t − l_ji, and 2) t is not less than the release time of v_i.
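The list scheduling rule described above can be sketched in Python as follows. This is a straightforward cycle-by-cycle simulation written for clarity rather than speed, and the function name `list_schedule`, the edge-list encoding and the `units` dictionary are assumptions of the sketch. The priority array would hold the l_max(v_i)-successor-tree-consistent deadlines; here it is simply passed in.

```python
def list_schedule(n, edges, release, priority, types, units):
    """Return start cycles for n unit-time instructions.

    edges    : list of (i, j, latency) precedence-latency constraints (a DAG)
    release  : release[v] is the earliest start cycle of v
    priority : smaller value means higher priority
    types    : types[v] is the functional-unit type required by v
    units    : dict mapping each type to its number of units (each >= 1)
    """
    preds = [[] for _ in range(n)]
    for i, j, lat in edges:
        preds[j].append((i, lat))

    start = [None] * n
    remaining = set(range(n))

    def is_ready(v, t):
        # Ready when released and every immediate predecessor has
        # completed (start + 1) at least `lat` cycles before t.
        if release[v] > t:
            return False
        return all(start[u] is not None and start[u] + 1 + lat <= t
                   for u, lat in preds[v])

    t = 0
    while remaining:
        free = dict(units)    # idle units of each type in cycle t
        candidates = sorted((v for v in remaining if is_ready(v, t)),
                            key=lambda v: priority[v])
        for v in candidates:
            if free[types[v]] > 0:
                free[types[v]] -= 1
                start[v] = t
                remaining.remove(v)
        t += 1
    return start
```

The sketch only builds the schedule. Whether the schedule is feasible can be checked afterwards by testing `start[v] + 1 <= deadline[v]` for every instruction v against the original deadlines.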
Our algorithm computes the l_max(v_i)-successor-tree-consistent deadline of each non-sink instruction v_i in two steps. In the first step, our algorithm computes the successor-tree-consistent deadline of v_i.
In the second step, our algorithm uses binary search and the successor-tree-consistent deadline of v_i to compute its l_max(v_i)-successor-tree-consistent deadline.
Next we describe these two steps in detail.
where σ_bj is a backward schedule for A(R_sj, t_max).
By the properties of forward scheduling and backward scheduling, the successor-tree-consistent deadline of v_i is min{d_i, t_max}.
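Placing unit-time instructions as late as possible is the kind of operation that disjoint set union-find speeds up. The following Python sketch illustrates the idea on a deliberately simplified setting that is an assumption of this example and not the procedure of this paper: one functional unit, no release times, and each instruction placed in the latest free cycle that ends no later than its deadline. The function name `latest_free_slots` is hypothetical. Index k + 1 of the union-find structure stands for the cycle [k, k + 1), and index 0 is a sentinel meaning that no free cycle remains.

```python
def latest_free_slots(deadlines):
    """Place unit-time instructions as late as possible on one unit.

    deadlines : latest completion cycle of each instruction (>= 1)
    Returns the start cycle of each instruction, or None if some
    instruction cannot be placed before its deadline.
    """
    horizon = max(deadlines)
    # parent[x] == x means the cycle [x - 1, x) is still free.
    parent = list(range(horizon + 1))

    def find(x):
        # Largest free index <= x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    starts = []
    for d in deadlines:
        slot = find(d)
        if slot == 0:              # every cycle before d is occupied
            return None
        starts.append(slot - 1)
        parent[slot] = slot - 1    # merge with the cycle to its left
    return starts
```

For deadlines `[2, 2, 3]` the sketch returns start cycles `[1, 0, 2]`, and for deadlines `[1, 1]` it returns `None`, since two instructions cannot both complete by cycle 1 on a single unit. Because every union merges a cycle with its left neighbour, the union tree is a chain.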
Our algorithm for computing the successor-tree-consistent deadline of v_i is shown as follows: ...
It is not difficult to show that the successor-tree-consistent deadline of v_i is min{d_i, t_max[c]}. The maximum time point t_max[j] (j = 1, 2, ..., c) can be computed by using the disjoint set union-find algorithm as follows. ...
... As a result, no feasible schedule exists for P(v_i). If such a t_max exists, it is the l_max(v_i)-successor-tree-consistent deadline of v_i. Otherwise, no feasible schedule exists for the relaxed problem instance P(v_i). As a result, no ...
Example 1. Consider a problem instance P with 14 instructions and an ILP processor with two heterogeneous functional units F_1 and F_2. The precedence-latency constraints, release times and deadlines are shown in Figure 1, where a filled node denotes an instruction which must be executed on F_1 and a non-filled node represents an instruction which must be executed on F_2. In each label [x, y], x and y are the pre-assigned release time and deadline of the corresponding instruction, respectively. First, our algorithm computes the edge-consistent release times and deadlines for all instructions. In Figure 2, x and y in [x, y] are the edge-consistent release time and edge-consistent deadline of the corresponding instruction, respectively.
Figure 2. The edge-consistent release times and deadlines in P.
Let S(r_i, R_sj) = {v_k : v_k ∈ Succ(v_i) and R(v_k) = R_sj} ∪ {v_k : v_k ∈ V − {v_i} − Succ(v_i), σ_f1^i(v_k) ≥ r_i, and R(v_k) = R_sj}, let d[j, 0] = r_i, and let d[j, 1], d[j, 2], ..., d[j, c_j] be the c_j different deadlines of all instructions in S(r_i, R_sj) with d[j, 1] < d[j, 2] < ... < d[j, c_j], where r_i is the release time of v_i. Partition the time interval [d[j, 0], d[j, c_j]) into c_j smaller disjoint intervals π_1 = [d[j, 0], d[j, 1]), π_2 = [d[j, 1], d[j, 2]), ..., π_cj = [d[j, c_j − 1], d[j, c_j]). An instruction v_k in S(r_i, R_sj) belongs to an interval [x, y) if its deadline d_k satisfies x < d_k ≤ y. Each instruction v_k ∈ S(r_i, R_sj) is
Let v_w1, v_w2, ..., v_wp be all instructions satisfying the following constraints: 1) For each v_w
Next, our algorithm computes the l_max(v_i)-successor-tree-consistent deadline for each non-sink instruction v_i in non-increasing order of their edge-consistent release times. Suppose that our algorithm has computed the 0-successor-tree-consistent deadline of v_6, which is 7. We show how our algorithm computes the 4-successor-tree-consistent deadline ...
The l_max(v_i)-successor-tree-consistent deadline of each instruction v_i is shown in Figure 6, where y in [x, y] beside each instruction v_i is the l_max(v_i)-successor-tree-consistent deadline of v_i. Lastly, our algorithm uses list scheduling to compute a schedule for the original problem instance P. The resulting schedule, which is feasible, is shown in Figure 7.
By using induction and the properties of forward scheduling and backward scheduling, we can prove the following.
Proof. Suppose that there exists a feasible schedule σ', but the schedule σ computed by our algorithm is not feasible. Let v_k be the first late instruction and t the earliest integer time point satisfying 1) there are
, then by the pigeonhole principle, there must be a late instruction in any schedule for P, which contradicts the assumption. Otherwise, consider the following special cases.
1. Arbitrary DAG, latencies in {0, 1}, individual integer release times and deadlines, and one functional unit. Let v_i be the instruction scheduled in time interval [t − 2, t − 1). Consider all possible cases.
(a) No instruction is scheduled in time interval
In this case, by the greediness of list scheduling, the release times of all instructions in S must be greater than or equal to t. Therefore, by the pigeonhole principle, at least one instruction must be late in any schedule for P, which contradicts the assumption.
Consider the two possible cases.
i. The release times of all instructions in S are greater than or equal to t. By the pigeonhole principle, there must be at least one late instruction in any schedule for P, which contradicts the assumption.
ii. There is at least one instruction whose release time is less than or equal to t − 1. In this case, all instructions whose release times are less than or equal to t − 1 must be successors of v_i. By our algorithm for computing the l_max(v_i)-successor-tree-consistent deadline, v_i must also be late with respect to its l_max(v_i)-successor-tree-consistent deadline in the schedule σ, which contradicts the assumption that v_k is the first late instruction.
Since each instruction in S_1 − S_2 must be a successor of some instruction in S_1, for each instruction v_j ∈ S_1, l^+_sj ≥ l_sr also holds. By our algorithm for computing successor-tree-consistent deadlines, v_s must also be late with respect to its l_max(v_s)-successor-tree-consistent deadline in the schedule σ, which contradicts the assumption that v_k is the first late instruction.
4. In-forest, equal latencies, individual integer release times and deadlines, and multiple identical functional units. The proof for this special case is essentially the same as in [11] .
Our algorithm for computing the successor-tree-consistent deadlines uses the disjoint set union-find algorithm. Since the union tree in this case is a chain, we can use Gabow's linear time union-find algorithm [15]. Therefore, for each non-sink instruction v_i, it takes O(n) time to compute the successor-tree-consistent deadline of v_i, where n is the number of instructions. After the successor-tree-consistent deadline of each non-sink instruction v_i has been computed, our algorithm uses binary search to compute the l_max(v_i)-successor-tree-consistent deadline of v_i.
Conclusion
We proposed a fast algorithm for scheduling instructions in a basic block with precedence-latency constraints and timing constraints, in the form of individual integer release times and deadlines, on an ILP processor. The key idea of our scheduling algorithm is computing the l_max(v_i)-successor-tree-consistent deadline for each instruction. To make it faster to compute the l_max(v_i)-successor-tree-consistent deadline for each non-sink instruction v_i, we use a number of techniques, namely, forward scheduling, backward scheduling, disjoint set union-find and binary search. Our algorithm is guaranteed to find a feasible schedule whenever one exists in a number of special cases. In the first special case, where the processor has only one functional unit and the maximum latency is 1, our algorithm improves on the fastest existing algorithm [3], reducing the running time from O(n^2 log n) + min{O(ne), O(n^2.376)} to min{O(ne), O(n^2.376)}. In the second special case, where the ILP processor has only two identical functional units, our algorithm improves on the fastest existing algorithm [1], reducing the running time from O(ne + n^2 log n) to min{O(ne), O(n^2.376)}. The first polynomial time algorithm for this special case, proposed by Garey and Johnson [8], runs in O(n^3) time. In the third special case, where the precedence-latency constraints can be represented as a monotone interval-ordered graph and the ILP processor has multiple functional units of different types, our algorithm is the first polynomial time algorithm.
Further research on instruction scheduling with timing constraints is expected. One open problem is loop scheduling with individual release times and deadlines on an ILP processor. In non-real-time computing, software pipelining is an efficient approach to exploiting ILP. In real-time embedded systems, meeting timing constraints is the primary consideration. It would be interesting to see how release times and deadlines can be handled in software pipelining. Another open problem is scheduling instructions with timing constraints on clustered ILP processors. On a clustered ILP processor such as Lx, communication constraints exist. If two instructions with a data dependency are assigned to different clusters, the communication delay between these two instructions must be respected in any valid schedule. However, if these two instructions are assigned to the same cluster, there is no communication delay. It is not known whether there is any consistency technique for handling communication constraints efficiently.
