The state diagram based approach has been proposed as an effective way to model resource constraints in traditional instruction scheduling and software pipelining methods. However, the state diagram constructed for software pipelining (i) is very large and (ii) contains a significant amount of replicated, and hence redundant, information on legal latency sequences. As a result, constructing state diagrams can take a very long time. For example, in modeling the resource constraints of the DEC Alpha 21064 processor, it took more than 24 hours on a 250 MHz high-performance workstation to construct 100,000 distinct latency sequences. In another experiment, out of the 224,400 latency sequences generated, only 30 were distinct. These observations make the state diagram based approach impractical in real compiler implementations.
Introduction
Recent studies on modulo scheduling, an instruction scheduling method for loops [9, 11, 16, 17], in a production compiler have reported significant improvement (up to 35%) in the overall runtime for a suite of SPEC floating point benchmark programs [18]. On the other hand, rapid advances in VLSI technology and computer architecture present an important challenge for compiler designers: a modulo scheduler must be able to handle machine resource constraints much more complex than before and may need to search an increasingly large number of schedules before a desirable one is chosen. Therefore, to exploit the advantage of modulo scheduling in a production compiler, it is important to handle complex resource constraints efficiently and find a good modulo schedule very quickly.

Proebsting and Fraser proposed an interesting approach that uses finite state automata for modeling complex resource constraints in instruction scheduling [15]. Their approach was based on [13] and was subsequently improved and applied to production compilers in [2]. This approach was extended to software pipelining methods in [6, 7]. This paper focuses on the efficient construction of finite state automata for software pipelining methods. In the above methods [2, 6, 7, 13, 15], processor resources are modeled using a finite state automaton (or state diagram) which is constructed from the resource usage table for each instruction (class). Each path in the state diagram represents a legal latency sequence which can be used directly by the scheduler for modeling resource contention. The construction of the state diagram is typically done off-line and stored in some form so that the instruction scheduler, at schedule time, only needs to read this information. This effectively reduces the problem of checking structural hazards in the scheduling method to a fast table lookup, resulting in a several-fold speedup [2, 7, 13].
In particular, the enhanced Co-Scheduling method, a state diagram based modulo scheduling method, reports a 2-fold speedup in the scheduling time (the time to construct the software pipelined schedule) [7].
A major challenge facing this Co-Scheduling method is the huge size of the state diagram. The size of the state diagram increases exponentially with (i) larger values of the initiation interval (of the software-pipelined schedule) and (ii) pipelined function units which share resources (e.g., function units sharing decode stages and register read-write ports). Further, in the constructed state diagram a significant number of paths are found to contain replicated information regarding legal latency sequences. (Henceforth, we refer to these paths as redundant paths.) The previous work on the reduced state diagram [7] has met with only limited success in eliminating this redundancy.
To illustrate the above problem, we present evidence from several experiments. In one experiment conducted by us, the state diagram for the DEC Alpha 21064 processor with an initiation interval (II) equal to 3 contained 224,400 latency sequences, out of which only 30 were distinct. In another experiment, the state diagram for II = 16 contained more than 74 million (74,183,493) distinct paths¹. Lastly, generating 100,000 distinct latency sequences in the above state diagram took more than 24 hours on an UltraSparc machine.
Though it is theoretically interesting to construct the complete state diagram involving all non-redundant paths, it becomes inefficient for the instruction scheduler or software pipelining method to deal with all these latency sequences. In order to make the state diagram based approach attractive and practical for implementation in real compilers, it is important to provide the scheduler with a large subset of distinct latency sequences (paths) in a short computation time.
In this paper, we propose two methods to drastically reduce the construction time of state diagrams. The first of these methods relates the construction to a well-known problem in graph theory, namely the enumeration of maximal independent sets of an interference graph. This facilitates the use of an existing enumeration algorithm as a direct, fast method for constructing all latency sequences. We refer to this method as the Enumeration of Maximal Independent Set (E-MIS) method. Two major advantages of this method are that it is a direct method and that it generates only distinct (non-redundant) latency sequences.
The second method uses a heuristic approach to eliminate redundant paths by exploiting the structure of state diagram construction and by employing an aggressive redundancy removal approach. This is accomplished by enforcing a surprisingly simple redundancy constraint which completely eliminates all redundant paths. We refer to this method as the Redundancy Prevention (RP) method, as it identifies, as early as possible in the construction process, states which could cause redundancy and prunes them aggressively. We formally establish that the proposed heuristic results in a redundancy-free state diagram. However, the redundancy constraint used by the heuristic is only a necessary (but not sufficient) condition. As a consequence, the aggressive pruning may eliminate some non-redundant paths as well. Nevertheless, we find that the RP method works well in practice.
We compare the efficiency of the proposed methods with that of the reduced state-diagram construction (RD) method in modeling two real processors, namely the DEC Alpha 21064 processor and the Cydra VLIW processor [3]. Our experimental results reveal that the proposed methods reduce the construction time by about 3 to 4 orders of magnitude. For example, the RP and E-MIS methods took only 1.2 and 1.5 seconds, respectively, to construct the first 100,000 distinct latency sequences for the DEC Alpha processor for a particular II, while the RD method took more than 24 hours. Another interesting observation from our experiments is that the RP method, though a heuristic approach which can possibly eliminate non-redundant paths, does reasonably well in enumerating all non-redundant paths. In fact, for the two processors modeled and for small values of II less than 16, it did not miss a single path. Lastly, we use the latency sequences constructed by the RP and E-MIS methods in the Co-Scheduling framework to construct software pipelined schedules. Initial experiments reveal that the proposed methods perform competitively, and in a reasonable computation time. This provides empirical evidence for the use of the state diagram approach as a practical and efficient method in software pipelining.

¹ The number of actual paths (including the redundant ones) is far too many to count and generate!
In Section 2, we present a brief review of the state diagram based software pipelining method. The problem formulation and the proposed approaches are informally discussed in Section 3. Sections 4 and 5 present the details of the two proposed methods, respectively. In Section 6 we compare the performance of the E-MIS and RP methods with the reduced state diagram construction. Section 7 discusses related work, and concluding remarks are provided in Section 8.
Background and Motivation
Software pipelining has been found to be an efficient compilation technique that results in significant performance improvement at runtime [9, 11, 16, 17]. Modeling of resource constraints (or structural hazards) in software pipelining is becoming increasingly complex in modern processor architectures. However, an efficient resource model is crucial for the success of the scheduling method [2, 4, 6, 15]. In this section, we first review the state diagram based resource model used in software pipelining [6]. Subsequently we motivate our approaches to the efficient construction of the state diagram.
Background
Conventional approaches to modeling resource constraints use a simple reservation table (see Figure 1(a)) [...] $\rightarrow S_4$ corresponds to initiations at time steps (0, 2, 4, 6). These offset values for the path are collectively referred to as the offset set, or OffSet for short. Once the state diagram is constructed, the OffSets corresponding to the various paths can be used to guide the enhanced Co-Scheduling method in making quick and "good" decisions about when (at what offset value) to schedule an instruction in the pipeline [7].

[Figure 2: state diagram. Each state is labeled with its initiation time in parentheses, e.g. S11(4), S16(5), S19(6), S12(6), S14(10), S6(6), S5(5); some states carry a * or ** mark.]
State Diagram Explosion and the Redundancy Problem
The major challenge in the construction of a state diagram is that it can become extremely large, consisting of several million states, even for moderate values of II. For example, for the DEC Alpha architecture, the state diagram consists of 10, 648, and 224,400 paths for II = 1, 2, and 3, respectively. Fortunately, several paths in the state diagram are shown to be redundant [7, 8] in the sense that the corresponding OffSets are equal. In the above three state diagrams for the DEC Alpha, only 3, 9, and 30 paths are distinct.
For the state diagram shown in Figure 2 [...] to an initiation at time 10 (shown in brackets adjacent to the state number in Figure 2). In Figure 2, these states (e.g., $S_7$, $S_{13}$, and $S_{14}$) are marked with a '*' and the arcs leading to these states are shown as dotted lines.
The following section states this problem of state explosion and informally presents the proposed approaches.
Problem Formulation
The RD construction method [7] has the limitation that it is somewhat inefficient in removing redundancy and cannot completely eliminate all redundant paths. This is especially so for architectures involving multiple instruction types where different function units may share some of the resources, a common feature in real architectures.
In order for the state diagram based approach to be useful in real compilers, it is important to deal with redundancy elimination in an efficient way. Further, even after eliminating all redundant paths, the state diagram may still consist of a large number of distinct paths. Hence, for practical reasons, it is important to construct a large subset of distinct OffSets in a short computation time so that the scheduler can use this information effectively.
In this paper we propose two efficient solutions to the above problem. In the first solution, the E-MIS method, we relate the generation of OffSets to a well-known graph theory problem, viz., enumerating the maximal independent sets of a graph. This allows an existing efficient enumeration algorithm to be used as a direct method for generating the OffSets without incurring any redundancy.

We motivate the second approach using the state diagram shown in Figure 2. Consider the states $S_6$ and $S_{11}$ in Figure 2, which lead only to redundant paths. Such states should be eliminated at generation time itself to prevent redundancy. These states ($S_6$, $S_{11}$, $S_{16}$, and $S_{19}$) are marked with a double star '**' in Figure 2. The RP method identifies these states using a necessary (but not sufficient) condition and prunes them aggressively to prevent redundancy.
Throughout our discussion in this and in the following two sections, for reasons of simplicity, we consider only state diagrams for function units that do not share resources. However, it is straightforward to extend the ideas of the original RD method and of the proposed RP and E-MIS methods to pipelines that share resources. Using such an extension we model the DEC Alpha 21064 processor and the Cydra VLIW processor, where resources are shared among function units.
A Graph Theoretic Approach for OffSets Generation
In this section we relate the generation of OffSets to a well-known graph theory problem, which facilitates the use of an existing algorithm as a direct and efficient method. This is based on a correspondence between the set of OffSets and the maximal compatibility classes which was established in [7].

Definition 4.3 A subset $S$ of vertices is an independent set if there exists no edge between any two vertices in $S$. An independent set is said to be maximal if it is not contained in any other independent set.
In our example graph, $\{v_0, v_1, v_3, v_5\}$ and $\{v_0, v_2, v_4\}$ are maximal independent sets. From the above definitions and the description of the interference graph, it clearly follows that each maximal independent set in the graph corresponds to a maximal compatibility class, and hence to an OffSet in the reduced state diagram. Thus, to generate the set of all OffSets of the state diagram, one needs to enumerate all maximal independent sets in the interference graph.
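As a rough illustration of this enumeration, the following Python sketch builds an interference graph and lists its maximal independent sets. It assumes (the construction is not spelled out in the surviving text) that the vertices are the offsets 0 to II - 1 and that an edge joins two offsets whose difference modulo II is a forbidden latency. The function names and the small example in the usage note are hypothetical, and the enumeration is a plain Bron-Kerbosch recursion on the complement graph, not the algorithm of [19].

```python
from typing import Dict, FrozenSet, Iterator, Set


def interference_graph(ii: int, forbidden: Set[int]) -> Dict[int, Set[int]]:
    """Assumed model: offsets u and v interfere when their distance
    (in either direction, modulo II) is a forbidden latency."""
    adj: Dict[int, Set[int]] = {u: set() for u in range(ii)}
    for u in range(ii):
        for v in range(u + 1, ii):
            if (v - u) % ii in forbidden or (u - v) % ii in forbidden:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def maximal_independent_sets(adj: Dict[int, Set[int]]) -> Iterator[FrozenSet[int]]:
    """Enumerate every maximal independent set exactly once.

    An independent set of the graph is a clique of its complement, so this
    runs Bron-Kerbosch (without pivoting) on the complement graph.
    """
    vertices = set(adj)

    def compatible(v: int) -> Set[int]:
        # Neighbours of v in the complement graph.
        return vertices - adj[v] - {v}

    def expand(chosen: Set[int], candidates: Set[int],
               excluded: Set[int]) -> Iterator[FrozenSet[int]]:
        if not candidates and not excluded:
            yield frozenset(chosen)  # nothing can be added: maximal
            return
        for v in sorted(candidates):
            yield from expand(chosen | {v},
                              candidates & compatible(v),
                              excluded & compatible(v))
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), vertices, set())
```

For instance, with the hypothetical values II = 4 and forbidden latency set {2}, offsets 0 and 2 interfere, as do 1 and 3, and the enumeration yields the four sets {0, 1}, {0, 3}, {1, 2}, and {2, 3}, each exactly once.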
An efficient algorithm for this is reported in [19]. This enumeration of maximal independent sets is a direct method for the generation of all OffSets. An attractive feature of this approach is that it [...]
Redundancy Prevention Method for State Diagram Construction
In this section, we present the details of the RP method, which exploits the structure of the state diagram construction to identify and eliminate redundancy at the earliest opportunity.
Further, an attractive feature of the RP method is that it follows an aggressive approach to redundancy elimination, even though this may mean missing a few OffSets. However, our experimental results show that this aggressiveness pays off in state diagrams for real architectures without much loss of information.
The Redundancy Prevention Algorithm
In this discussion we will assume that the construction of the state diagram proceeds top down, in a depth-first fashion. In our example state diagram (Figure 2) [...] The basic steps in the RP method are:
(1) Follow the depth-first construction rule to construct the state diagram (as in Procedure A.1 in Appendix A).
(2) If a given state $S$ has already created $k$ child states $S_1, S_2, \ldots, S_k$ and is going to create a new child state $S_{k+1}$ (note that, at this time, all states in the subtrees rooted at $S_1, S_2, \ldots, S_k$ have been constructed), the redundancy constraint checks whether the issue time (offset value) of $S_{k+1}$ has occurred in the subtrees rooted at $S_1, S_2, \ldots, S_k$. If so, the state $S_{k+1}$ will not be created; otherwise it is added to the state diagram.
Step 2 sounds time-consuming because, for each new child state to be created, we need to check all states in the left sibling subtrees. Note, however, that the range of offset values is fixed, i.e., from 0 to II - 1. Therefore, it is not difficult to construct an implementation of the RP method in which the redundancy constraint (RC) check is a simple table lookup. The table, called "left sibling states", records the issue times of all states in the left sibling subtrees. The size of this table depends on the model, but is very small. Further, for computational efficiency, the RP method constructs the state diagram as a tree rather than as a directed acyclic graph [7]. This means that certain states, e.g., the final state of the state diagram, which contains an empty permissible latency set, may be repeated several times (states $S_5$, $S_9$, and $S_{10}$), as shown in Figure 2.
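The two steps and the table lookup described above can be sketched as follows. This is only a simplified illustration under stated assumptions, not Procedure A.1: a state is assumed to be identified by its issue offset and the list of offsets still permissible after it, children are assumed to be created left to right in increasing offset order, and two offsets are assumed compatible when their difference modulo II is not a forbidden latency. The names `rp_offsets`, `build`, and `left_sibling_states` are hypothetical.

```python
from typing import List, Set


def rp_offsets(ii: int, forbidden: Set[int]) -> List[List[int]]:
    """Depth-first tree construction with the redundancy constraint (RC).

    Returns the OffSets (one list of offsets per root-to-leaf path) that
    survive the pruning. Simplified, assumed state model; see lead-in.
    """
    def compatible(a: int, b: int) -> bool:
        return (b - a) % ii not in forbidden and (a - b) % ii not in forbidden

    paths: List[List[int]] = []

    def build(offset: int, permissible: List[int], path: List[int]) -> List[bool]:
        """Build the subtree rooted at the state issued at `offset`.

        Returns a table of size II marking every issue time that occurs
        in this subtree (including the root of the subtree).
        """
        seen = [False] * ii
        seen[offset] = True
        if not permissible:          # final state: empty permissible set
            paths.append(path)
            return seen
        # Issue times of all states in the left sibling subtrees so far.
        left_sibling_states = [False] * ii
        for o in permissible:        # children are created left to right
            if left_sibling_states[o]:
                continue             # RC holds: do not create this child
            child_perm = [q for q in permissible if q > o and compatible(o, q)]
            subtree = build(o, child_perm, path + [o])
            for t in range(ii):
                if subtree[t]:
                    left_sibling_states[t] = True
                    seen[t] = True
        return seen

    root_perm = [q for q in range(1, ii) if compatible(0, q)]
    build(0, root_perm, [0])
    return paths
```

Because the first child of any state is never pruned, every state with a non-empty permissible set keeps at least one child, so each surviving path ends in a final state. The RC check itself is a single lookup in a table of II entries, which is the point made in the paragraph above.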
The attractive features of this approach are as follows. First, it exploits the structure of state diagram construction (top-down and left-to-right) to eliminate redundancy at the earliest opportunity. Second, the RP method is aggressive in redundancy elimination. Lastly, by comparison, the E-MIS method is an exact method, and hence has the overhead of having to construct the complete interference graph before generating the OffSets.
Properties of RP Method
Note that the RP method uses the redundancy constraint to eliminate states that could possibly lead to redundant paths. This raises two questions: (i) Does the RP method eliminate all redundant paths? (ii) Will the RP method ever eliminate a path that is non-redundant? We answer these two questions in this subsection. The proof of this theorem proceeds by showing that, at any point in time, the partial state diagram tree constructed so far does not have any redundancy. The details are given in Appendix B.
We remark that the RP method uses the redundancy constraint, which is only a necessary, but not sufficient, condition. As a result, it may miss some non-redundant paths due to the aggressive pruning it employs. However, we argue that there is still a good chance that the "missed" OffSets will reappear in some other paths of the state diagram. In Section 6 we study empirically how many non-redundant paths RP misses in the state diagram and whether they have any influence on the constructed software pipelined schedule.
Experimental Results
In this section we report our experience in generating the OffSets using the proposed methods 
Construction Time Comparisons
In order to compare the construction speed of the RP, E-MIS, and RD methods in a fair manner, all three methods were run to generate a large subset of OffSets, consisting of the first 100,000 distinct OffSets. Table 4 (in Appendix C) compares the construction time for OffSet generation for the three methods on an Ultra-450 workstation (with a 250 MHz clock). We observe that the RD method is much slower than RP and E-MIS. For example, for the Alpha architecture with II = 8, the RD method failed to generate 100,000 distinct OffSets even after running for 24 hours. Hence, for larger values of II, we compared only the RP and E-MIS methods. Figure 4 shows the construction time taken by the E-MIS method, normalized with respect to the RP method, for various values of II for the two architectures. We observe that the RP and E-MIS methods are competitive: E-MIS performs better for small values of II, while RP performs better for moderate to large IIs. This is not surprising, as RP is a very efficient heuristic that we employ to get the OffSets quickly, while the E-MIS method, being inherently an exact method, is slow, especially for large values of II. This is because the E-MIS method needs to construct the entire (large) interference graph before generating the OffSets.
How Many Paths Does RP Miss?
As mentioned earlier the RP method can miss some non-redundant paths during the construction. It is important to verify that RP will not miss too much useful information. On the other hand, since E-MIS is capable of generating all distinct OffSets, we can compare the number of OffSets generated by these two methods if these methods were allowed to complete their execution without any restriction on the maximum number of OffSets generated. 
Application to Co-Scheduling
Lastly, we attempt to answer the question of how critical the missed OffSets, if any, are in terms of the quality of the constructed software pipelined schedule when the enhanced Co-Scheduling method uses the OffSets generated by the RP method. Since the RP method did not miss any path for the DEC Alpha and Cydra processors (for the values of II with which we could experiment), its application in the enhanced Co-Scheduling method should perform at least as well as any other OffSet generation method, e.g., the E-MIS or RD methods, applied to Co-Scheduling. What happens if the RP method does miss some OffSets? To answer this question, we considered the reservation tables used in [7], which model function units with complex resource usage, but without any sharing of resources. In these reservation tables the RP method misses a few paths even for small values of II. For example, for one of the reservation tables, the RP method generated only 22 out of 26 distinct OffSets. Also, to reduce the complexity of the enhanced Co-Scheduling method, the scheduling method used all distinct OffSets, up to a maximum of 1000, generated by either the RP or the E-MIS method. The enhanced Co-Scheduling method was applied to a set of 737 loops, each consisting of a single basic block, extracted from scientific benchmark programs. We compare the constructed software pipelined schedules, in terms of the initiation interval and the construction time for the schedule, in [...]
Summary of Results
To summarize our experimental results:
The RD method is too slow in the construction of state diagram for architectures in which function units share resources. Hence it is impractical to apply this for modeling real processors.
Both the RP and E-MIS methods are much faster than RD, by 3 to 4 orders of magnitude, and their construction time is acceptable for implementation in real compilers.
In terms of construction speed, the E-MIS method performs better for small values of II while RP method performs better for large II values.
Though the RP method cannot guarantee that it always generates all distinct OffSets, we found that it generated all non-redundant OffSets in the experiments that we conducted. Further, we found that the generated OffSets, when applied to the Co-Scheduling method, achieve equally good performance in terms of the constructed schedule.
Related Work
The use of finite state automata for modeling resource usage was proposed in [2, 13, 15] for instruction scheduling methods. These methods use ideas from classical pipeline theory, especially the notion of forbidden and permissible latency sequences. The size of the constructed automaton was found to be large in [13], which was subsequently improved by Proebsting [15] and Bala [2]. The Co-Scheduling framework proposed in [6] and subsequently extended in [7] is a state diagram based software pipelining method. The size of the constructed state diagram is an even more serious problem in software pipelining than in instruction scheduling. In this paper, we have proposed two methods to efficiently generate the distinct OffSets of a state diagram. Both methods avoid redundancy in the construction and hence are found to be very efficient. Compared to the reduced state diagram construction method, the proposed RP and E-MIS methods result in a significant reduction in the construction time, by 3 to 4 orders of magnitude. Lastly, an alternative method to model resource usage was proposed by Eichenberger and Davidson [4]. This method relies on the use of a global resource table, but reduces the cost of structural-hazard checking by reducing the machine description. Their approach uses forbidden latency information to obtain a minimal representation for the individual reservation tables of different function units.
Conclusions
In this paper we have proposed two efficient methods to construct distinct paths in the state diagram used for modeling complex pipeline resource usage in software pipelining. The methods proposed in this paper, namely the E-MIS method and the RP method, completely eliminate the generation of redundant paths. The first of these methods, the E-MIS method, is obtained by relating OffSet generation to a well-known graph theoretic problem, viz., the enumeration of maximal independent sets of an interference graph. The second method, the RP method, uses a simple but very effective heuristic to prevent the construction of states that may cause redundant paths. We formally establish that this heuristic results in a redundancy-free state diagram. We compare the performance of the RP and E-MIS methods in modeling two real processors, namely the DEC Alpha processor and the Cydra VLIW processor. The RP and E-MIS methods were found to be far superior to the RD method, by 3 to 4 orders of magnitude. When compared with each other, the RP and E-MIS methods perform competitively, with RP performing better for larger values of II and E-MIS performing better for small II. Lastly, we have applied the OffSets generated by these methods in the enhanced Co-Scheduling method and reported their performance.

The time elapsed between two initiations in a pipeline is termed latency. A latency is said to cause a collision if the two instructions require the same stage of the pipeline at the same time. Multiple operations can be processed simultaneously in the pipeline as long as there is no collision. A latency that results in a collision is called a forbidden latency. The distances between pairs of X marks in the rows of the CRT determine the forbidden latencies in an MS-pipeline.
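The distance rule above can be made concrete with a short sketch: a latency f is treated as forbidden when some row of the CRT has X marks at time steps t and (t + f) mod II. The representation of the CRT as one set of marked time steps per row, the function name, and the restriction to latencies 1 to II - 1 are assumptions made for illustration only.

```python
from typing import List, Set


def forbidden_latencies(crt: List[Set[int]], ii: int) -> Set[int]:
    """Collect the forbidden latencies of a cyclic reservation table.

    crt[s] is the set of time steps (taken modulo II) at which row
    (pipeline stage) s carries an X mark. Latency 0 is not reported;
    only latencies in 1 .. II-1 are considered (an assumption).
    """
    forbidden: Set[int] = set()
    for row in crt:
        for t in row:
            for f in range(1, ii):
                # Marks at (s, t) and (s, (t + f) mod II) collide.
                if (t + f) % ii in row:
                    forbidden.add(f)
    return forbidden


def permissible_latencies(crt: List[Set[int]], ii: int) -> Set[int]:
    """Latencies in 1 .. II-1 that are not forbidden."""
    return set(range(1, ii)) - forbidden_latencies(crt, ii)
```

With a hypothetical two-row table `[{0, 2}, {1}]` and II = 4, only latency 2 is forbidden, so latencies 1 and 3 are permissible. With `[{0, 1}]` and II = 5, latencies 1 and 4 are both forbidden, illustrating that f and II - f are forbidden together.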
If there exists a row $s$ in the CRT such that both $(s, t)$ and $(s, (t + f) \bmod II)$ contain an X mark, then $f$ is a forbidden latency. In an MS-pipeline, if $f$ is forbidden, then $II - f$ is also forbidden. The latencies that are not forbidden are termed permissible. The forbidden and permissible latency sets for the reservation table shown in Figure 1( [...]

Step 1 Start with the initial state having the initial permissible latency set $S_0$.
Step 2 [...]

[...] $\rightarrow S_n$ be the path with the corresponding OffSet $\{o_0, o_1, \ldots, o_n\}$. Now we will show that $P$ is a non-redundant path. The proof is by contradiction. Suppose $P$ is redundant. In the path $P$, let $S_{k_1}$ be a state such that all states $S_1, S_2, \ldots, S_{k_1 - 1}$ are left-most children of their parents (respectively, $S_0, S_1, \ldots, S_{k_1 - 2}$) and $S_{k_1}$ is not a left-most child. Note that $k_1$ can be such that $1 \le k_1 \le n$ (refer to Figure 5). Since $P$ is redundant, either (A) the offset value $o_{k_1}$ corresponding to the state $S_{k_1}$ must appear in the left-sibling tree(s) of $S_{k_1}$, or (B) the offset values $o_{k_1 + 1}, \ldots, o_n$ must appear in the subtree rooted at $S_{k_1}$. Clearly (A) cannot be true, since if $o_{k_1}$ appears in the left-sibling subtree, then $S_{k_1}$ satisfies RC, and hence would have been eliminated by the RP method. Suppose (B) is true, i.e., $o_{k_1 + 1}, \ldots, o_n$ appear in the subtree rooted at $S_{k_1}$. In the path [...]
C Details of Experimental Results
In this section we report the detailed comparison of the execution time of RP and E-MIS methods for constructing the state diagram, up to a maximum of 100,000 distinct OffSets. 