Code generation methods for DSP applications are hampered by the combination of tight timing constraints imposed by the performance requirements of DSP algorithms, and resource constraints imposed by a hardware architecture. In this paper, we present a method to analyze resource and timing constraints in a single model. The analysis identifies sequencing constraints between operations additional to the precedence constraints. Without the explicit modeling of these sequencing constraints, a scheduler is often not capable of finding a solution that satisfies the timing and resource constraints. The presented approach results in an efficient method of obtaining high quality instruction schedules.
Introduction
Digital Signal Processing (DSP) algorithms have the following characteristics: they allow a number of operations to execute in parallel, they are executed periodically, and they have tight timing requirements. These algorithms can be represented using a signal flow graph (SFG), consisting of operations, data flow edges and sequence edges. The precedence constraints expressed in such an SFG, combined with the tight timing requirements, already pose a problem for most scheduling methods that target efficient hardware. However, mapping the DSP algorithm on a hardware platform also introduces hardware constraints, which make the schedule problem even more difficult to handle. The hardware constraints from the mapping depend on the choice of hardware platform. We consider two types of hardware on which to map a DSP algorithm: ASIPs ([9], application domain specific hardware), and digital signal processors. ASIPs are tuned towards specific application domains and have become popular due to their advantageous trade-off between flexibility and cost. The fixed architecture of an ASIP introduces resource constraints for the instruction scheduler in the code generation process.
General purpose digital signal processors offer more flexibility than ASIPs. They also introduce resource constraints as a result of a fixed architecture. For reasons of hardware efficiency and code density, the instruction set of both an ASIP and a general purpose DSP is often irregular, excluding certain types of parallel execution. This introduces another kind of hardware constraint a scheduler has to deal with.
In order to obtain efficient code, it is necessary for a scheduler to take into account resource constraints, instruction conflicts, and timing constraints originating from the application. The combination of these types of constraints poses a problem for traditional scheduling techniques. Many scheduling heuristics have been proposed, each with an optimization criterion. They are however unable to cope with the interactions of different types of constraints. In this paper we therefore address the schedule problem as a feasibility problem and propose a solution strategy based on the analysis of all the constraints. Section 2 discusses some related work, illustrating the difficulty of scheduling with tight constraints, and proposes a global solution strategy. In Section 3 we show how all the different constraints can be combined in the model of a precedence graph. In Section 4 we describe the method of analysis and the interaction with a simple off-the-shelf scheduler to obtain a schedule, and in Section 5 some results will be presented.
1080-1820/97 $10.00 © 1997 IEEE
Related work and contributions
As a result of the difficulty in solving the DSP scheduling problem, a range of heuristics, each with its own optimization criterion, has been proposed. Probably the most widely used scheduling heuristic in high-level synthesis [7] is the list scheduler [1]. This heuristic takes a fixed set of resources, and optimizes latency (completion time). A well-known heuristic that optimizes resources given a constraint on the latency and throughput is force-directed scheduling [2]. Furthermore, in order to obtain higher utilization rates for the resources, it is useful to let executions of different loop iterations overlap. When using a counter-based controller, like in the Phideo compiler [4], this can be done very efficiently. In a micro-coded controller, overlapping executions are realized by software pipelining [3], also called loop pipelining or loop folding. Such methods assume that the scheduler is able to handle resource conflicts between operations belonging to different loop iterations. The difficulty in handling these inter-iteration conflicts is illustrated with a small example in Figure 1. In this figure, a precedence graph of 5 operations is given (the arrows indicate a precedence relation). In order to meet the constraint of 3 clock cycles on the initiation interval (II), loop folding has to be applied (indicated by the arrow in Figures 1b and 1c). Because folding introduces extra code, we do not want to fold more than once, which constrains the latency to 6 clock cycles. In Figure 1b the result of a list scheduler is shown. The left column contains the time potential (schedule time modulo II). The list scheduler greedily schedules A, B, and C as soon as possible (ASAP), and concludes that D cannot be scheduled. In Figure 1c a valid schedule is given. The key to obtaining this schedule is to postpone B one clock cycle relative to its ASAP value. However, most schedule heuristics lack the scope to postpone operations.
The scope of a scheduler could be expressed as the schedule freedom (in clock cycles per operation) available to it. The scope of most scheduling heuristics is limited to the apparent schedule freedom (also called mobility), calculated as the difference between the ALAP (as late as possible) value and the ASAP (as soon as possible) value, based on the precedences between operations, but without taking the resource conflicts into account. In Figure 1, the apparent freedom is 1 clock cycle per operation. The reader can verify that the combination of precedence, resource, latency, and throughput constraints leaves no schedule freedom at all: the schedule in Figure 1c is the only feasible one.
The general problem statement for finding a feasible schedule, as depicted above, is as follows.
Problem 1: Given a (cyclic) signal flow graph (SFG), a set of resource conflicts, a latency, and an initiation interval (II), find a schedule.
In this context, resource conflicts include conflicts on execution units, bus access, memory access, and instruction conflicts. Note that two tasks have been ignored in the problem statement: finding the optimal initiation interval, and determining a resource binding for the operations. We assume that instruction selection has been done (for example using method [12]) prior to the scheduling phase, thus providing a resource binding. The optimal II is found as follows. We start with a lower bound based on loop-carried dependencies [11] and available resources. When this II is not feasible, it is incremented by one clock cycle. Profiling suggests that the optimal II is usually only one or two clock cycles away from the lower bound.
A special case of Problem 1 has been proven NP-complete in [8]. Because of this inherent difficulty and the limited success of scheduling heuristics on heavily constrained applications, we propose a solution strategy based on the interaction between analyzing the constraints and making scheduling decisions. The problem statement for the analyzer is then as follows.
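The search for the initiation interval described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `resource_lower_bound` and `find_ii`, the `is_feasible` callback, and the `max_ii` search limit are hypothetical, and the resource bound assumes that every operation occupies its resource for exactly one clock cycle. A bound derived from loop-carried dependencies would be combined with the resource bound by taking the maximum of the two.

```python
from math import ceil

def resource_lower_bound(ops_per_resource, units_per_resource):
    """Smallest II that gives every resource type enough slots per iteration.

    Assumption: each operation occupies one unit of its resource for one cycle.
    """
    return max(
        ceil(n_ops / units_per_resource[r])
        for r, n_ops in ops_per_resource.items()
    )

def find_ii(lower_bound, is_feasible, max_ii):
    """Try II = lower_bound, lower_bound + 1, ... until one is feasible."""
    for ii in range(lower_bound, max_ii + 1):
        if is_feasible(ii):
            return ii
    return None  # no feasible II within the search limit
```

Here `is_feasible` stands for a complete attempt to solve Problem 1 at a given II, so each increment repeats the whole analysis and scheduling effort.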
Problem 2: Given an SFG, a set of schedule decisions, a set of resource conflicts, a latency, and an initiation interval, reduce the schedule freedom as much as possible without excluding any feasible solutions.
In Section 4 we will use analyses to tackle Problem 2, and then show how these analyses can be used to tackle Problem 1. Problem 2 is essentially constraint analysis. Not much work has been done in this field. In [5] the combination of resource and timing constraints is analyzed in a general way. Because it is difficult to extend this work to allow loop folding, it is not suitable for our specific schedule problem. Loop folding in the context of resource and timing constraints is considered in [6]. In this work, a bipartite matching formulation is used to analyze the matching of execution intervals of operations to execution intervals of resources. Reductions in the execution intervals are obtained by showing that some matchings can never be part of a complete matching. The bipartite-matching approach is based on the concept of execution interval. The approach taken in this paper is based on sequence relations between operations. In Section 5, we will compare these two approaches, and show that our approach is able to obtain further reductions in the schedule freedom after the analysis done in [6].
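The apparent schedule freedom (mobility) that these reductions start from can be computed directly from the precedences. The sketch below is illustrative only. It assumes that the precedence graph of Figure 1a is a simple chain A, B, C, D, E of unit-delay operations, which is consistent with the paths mentioned in the text but is not confirmed by the figure itself, and the function name `mobility` is hypothetical.

```python
def mobility(ops, preds, delay, latency):
    """Apparent freedom (ALAP - ASAP) per operation, ignoring resource conflicts.

    ops must be listed in topological order; preds maps an operation to the
    list of its predecessors; every operation must finish within `latency`.
    """
    asap = {}
    for v in ops:
        asap[v] = max((asap[p] + delay[p] for p in preds.get(v, [])), default=0)

    succs = {v: [] for v in ops}
    for v in ops:
        for p in preds.get(v, []):
            succs[p].append(v)

    alap = {}
    for v in reversed(ops):
        # v must finish before its earliest successor starts (or by the latency)
        alap[v] = min((alap[s] for s in succs[v]), default=latency) - delay[v]

    return {v: alap[v] - asap[v] for v in ops}

# Assumed shape of Figure 1a: a chain of five unit-delay operations, latency 6.
OPS = ["A", "B", "C", "D", "E"]
PREDS = {"B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}
DELAY = {v: 1 for v in OPS}
FREEDOM = mobility(OPS, PREDS, DELAY, latency=6)
```

Under these assumptions every operation has an apparent freedom of one clock cycle, matching the value quoted for Figure 1.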
Modeling the constraints
The analyzer needs a uniform model of the DSP algorithm, which enables the expression of all constraints mentioned in the introduction. For this purpose, the model of a (directed) precedence graph is used [10]. It consists of a set $V$ of vertices representing operations, and a set $A \subseteq V \times V$ of arcs, called precedence edges or sequence edges, representing precedence relations between operations. For $v \in V$ let $s(v)$ denote the start time of operation $v$. An arc $(v_i, v_j)$ with weight $w$ indicates that $s(v_j) \ge s(v_i) + w$. Two dummy operations are added to the precedence graph: a source and a sink. The source operation is always the first operation to execute, and the sink the last one. We will now show how constraints can be incorporated in this model.
• Data dependencies. The translation of the data dependencies in this model is straightforward: each dependency translates to an arc with weight equal to the execution delay of the producing operation.
• Latency. A constraint $l$ on the latency is translated to an arc (sink, source) with $w = -l$, as illustrated in Figure 2. This is interpreted as $s(\mathit{source}) \ge s(\mathit{sink}) - l$, which is equivalent to $s(\mathit{sink}) \le s(\mathit{source}) + l$, meaning the last operation may not be executed more than $l$ clock cycles after the start of the first operation.
• Microcoded controller and loop folding. We assume that the architecture contains a microcoded controller. As a consequence, the same code is executed every loop iteration. This implies that a variable is written in the same register each iteration. When loop iterations overlap, we have to ensure that a variable is consumed before it is overwritten by the next production. Since subsequent productions are exactly II (initiation interval) clock cycles apart, a variable cannot be alive longer than II clock cycles. So the operation C that consumes a variable must execute within II clock cycles after the operation P that produces the variable. Just like the latency constraint, a necessary and sufficient translation to the precedence model is that for each data dependency (P, C) there is an arc (C, P) with $w = -II$.
• [...]tions can be directly translated to a sequence edge. When an operation $v$ is fixed at a certain clock cycle $c$, we need two sequence edges as indicated in Figure 4.
• Resource conflicts and instruction set conflicts. We use method [6] to model instruction set conflicts as resource conflicts. To incorporate resource conflicts we need an additional formalism.
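The translations listed above can be collected into one small sketch. It is a minimal illustration under stated assumptions, not the authors' data structure: the function names `build_arcs` and `satisfies` are hypothetical, the source is taken to start at clock cycle 0, and fixing an operation $v$ at cycle $c$ is assumed to use the arc pair (source, v) with weight $c$ and (v, source) with weight $-c$, since Figure 4 is not available here.

```python
def build_arcs(ops, data_deps, delay, latency, ii, fixed=None):
    """Return {(u, v): w}, each entry meaning s(v) >= s(u) + w."""
    arcs = {}

    def add(u, v, w):
        # keep the strongest constraint if an arc already exists
        arcs[u, v] = max(w, arcs.get((u, v), w))

    for v in ops:
        add("source", v, 0)        # source executes first
        add(v, "sink", delay[v])   # sink executes after every operation
    for p, c in data_deps:
        add(p, c, delay[p])        # data dependency
        add(c, p, -ii)             # value must be consumed within II cycles
    add("sink", "source", -latency)  # latency constraint
    for v, cycle in (fixed or {}).items():
        add("source", v, cycle)    # assumed encoding of s(v) >= c
        add(v, "source", -cycle)   # assumed encoding of s(v) <= c
    return arcs

def satisfies(start, arcs):
    """Check a candidate schedule against every arc of the precedence graph."""
    return all(start[v] >= start[u] + w for (u, v), w in arcs.items())
```

With all constraints expressed as weighted arcs, a schedule is feasible with respect to the modeled constraints exactly when every arc inequality holds, which is what `satisfies` tests.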
Constraint analysis and schedule interaction

Constraint analysis
In this section we elaborate on the solution strategy for Problems 1 and 2. First we focus on Problem 2: the reduction of the schedule freedom based on the analysis of the constraints. This reduction is made explicit by adding sequencing constraints (edges). Such sequencing constraints are easily handled by all scheduling heuristics, so they effectively reduce the schedule freedom available to the scheduler. The sequence constraints enable the analyzer to observe the different constraints in a uniform model (precedence graph). By incorporating all the constraints in a uniform model we can efficiently explore the interactions between the different constraints. One can think of rules that express the interaction between different types of constraints in terms of a sequence constraint. We show two effective rules that involve the combination of resource conflicts, precedence, and timing constraints. The extension of our approach to incorporate register conflicts is illustrated by a simple rule that sequentializes variable lifetimes.
All the rules used in our approach use the concept of a path between operations. In the following examples a path is indicated using a dashed arc labelled with the length of the path.
The first rule presented below affects the timing relation between conflicting operations. It is based on the fact that two conflicting operations cannot be scheduled at the same potential. The time potential associated with a time $t$ is $t \bmod II$. So if the distance between these operations would cause them to be scheduled at the same potential, the distance has to be increased by at least one clock cycle. This rule will help us to solve the schedule problem in Figure 1. From the discussion in Section 2 we concluded that the difficulty of the schedule problem is to put a gap of one clock cycle between A and B. So our goal is to derive that $d(A,B) = 2$. In Figure 5 this derivation is given. Figure 5a represents the precedence model of Figure 1a. In Figure 5a we see a path A→B→C→D of length $3 \equiv 0 \pmod{II}$ from A to D. According to rule 1 we can add a sequence edge A→D of weight 4 because A and D have a resource conflict. This edge is drawn in Figure 5b. Next there is a path D→E→sink→source→A→B of length $1 + 1 - 6 + 0 + 1 = -3$ clock cycles. Because of the resource conflict D-B, this length has to be increased by one clock cycle. This gives a sequence edge D→B of weight $-2$, as given in Figure 5c. We conclude by finding a path A→D→B of length $4 - 2 = 2$ clock cycles. In Figure 5d the associated sequence edge (A,B) of weight 2 is explicitly drawn. The precedence relations now completely fix the schedule.
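The derivation above can be replayed mechanically. The sketch below is an illustration under assumptions, not the authors' implementation: it takes rule 1 to mean "if two conflicting operations have a longest-path distance that is a multiple of II, add a sequence edge one clock cycle longer", it assumes the graph of Figure 1a is the unit-delay chain A to E, it uses only the two conflicts named in the text (A-D and D-B), and it recomputes all longest paths with a Floyd-Warshall style pass rather than updating them incrementally. All function and variable names are hypothetical.

```python
NEG_INF = float("-inf")

def longest_paths(vertices, arcs):
    """All-pairs longest path lengths; valid while there is no positive cycle."""
    d = {(u, v): NEG_INF for u in vertices for v in vertices}
    for v in vertices:
        d[v, v] = 0
    for (u, v), w in arcs.items():
        d[u, v] = max(d[u, v], w)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                d[i, j] = max(d[i, j], d[i, k] + d[k, j])
    return d

def find_violation(d, conflicts, ii):
    """Return a conflicting pair (u, v) whose distance is a multiple of II."""
    for a, b in conflicts:
        for u, v in ((a, b), (b, a)):
            if d[u, v] > NEG_INF and d[u, v] % ii == 0:
                return u, v
    return None

def apply_rule1(vertices, arcs, conflicts, ii):
    """Add sequence edges until no conflicting pair shares a potential."""
    arcs = dict(arcs)
    while True:
        d = longest_paths(vertices, arcs)
        if any(d[v, v] > 0 for v in vertices):
            raise ValueError("positive cycle: no feasible schedule")
        pair = find_violation(d, conflicts, ii)
        if pair is None:
            return arcs, d
        u, v = pair
        arcs[u, v] = d[u, v] + 1  # push the pair one clock cycle apart

# Assumed precedence model of Figure 1a with latency 6 and II = 3.
VERTICES = ["source", "A", "B", "C", "D", "E", "sink"]
ARCS = {
    ("source", "A"): 0, ("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1,
    ("D", "E"): 1, ("E", "sink"): 1, ("sink", "source"): -6,
}
CONFLICTS = [("A", "D"), ("D", "B")]
NEW_ARCS, DIST = apply_rule1(VERTICES, ARCS, CONFLICTS, ii=3)
```

On this input the procedure adds the edges A→D and D→B discussed in the text, after which the distance from A to B is two clock cycles. The loop stops because every added edge strictly increases a distance that the latency arc bounds from above.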
The second rule we present in this paper is more complicated, and involves symmetry in the precedence graph. Consider the small piece of precedence graph depicted in Figure 6. The distance from A to D is two clock cycles. However, B and C have to be ordered because they have a resource conflict, and both possible orderings will result in $d(A,D) = 3$. Rule 1 will not help us here. In DSP algorithms, this type of symmetry is a very frequent phenomenon and has to be dealt with in the analyzer. Rule 2 will cope with this issue in the context of loop folding.
[Figure 6: Too much apparent mobility due to symmetry (resource conflict: B-C)]
In Figure 7, the symmetry is of a slightly different kind. As can be seen in the ASAP schedule, the only way the minimum distance of 4 clock cycles from A to H can be realized is to schedule operations B and G at the same potential. Because B and G have a resource conflict, the distance from A to H is not 4, but 5 clock cycles. This also follows from [...]
The next rule (rule 3) considers register conflicts.
Rule 3: Let variable $v_1$, produced by operation $p_1$ and consumed by $c_1$, and variable $v_2$, produced by operation $p_2$ and consumed by $c_2$, reside in the same register. If $d(p_1, p_2) \ge 0$ we can add a sequence edge $(c_1, p_2)$ with weight 0.
Rule 3 is illustrated in Figure 8. The variables $v_1$ and $v_2$ are bound to the same register. This is often a priori known because most general purpose DSPs have special purpose registers and flag registers. As a result, the binding of some variables is fixed when special purpose registers or flags are used. If there is a path of positive length from $p_1$ to $p_2$, then the whole lifetime of variable $v_1$ has to precede the lifetime of $v_2$. This is made explicit by adding a sequence edge from the consumer $c_1$ to the producer $p_2$.
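Rule 3 can be sketched as a single pass over the variables bound to each register. This is a hedged illustration: the function name `apply_rule3`, the `lifetimes_by_register` layout, and the representation of distances as a dictionary keyed by operation pairs are all assumptions made for the example. Pairs without a known path are treated as unordered.

```python
NEG_INF = float("-inf")

def apply_rule3(d, arcs, lifetimes_by_register):
    """Sequentialize variable lifetimes that share a register.

    d maps (u, v) to the longest known path length from u to v;
    lifetimes_by_register maps a register name to (producer, consumer) pairs.
    Returns a copy of arcs extended with the edges (c1, p2) of weight 0.
    """
    new_arcs = dict(arcs)
    for lifetimes in lifetimes_by_register.values():
        for p1, c1 in lifetimes:
            for p2, c2 in lifetimes:
                if (p1, c1) == (p2, c2):
                    continue
                if d.get((p1, p2), NEG_INF) >= 0:
                    # the first value must be consumed before the second is produced
                    new_arcs[c1, p2] = max(0, new_arcs.get((c1, p2), 0))
    return new_arcs
```

For two variables in one hypothetical register `r0` with a known distance of 2 from `P1` to `P2`, the call `apply_rule3({("P1", "P2"): 2}, {}, {"r0": [("P1", "C1"), ("P2", "C2")]})` yields the single new edge from `C1` to `P2`.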
Complexity
In our implementation the longest path between each pair of operations is maintained. So if a new edge is added, the impact on the current longest paths has to be calculated. Therefore the complexity of adding a sequence edge is the dominant factor in run time. This complexity is essentially determined by the number of paths that need to be updated as a result of the new sequence edge. Because we are only interested in the longest paths found so far, the number of updates equals $|V|^2$ in the worst case. In most cases, the addition of a sequence edge will affect only a few paths. In cases where a lot of paths need to be updated, the reduction in schedule freedom will also be substantial. An upper bound on the number of path updates (as a result of adding a sequence edge) can be derived as follows. A path can have a length between $-l$ and $+l$, where $l$ is the constraint on the latency. Because a path is updated only if its length is increased (by at least one clock cycle), the number of times a path can be updated is at most $2l$. Since the maximum number of paths we keep track of equals $|V|^2$, the number of path updates can be at most $2l \cdot |V|^2$. A single path update takes constant time, so the run time of the analysis is polynomially bounded.
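The update step whose cost is counted above can be sketched as follows. This is one plausible way to maintain an all-pairs longest-path table when a single edge is added, not necessarily the update used in the authors' implementation. The function name `add_sequence_edge` and the table layout (a dictionary keyed by vertex pairs, with 0 on the diagonal and negative infinity for unconnected pairs) are assumptions.

```python
NEG_INF = float("-inf")

def add_sequence_edge(d, vertices, u, v, w):
    """Insert arc (u, v) with weight w into the longest-path table d in place.

    Returns the number of table entries that were increased. Raises
    ValueError if the new arc would close a cycle of positive length.
    """
    if d[v, u] + w > 0:
        raise ValueError("positive cycle: constraints are infeasible")
    if d[u, v] >= w:
        return 0  # the new arc is implied by an existing path
    updates = 0
    for i in vertices:
        if d[i, u] == NEG_INF:
            continue
        for j in vertices:
            if d[v, j] == NEG_INF:
                continue
            # longest path from i to j that uses the new arc exactly once
            candidate = d[i, u] + w + d[v, j]
            if candidate > d[i, j]:
                d[i, j] = candidate
                updates += 1
    return updates
```

Each call inspects at most $|V|^2$ pairs and does constant work per pair. Because the positive-cycle check runs first, the entries read inside the loop cannot change during the loop, which is why the in-place update is safe in this sketch.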
Interaction with an instruction scheduler
Rules 1 to 3 enable us to tackle Problem 2, mentioned in Section 2. Problem 1 is handled by interaction between the analyzer and a scheduler. For this interaction to work we need to address two issues:
• The analyzer can handle schedule decisions. Schedule decisions can be modelled as indicated in Section 3.
• The scheduler can take the analysis results into account. These results are in terms of sequence edges and thus express precedence. The least one can expect from an instruction scheduler is to take precedences into account.
Furthermore, we should not forget that we are dealing with an NP-hard problem. So it is very well possible that the analyzer is unable to prevent a scheduler from making an infeasible decision, thus forcing us to backtrack.
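The interaction described above, including the backtracking step, can be sketched as a depth-first search around an analyzer callback. Everything in this example is hypothetical: the driver `schedule`, the convention that the analyzer returns an execution window per operation or `None` when it detects infeasibility, and above all `toy_analyze`, which is a deliberately weak stand-in (two conflicting operations P and Q with Q at least two cycles after P, a latency of 3, and II = 2) rather than the analysis of this paper.

```python
def schedule(ops, analyze, decisions=None):
    """Alternate analysis and schedule decisions; backtrack on infeasibility."""
    decisions = dict(decisions or {})
    windows = analyze(decisions)  # {op: (asap, alap)} or None if infeasible
    if windows is None:
        return None
    for op in ops:
        lo, hi = windows[op]
        if lo < hi:  # op still has freedom: branch on its start time
            for cycle in range(lo, hi + 1):
                found = schedule(ops, analyze, {**decisions, op: cycle})
                if found is not None:
                    return found
            return None  # every choice failed: backtrack further up
    return {op: windows[op][0] for op in ops}

def toy_analyze(decisions, ii=2, latency=3):
    """Weak stand-in analyzer: it notices the conflict only once both are fixed."""
    p_lo, p_hi = 0, latency - 2
    if "P" in decisions:
        p_lo = p_hi = decisions["P"]
    q_lo, q_hi = p_lo + 2, latency
    if "Q" in decisions:
        q_lo = q_hi = decisions["Q"]
    p_hi = min(p_hi, q_hi - 2)
    if p_lo > p_hi or q_lo > q_hi:
        return None
    if p_lo == p_hi and q_lo == q_hi and (q_lo - p_lo) % ii == 0:
        return None  # P and Q would share a potential
    return {"P": (p_lo, p_hi), "Q": (q_lo, q_hi)}

RESULT = schedule(["P", "Q"], toy_analyze)
```

With this weak analyzer the greedy choice of Q at clock cycle 2 is rejected only after it has been made, so the driver backtracks and tries cycle 3. A stronger analyzer would have removed cycle 2 from Q's window beforehand, which is the effect the rules of Section 4 aim for.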
Results
Three experiments are reported in this section. In two cases, our approach (Constrained Analysis and Scheduling, or CAS) is compared to the approach of [6] using a bipartite schedule graph (BSG). The first experiment shows the effect of loop folding. The second experiment illustrates the effect of register binding. The last experiment, an example taken from an actual design, is used to show the run time in a practical application. With respect to the resource constraints, only rules 1 and 2 have been used. In the last two experiments, some rules for considering register conflicts are used as well. The quality of the analysis is measured by the reduction in schedule freedom.
The first example is a radix-2 butterfly, shown in Figure 9. Because the multiplication is a multicycle operation, it is split in two stages m1 and m2, making a total of 10 operations. In Table 1 the result of the analysis is shown, expressed in the average schedule freedom per operation in clock cycles. For the non-folded schedule, the reduction in schedule freedom is exactly the same for BSG and CAS. However, folding the schedule increases the reduction for the BSG analysis only slightly, whereas the CAS analysis shows that only one operation (r0) has any schedule freedom left. For this example there is no gain in accuracy by combining the results of BSG analysis and CAS analysis. The run times for both BSG and CAS are negligible. The execution intervals for each operation are given in Figure 10 for the folded schedule. In this figure, the time is represented in clock cycles on the horizontal axis. The operations are enumerated vertically. The white area represents the reductions obtained by both BSG and CAS analysis. For example, the execution interval of operation r0, based on ASAP and ALAP, is [1, 5]. BSG and CAS are able to reduce this interval to [3, 4]. The grey area is the reduction obtained by CAS that BSG was not able to find.
Notice in this figure how reduction techniques such as CAS and BSG prevent a greedy scheduler from making a wrong schedule decision. A greedy scheduler would schedule operation r0 in clock cycle 1, leaving no room for w1 in clock cycle 7 (potential 1), which is the only feasible start time for w1. Both CAS and BSG detect that r0 cannot be scheduled in clock cycle 1. However, BSG is unable to detect that operation + must precede operation −. A greedy scheduler can easily go wrong by letting operation − precede operation +.
The second experiment concerns an IIR filter of 23 operations, including fetching the coefficients and data from memory. The minimum latency is 10 clock cycles, which equals the latency constraint.
[Table 2: Average freedom for IIR filter]
In Table 2 the result of the analysis is shown. BSG analysis is capable of deducing some reductions CAS cannot find, and vice versa. As can be seen in Table 2, combining the analyses provides larger reductions than both analyses separately. In the sixth column, the remaining schedule freedom is shown when CAS analyzes the first schedule decision a greedy scheduler could make (operation 23 at clock cycle 0). It is clear from the numbers in the fifth column that substantial reductions can be made not only before scheduling, but also during scheduling. This observation strengthens the idea of an interaction between the analyzer and the scheduler. The last column shows the reduction in schedule freedom when rules are applied in CAS that consider register conflicts. These register conflicts are the result of a register binding. Using our approach, we were able to prove that we used the least possible number of registers in order to satisfy the timing constraints. It is clear from Table 2 that the best reduction is obtained by considering register conflicts as well.
[Table 3: Average schedule freedom for FFT loops, CAS only]
The third experiment includes only CAS, and concerns two loops of FFT algorithms. The first loop (FFT) contains 40 operations, has a minimum latency of 13 clock cycles, and needs to be folded at least three times to realize an initiation interval of only 4 clock cycles. The second loop (Radix) contains 80 operations, has a minimum latency of 11 clock cycles, and is folded twice to realize an initiation interval of 4 clock cycles. In Table 3 the results are shown. The reduction obtained on FFT by taking into account the register conflicts is considerably more than in the second experiment. The run time of this experiment, using the most extensive analysis, is less than a second on an HP 9000/735.
Conclusions
In this paper, we presented an approach for code generation based on the analysis of precedence, timing and resource constraints. We also hinted at the way register conflicts are considered in CAS. By making all constraints explicit in a graph model and calculating the longest paths, we are able to see the interaction between the different constraints, and compute the effect on the schedule freedom available to a scheduler. The results in Section 5 show that the reduction in schedule freedom obtained by CAS is substantial. With loop folding we are able to obtain more accurate reductions than the BSG approach in [6]. Furthermore, in CAS it is easy to incorporate rules, such as shown in Section 4, that take register conflicts into account. BSG is not able to handle this type of conflict. We also showed that the obtained reductions really prevent a greedy scheduler from making a wrong decision. We have illustrated how the interaction with a scheduler works by showing the reduction in schedule freedom CAS can make as a result of a schedule decision. We conclude that analysis tools such as CAS are needed in order to obtain a feasible schedule when facing resource constraints, register constraints, and tight timing constraints.
