ABSTRACT
This paper presents an efficient encoding and automaton construction which improves performance of automata-based scheduling techniques. The encoding preserves knowledge of what operations occurred previously but excludes when they occurred, allowing greater sharing among scheduling traces. The technique inherits all of the features of BDD-based control dominated scheduling including systematic speculation.
Without conventional pruning, all schedules for several large samples are quickly constructed.
Keywords
High-level synthesis, scheduling, BDD, automata
2. INTRODUCTION
The scheduling problem occurs across diverse areas of application from networking to manufacturing to high-level synthesis of digital systems (HLS). Scheduling, which assigns operations to time-slots in a synchronous system subject to data and control-flow dependencies as well as resource constraints, is a key component of many HLS systems. Consequently, solving this problem efficiently is a direct way to enhance the abilities of such systems. Most solutions to the scheduling problem fall into two categories: i) heuristics and ii) integer linear programming (ILP). Heuristic schedulers (e.g., [1][9]) find good solutions for large problems quickly but suffer with tightly constrained problems where early pruning decisions exclude candidates leading to superior solutions. ILP schedulers (e.g., [3][4]) exactly solve scheduling but have difficulties with time complexity and control constraint formulation.
Heuristic and ILP scheduling methods produce a single schedule at a time. Finding this schedule, especially an optimal one, often becomes increasingly difficult as more constraints are added to the problem formulation. Symbolic methods (e.g., [2][6][7][8][10]) are often effective in finding exact solutions in highly constrained problem formulations. Furthermore, since all solutions are enumerated, post-process pruning can be used to apply additional constraints which may not have an efficient formulation for general schedules. Further, symbolic methods allow much more efficient formulation of control dependencies and environmental timing constraints. However, with symbolic methods, the key to success is to reduce the representation size of the solution sets. Methods to accomplish this include adding additional constraints to the problem, pruning suboptimal candidates early in the search, and efficient encoding techniques.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICCAD98, San Jose, CA, USA. © 1998 ACM 1-58113-008-2/98/0011...$5.00
An exact symbolic scheduling technique was presented in [7][8]. This method uses ROBDDs to describe scheduling constraints and compress solution sets. In this formulation, each operation in a CDFG (Control Data Flow Graph) is assigned a Boolean variable for each time-step in the schedule. This variable indicates whether or not the operation is scheduled during that time-step. Constraints, derived from the CDFG and environment, are added to the construction. Guard variables are employed to distinguish control paths. Although this technique performed well, complexity problems arose for lengthy schedules. Worse, since every exact history for all viable traces is kept, the encoding efficiency declines for schedules with many complex alternative histories.
Symbolic ROBDD automata-based schedulers were described in [2][6][10]. In [6], an exact operand scheduling technique is presented for predefined datapaths. This technique allows operands to be lost and later produced again in order to find optimal schedules meeting tight memory constraints. In [2][10], system timing and synchronization requirements are encapsulated in finite-state machine (FSM) descriptions. All constraints are formulated as automata, and product machines are built and traversed. This product machine becomes prohibitively large for practical sized problems. Furthermore, causality (as checked for by causal validation in Section 4.2) is not confirmed.
In this paper we present an exact symbolic automata-based scheduler. Our immediate innovation is an efficient encoding and automaton construction which improves performance of exact symbolic scheduling techniques. Fundamentally, this technique groups together schedules with common, although not necessarily identical, histories when exploring the schedule solution space. This is accomplished using an encoding which only preserves whether or not an operation has been scheduled but not precisely when. In this way, we minimize the problems of [7] for long schedules and of [2][10] with regards to automata size.
We pursue an automata-based representation since it provides clear potential [2][10] for describing control and protocol-intensive cycle-varying systems. This ability is a key part of our long-term HLS goal. Further, we utilize exact symbolic ROBDD techniques because of their demonstrated success in finding all optimal constrained solutions [7][8]. We wish to apply these techniques to constrained critical portions of large scale system designs. Such systems are filled with complex subsystem interactions which may be amenable to Boolean automata representation.
CDFG BOOLEAN FORMULATION
We define a CDFG as a directed graph where nodes denote operations, forks or joins and arcs represent dependencies. Fig. 1 shows a simple CDFG. In this example, the directed arcs from operation 1 to the fork and join denote a control dependency. Consequently, the resolution of the fork and join remains unknown until after operation 1 has been scheduled. It is important to note that operations 3 and 4 can be speculatively executed before the fork is resolved. In this event, operations 3 and 4 must both be scheduled regardless of the control resolution. The right-hand side of Fig. 1 shows this speculative transformation. Finally, the directed arc from operation 1 to operation 2 represents a data dependency. Hence operation 2 can only be scheduled after operation 1. In our formulation, data and control constraints are extracted from a user supplied acyclic CDFG. These constraints along with user supplied resource restrictions are used to construct a BDD-based Boolean relation. Implicit state-traversal techniques are used to determine a valid schedule.
Encoding
Each operation $j$ in the CDFG (excluding forks and joins) is encoded with exactly two Boolean variables ($P_j$, $N_j$). Control values to determine which side of a fork or join is used are produced by specific operations in the CDFG. These operations are encoded as described previously but have an additional pair of variables ($P_{Gj}$, $N_{Gj}$) associated with them. These guard variables indicate whether the produced control value is true or false.
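The effect of this encoding on trace sharing can be illustrated with a small sketch. The Python below is not the ROBDD implementation; the helper names `history_free_state` and `time_indexed_state` are hypothetical, and the time-indexed form only approximates the per-time-step variables of [7]. It shows two traces that schedule the same operations in different cycles collapsing to one state when only "scheduled or not" is recorded, while remaining distinct when timing is recorded.

```python
def history_free_state(trace, ops):
    """One bit per operation: has it been scheduled yet (not when)."""
    done = frozenset().union(*trace)
    return tuple(int(op in done) for op in ops)

def time_indexed_state(trace, ops):
    """One bit per (time-step, operation) pair, in the spirit of [7]."""
    return tuple(int(op in step) for step in trace for op in ops)

ops = ["a", "b", "c"]
trace_1 = [{"a"}, {"b"}]   # a in the first cycle, b in the second
trace_2 = [{"b"}, {"a"}]   # b in the first cycle, a in the second

# Both traces reach the same history-free state ...
shared = history_free_state(trace_1, ops) == history_free_state(trace_2, ops)
# ... but stay distinct when the cycle of each operation is recorded.
distinct = time_indexed_state(trace_1, ops) != time_indexed_state(trace_2, ops)
```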
Constraints
Six constraints, described below, are identified from the CDFG or supplied by the user and constructed as ROBDDs. These constraints are constructed in the complemented sense: each constraint describes situations which would not exist in a valid schedule given the encoding of Section 3.1. Once constructed, the product of the complement of all six constraints forms the desired scheduling transition relation.
Dependency Constraints
Dependency constraints impose an ordering of operation execution. To illustrate, if operation 2 requires a result produced by operation 1, it can only be scheduled after operation 1. In this case, it would be illegal to have any minterm in the relation containing $\overline{P_1} N_2$. Furthermore, an operation may have more than one dependency. In this case, all dependencies must be resolved before this operation can potentially be scheduled. In general, illegal dependency minterms are enumerated by,
$$\sum_{i \to j} \overline{P_i}\, N_j$$
where $i \to j$ is a dependency arc in the CDFG.
A dependency arc in a CDFG may pass through one or more joins. This dependency only applies if the control is unresolved or the control has resolved in its favor. In general, illegal dependencies through joins are,
where a is the set of all operations producing control values for joins through which the arc $i \to j$ passes. Furthermore, the value of the guard $P_{Gk}$ is assigned the complement value of the join resolution for arc $i \to j$. For example, if $i \to j$ is a dependency arc passing through the true side of a join resolved by operation $k$, then $P_{Gk}$ must be false.
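As a rough illustration of the plain (join-free) dependency constraint, the sketch below checks explicit present/next assignments rather than building an ROBDD. It assumes the illegal pattern for an arc from i to j is "i not yet scheduled in the present state while j is scheduled in the next state"; the function name `violates_dependency` and the dictionary representation are hypothetical, and guards on arcs through joins are not modeled.

```python
def violates_dependency(present, nxt, arcs):
    """True if some arc i -> j has i unscheduled in the present state
    while j is marked scheduled in the next state (assumed illegal pattern)."""
    return any(present[i] == 0 and nxt[j] == 1 for i, j in arcs)

arcs = [(1, 2)]  # operation 2 needs the result of operation 1

# Operation 1 already scheduled, operation 2 scheduled now: allowed.
ok_case = violates_dependency({1: 1, 2: 0}, {1: 1, 2: 1}, arcs)
# Operations 1 and 2 scheduled in the same cycle: dependency violated.
bad_case = violates_dependency({1: 0, 2: 0}, {1: 1, 2: 1}, arcs)
```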
Resource Constraints
In practical digital system designs, there are only a fixed number of function unit resources available to perform a given task. Consequently, only a limited number of operations requiring a given resource may be scheduled in any cycle. For example, if only one ALU is available and operations 1, 2 and 3 each require an ALU, it would be illegal to schedule any two or more of these operations in a single cycle.
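The ALU example can be checked on explicit assignments with a short sketch. This is only an illustration under stated assumptions: an operation is taken to occupy its unit in the cycle where its present bit is 0 and its next bit is 1 (single-cycle operations), and the names `violates_resources`, `unit_of` and `limits` are hypothetical.

```python
from collections import Counter

def violates_resources(present, nxt, unit_of, limits):
    """True if more operations start in this cycle than units are available."""
    # Operations being scheduled in this cycle: not done before, done after.
    starting = [op for op in nxt if nxt[op] == 1 and present[op] == 0]
    used = Counter(unit_of[op] for op in starting)
    return any(used[unit] > limits.get(unit, 0) for unit in used)

unit_of = {1: "ALU", 2: "ALU", 3: "ALU"}
limits = {"ALU": 1}
idle = {1: 0, 2: 0, 3: 0}

one_op = violates_resources(idle, {1: 1, 2: 0, 3: 0}, unit_of, limits)
two_ops = violates_resources(idle, {1: 1, 2: 1, 3: 0}, unit_of, limits)
```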
History Constraints
As an initial simplification, we require that once an operation has been scheduled, its result will always be available in the future. This excludes encodings shown in row 3 of 
Exclusion Constraints
If a control value has been resolved, it is unnecessary to schedule any operations unreachable under the resolved control for a particular trace. For instance, in Fig. 4, if operation 1 has been scheduled, then it is no longer necessary to schedule operations 2 or 4 in any trace with a false control resolution. Alternatively, any trace with a true control resolution must still schedule operations 2 and 4 but not operation 3. It is important to note that to support speculative execution, we must not exclude operation 2 or 4 entirely from any trace with a false control resolution but only exclude them from being scheduled in any false control resolution trace that hasn't yet scheduled them. In general, illegal minterms are,
3.2.6 Immediacy Constraints
It is desirable to have a constraint which implies an operation must be scheduled in the immediate cycle after another operation. This allows for multicycle and pipelined units.

Let $S_0(v)$ be assigned the state with no knowledge of any scheduled operation (all $P_i = 0$). The set of reachable states on the $j$th iteration of the clock may be determined from this starting point by iteratively computing,
$$S_j(v') = \exists v\,[\,S_{j-1}(v) \wedge \Delta(v, v')\,]$$
where $\Delta(v, v')$ denotes the scheduling transition relation.
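The iterative computation of reachable state sets can be mimicked with explicit Python sets in place of ROBDDs. The sketch below is a simplification under stated assumptions: states are the sets of operations already scheduled, all operations share one resource type with `limit` units, idle cycles are allowed, and guards, exclusion and immediacy constraints are ignored. The names `successors` and `reachable_sets` are hypothetical.

```python
from itertools import combinations

def successors(state, ops, deps, limit):
    """All next states reachable from `state` in one cycle."""
    ready = [op for op in ops
             if op not in state and all(p in state for p in deps.get(op, ()))]
    result = set()
    for k in range(limit + 1):              # schedule 0..limit ready operations
        for chosen in combinations(ready, k):
            result.add(state | frozenset(chosen))
    return result

def reachable_sets(ops, deps, limit, steps):
    """Return [S_0, S_1, ..., S_steps], each a set of states."""
    layer = {frozenset()}                   # S_0: nothing scheduled
    layers = [layer]
    for _ in range(steps):
        layer = set().union(*(successors(s, ops, deps, limit) for s in layer))
        layers.append(layer)
    return layers

ops = ["1", "2", "3"]
deps = {"2": ["1"]}                         # operation 2 depends on operation 1
layers = reachable_sets(ops, deps, limit=1, steps=3)
sizes = [len(layer) for layer in layers]    # number of distinct states per step
```

Because states record only which operations have been scheduled, traces that differ only in ordering merge into the same state, which is why the per-step sets stay small in this example.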
Completeness
Completeness is achieved when at some cycle $j$, there is a set $T \subseteq S_j(v)$ such that each $t \in T$ has scheduled the termination operation. (A termination operation, $P_f$, which depends on all paths exiting the CDFG, is added for convenience.) Furthermore, all CDFG-imposed control paths must be scheduled by at least one $t \in T$. Although completeness is necessary for the existence of an ensemble schedule, it is not sufficient. A complete ensemble must contain a set of traces which are both complete and form a causal (deterministic) schedule. Note that given resource limits, even a complete set of traces on some cycle does not guarantee that any ensemble schedule is causal and can terminate on that cycle.
Causal Validation
Trace validation ensures that each trace is part of some ensemble schedule. Consider the CDFG of Fig. 5 with a one adder (solid circle) constraint. It is possible to hoist the addition operation 2 past the true fork and schedule it and operation 1 in one cycle. Likewise, another trace could hoist the false add (operation 3) past the fork and schedule it and operation 1 in one cycle. Together at cycle $j = 1$, both these traces form a set $T$ satisfying the conditions for completeness. It should be clear that this resulting ensemble schedule is not causal since two additions cannot be scheduled speculatively in the same cycle given a single adder constraint. Unfortunately, after removal of such traces, ensemble scheduling sets may no longer be complete. Thus, validation must continue until a fixed point is reached and all traces belong to some valid ensemble schedule. An ensemble is trace validated using the algorithm of Fig. 6. This algorithm explicitly describes trace validation only progressing backward through time although there is an additional symmetric portion for forward validation. Intuitively, this algorithm ensures that at each time-step, there is a modified transition relation which allows condition producing operations to be scheduled if and only if there are transitions with matching common histories for both true and false resolutions of the condition. Thus to be causal, operations speculatively executed assuming a true outcome must also have been speculatively executed assuming a false outcome. Forward and backward validation is performed on the entire ensemble set until a complete pass with no pruning of that time-step's transition relation occurs. This algorithm originated in [7] and is shown here modified for an automata formulation.
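The matching-history condition can be illustrated on the one-adder example with a heavily simplified sketch. This is not the algorithm of Fig. 6: it performs a single pruning pass on one time-step, uses explicit tuples instead of ROBDDs, and omits the forward/backward fixed-point iteration. The tuple layout (present state, operations scheduled this cycle, guard value) and the name `validate_resolving_transitions` are hypothetical.

```python
from collections import defaultdict

def validate_resolving_transitions(transitions):
    """Keep a condition-resolving transition only if a transition with the
    opposite guard value shares its present state and scheduled operations."""
    guards_seen = defaultdict(set)
    for present, scheduled, guard in transitions:
        guards_seen[(present, scheduled)].add(guard)
    return [t for t in transitions
            if guards_seen[(t[0], t[1])] == {True, False}]

start = frozenset()
transitions = [
    (start, frozenset({1, 2}), True),    # hoist true-side add with operation 1
    (start, frozenset({1, 3}), False),   # hoist false-side add with operation 1
    (start, frozenset({1}), True),       # resolve condition only
    (start, frozenset({1}), False),
]
valid = validate_resolving_transitions(transitions)
```

In this toy case the two speculative transitions are pruned because neither has a partner with the opposite resolution and the same scheduled operations, leaving only the pair that schedules operation 1 alone.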
Scheduling Instances
A complete and validated ensemble implicitly contains every schedule of cycle length $j$. It is possible to greedily pick a single schedule from this set. Beginning with $S_0(v)$, we choose a state at random. This present state maps to next states in $S_1(v)$ via a validated transition relation for that time-step. We greedily pick a valid next state with the maximum number of $P_i = 1$, implying peak utility. This process continues until the termination operation has been scheduled at time-step $j$. If the picked present state-next state mapping implies that a condition is resolved, then two traces, one for a true resolution and one for a false resolution, must be propagated forward from this point. Trace validation ensures that there will always be two such traces with opposite resolution to choose when a condition is resolved.
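For a pure DFG (a single trace, no conditions), greedy selection from a validated ensemble can be sketched with explicit sets. The code below is an illustration under stated assumptions: one resource type with `limit` units, idle cycles allowed, backward pruning standing in for validation, and ties broken deterministically rather than at random. The names `successors` and `greedy_schedule` are hypothetical.

```python
from itertools import combinations

def successors(state, ops, deps, limit):
    """All next states reachable from `state` in one cycle."""
    ready = [op for op in ops
             if op not in state and all(p in state for p in deps.get(op, ()))]
    return {state | frozenset(c)
            for k in range(limit + 1) for c in combinations(ready, k)}

def greedy_schedule(ops, deps, limit):
    goal = frozenset(ops)
    # Forward pass: build state sets until every operation can be scheduled.
    layers = [{frozenset()}]
    while goal not in layers[-1]:
        layers.append(set().union(
            *(successors(s, ops, deps, limit) for s in layers[-1])))
    # Backward pass: keep only states that lie on some path to the goal.
    keep = [set() for _ in layers]
    keep[-1] = {goal}
    for t in range(len(layers) - 2, -1, -1):
        keep[t] = {s for s in layers[t]
                   if successors(s, ops, deps, limit) & keep[t + 1]}
    # Greedy pass: at each step take the kept next state with most work done.
    state, schedule = frozenset(), []
    for t in range(1, len(layers)):
        candidates = successors(state, ops, deps, limit) & keep[t]
        nxt = max(candidates, key=lambda s: (len(s), sorted(s)))
        schedule.append(sorted(nxt - state))
        state = nxt
    return schedule

ops = ["1", "2", "3"]
deps = {"2": ["1"]}
schedule = greedy_schedule(ops, deps, limit=1)   # one list of operations per cycle
```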
Greedy schedule selection is not the only possible selection method. Since a complete and validated ensemble implicitly contains every schedule of cycle length $j$, it is possible to pick a schedule that better suits the designer's needs. For example, schedule selection methods which simplify control or minimize power could be applied at this point. Furthermore, there is no need to stop at cycle $j$ once completeness and trace validation have been verified. Additional $S_j$ sets may be added to the ensemble without the need to recheck completeness and trace validation. In this way, a schedule of specific length and exact construction can be found to suit a designer's requirements.
Schedules with Cycle-Length Constraints
The scheduling technique described in Section 4.3 uses no cycle-length constraint. If a cycle-length constraint is added, then ALAP bounds can be applied. These ALAP constraints prune the number of traces in the $S_j(v)$ sets near the end by removing traces failing ALAP bounds. Furthermore, since traces have been pruned, it is possible to apply trace validation early for additional pruning.
Several other kinds of constraints including heuristic constraints are also applicable. However, the goal is to keep the smallest ROBDD sizes for the problem. With the various constraints experimentally tried, including ALAP, it turns out that since the initial encoding is relatively efficient, such pruning often reduces the efficiency of the scheduler by increasing the complexity of the representation even though the number of traces is reduced. Future work will be needed to sensibly apply appropriate constraints.
5. RESULTS
A tool was developed to demonstrate the feasibility of our scheduling technique. This tool utilized an in-house BDD package [5] and was run on a 141MHz SPARC Ultra with 416MB of memory.
Results are described for several DFGs and CDFGs found in the literature. In all cases we are only applying the constraints described in Section 3.2; no additional pruning strategies, heuristic or otherwise, have been included. Furthermore, no prior knowledge of the cycle-length is assumed and hence no ALAP bounds are applied. In most cases, the efficiency of our encoding alone allows us to outperform similar symbolic techniques. All times are reported in seconds with the lower time indicating runtime without BDD ordering (preordered) and the higher time indicating runtime with BDD ordering (sifting). These times are inclusive of our entire scheduling process. (Constraint construction and other problem setup costs are not left out.) All operations are single cycle except for multipliers which are two-cycle and in some cases two-cycle pipelined.
DFG Results
The elliptic wave filter (EWF) and fast discrete cosine transform (FDCT) are widely accepted DFG benchmarks. Table 2 presents our results for various configurations of these benchmarks. EWF-1 is the standard 34 operation single iteration of the elliptic wave filter. EWF-3 is three and EWF-6 is six iterations of the elliptic wave filter unrolled. Here the efficiency of our encoding becomes apparent. Even though there are now as many as 204 operations and schedule lengths of up to 104 cycles, we are still able to produce all exact solutions to this problem in reasonable time. The small cycle-length gain achieved by loop unrolling suggests that this benchmark is tightly constrained with many scheduling traces sharing common histories and hence well suited for our encoding. A more challenging case is EWF-2x2 (136 operations) with two copies of the elliptic wave filter in parallel each unrolled twice. FDCT-1 (42 operations) is also a formidable benchmark with its inherent parallelism. FDCT-1x2 (84 operations) adds an even higher degree of parallelism by requiring two copies of FDCT be scheduled under the same resource constraints.
Our results compared to other symbolic techniques [10][8] show a roughly 100x speedup for frequently reported benchmarks EWF-1 and FDCT-1. Although [8] reports some exact solutions for EWF-3, we are unaware of any reports for exact solutions to EWF-6, EWF-2x2 or FDCT-1x2. Our formulation performs surprisingly well on DFGs. We attribute this to the fact that no validation step must be done since a DFG schedule consists of a single execution trace. Furthermore, these benchmarks indicate that our method works better under highly resource constrained situations which limit the possible combinations of common histories for scheduling traces. For example, FDCT-1 with a 1 ALU and 1 multiplier constraint finds a result in less time than when run with a 1 ALU and 2 multiplier constraint even though the final schedule is longer.
CDFG Results
Table 3 shows results for commonly referenced CDFGs KIM (24 operations, 2 conditions) and MAHA (18 operations, 6 conditions). All schedules for these benchmarks are found in just a few seconds. More challenging CDFGs, ROTOR and S2R from [7], are shown in [7]. Our formulation shows the best improvement in cases with longer schedules. For example, the 12 cycle ROTOR result is achieved approximately 10 times faster (on identical computers) with our formulation.