Abstract: This paper presents an exact technique for scheduling looping data-flow graphs that implicitly supports functional pipelining and loop winding. Automata-based symbolic modeling provides efficient representation of all causal executions of a given behavioral description subject to finite state bounds. Since a complete set of scheduling solutions is found, further incremental refinements, such as sequential interface protocol constraints, can be easily accommodated. Efficiency in the implementation is maintained by careful formulation of the automata and by judicious exploration techniques. Results are presented for traditionally referenced benchmarks, several large synthetic benchmarks, and a practical industrial example.
Despite the efforts of many talented researchers in behavioral synthesis, the great majority of current industrial designs are synthesized from the register transfer level of description. The RT-level differs from the behavioral level of abstraction primarily by the imposition of a clocked, timed sequence of activities, or schedule. Thus, descriptions at the RT-level are cycle specific, i.e., it is possible to simulate the description and know exactly what operations and data transfers occur on each cycle. Theoretically, a behavioral specification only describes the desired behavior and required sequencing constraints and therefore greatly enhances the space of possible designs which could be synthesized from a specification. (We distinguish timing (implying absolute time) from sequencing (implying cycle-accurate ordering).) Further, since the designer is freed from the responsibility of describing exactly what activities are simultaneous, his description should be simpler to create, enhancing his productivity. In practice, however, it is exceedingly difficult to efficiently make use of this extra design freedom. We believe that this is caused by several fundamental problems with the conventional behavioral model:
• Practical sequencing constraints (required to meet IP module and interface protocols) are difficult to specify and tend to dominate the control structure of the behavioral description.
• Current scheduling techniques are unable to fully explore all sequencing possibilities available in a design given required sequencing and communication protocols.
• The behavioral model of operators is too simplistic to capture necessary timing and sequence constraints in the hardware realization.

The effect of these problems is that a behavioral description can be very inaccurate, so much so that the results of the optimizations are unusable. Conversely, once sufficient structure has been added to the description to capture the requirements, the synthesis tool is overconstrained by incidental dependencies which are artifacts of the specification process. It thus produces a design with lower than desired performance. Last, even if the description were ideal, conventional scheduling techniques cannot adequately explore the set of possible implementations, which again leads to inefficient design.
To make these ideas more concrete, consider a behavioral specification of a conventional microprocessor. Although construction of a program which has the same functional behavior is relatively straightforward, a practical implementation needs to efficiently communicate with a memory hierarchy and may have rather stringent sequencing requirements for interrupts, traps, exceptions, and bus handling. To a large extent, the performance of a microprocessor is fundamentally dependent on how well it makes use of the inherent sequencing of its pipeline, memory, and I/O interfaces. With large system-on-chip (SOC) designs, such sequencing issues have moved inside the chip boundary with many modern designs making use of internal packet busses between subsystems to enhance the overall performance of the design.
Overview
In this paper, we describe an efficient systematic technique to schedule a looping data-flow graph (DFG) meeting sequencing constraints defined by a set of local nondeterministic finite automata (NFA). The DFG describes the set of operand dependencies and causal requirements for the design, while sequencing constraints specific to the operations, storage facilities, and I/O protocols are described as sets of NFA. Given these inputs, we iteratively refine a composite NFA model which implicitly encodes every causal sequence of control states not violating the dependencies or sequencing constraints. This is represented and manipulated using Boolean symbolic automata techniques borrowed heavily from formal verification and symbolic model checking. Subsequently, an automata-theoretic pruning of the design sequences is performed to remove all sequences not meeting performance, resource, utility, or other optimization criteria. By use of a careful construction process, these steps can be carried out exactly for designs of practical scale with the result encoding all optimal solutions. This model has the practical advantage that any sequencing constraint which can be efficiently described as an implicit NFA model is applicable. Since such representations are at the heart of current formal verification techniques, one can have reasonable expectation that the particular constraints of a given synthesis problem can also be described.
Resource-constrained scheduling is an intractable problem. For any exact technique, time and/or space requirements quickly become issues. In our results, Section 5, we show that, even though symbolic scheduling is exact, it readily competes with or outperforms other techniques in terms of problem size, computational resources, and/or solution quality. In its current formulation, symbolic scheduling is particularly useful in very highly constrained problems. Still, symbolic scheduling does not overcome the NP-complete nature of resource-constrained scheduling. Fortunately, symbolic scheduling's local NFA models and compositional construction are amenable to partitioning, abstraction, and hierarchy. Although not the focus of this paper, these traits potentially enable symbolic scheduling applications in extremely large designs.
Although required in many practical designs, control-dependent scheduling requires several additional specialized techniques [7], [9] and is not addressed in this paper for reasons of brevity and simplicity. We discuss relevant previous work in Section 1.2. Section 2 formalizes the terminology and defines the particular scheduling problem to be solved. Section 3 is the heart of the paper and describes a general construction for modeling a scheduling problem as a nondeterministic finite automaton. Section 4 shows a minimum iteration latency/control step application of the automaton built in Section 3. Finally, Section 5 describes results for traditionally referenced benchmarks, several large synthetic benchmarks, and a practical industrial example.
Related Work
Scheduling is a well-studied problem with a rich literature of previous work. Heuristic scheduling techniques are by far the most common. Pioneering work in DFG loop scheduling and pipelining was done by Girczyc [5], Paulin and Knight [17], and Goossens et al. [6]. In general, heuristic techniques produce good results quickly for well-studied and finely tuned examples, but have yet to gain wide use and acceptance in practical design settings. Two notable recent heuristic scheduling techniques are by Lakshminarayana et al. [13] and Dos Santos et al. [3] and focus primarily on control-dependent scheduling. Lakshminarayana et al.'s technique, which handles looping behaviors, uses an explicit breadth-first elaboration of the available operations on each time-step that is similar to the implicit FSM exploration used here. Dos Santos et al.'s heuristic is guided by code-motion pruning and includes some forms of speculation, but does not handle looping behavior. Neither of these works directly addresses or presents results for DFG loop pipelining, and both are hence difficult to compare with this work.
The technique presented in this paper is more directly compared to noteworthy heuristics for loop scheduling without control [1], [14], [21], [23]. Chao et al.'s rotation scheduling [1] uses a series of transformations to perform DFG loop pipelining with function-unit constraints but ignores other practical design constraints. Lee et al. [14], Sanchez and Cortadella [21], and Wang and Parhi [23] all require an initial prespecified iteration latency and then adjust for function-unit constraints. Lee et al. employ ASAP scheduling and then resolve resource constraint violations. Sanchez and Cortadella compute the minimum initiation interval of the loop and then iteratively retime, schedule, and adjust resources. Wang and Parhi use a novel cycle-finding and covering scheme. Although Sanchez and Cortadella do include register usage, these last three techniques also fail to address how other practical design constraints are incorporated. Although not exhaustive, these four references represent the best heuristic scheduling techniques and results (quality, quantity, and application) we found in the literature for DFG loop scheduling and pipelining which do not rely on restructuring of the scheduling problem.
Exact techniques for scheduling date from Dijkstra, with the best-known based on integer linear programming (ILP) [4]. Although shown to be relatively efficient, ILP techniques do not readily generalize to control-based scheduling and have formulation difficulties with sequential constraints other than pure delays. BDD-based exact scheduling was pioneered by Radivojevic [20]. His work includes loop DFG scheduling, but requires a prespecified iteration latency. Further, his work is based on a simple nonsequential operator model which cannot be generalized within the scheduling framework. Takach and Wolf [22] introduced the notion of Behavioral FSM scheduling. However, they have substantial difficulty constraining resources for general BFSM models. Automata-based symbolic scheduling was introduced by Coelho and De Micheli [2] and Yang et al. [24] in subsequent publications. These formulations do not address loops in a generalized way. Monahan and Brewer [16] proposed an automata-based scheduler for predefined data-paths subject to limited memory constraints. Their work also does not address looping behavior.
The primary contribution of this work is a new automata-based symbolic formulation and exploration of the looping scheduling problem without control. Although previous work is quite mature in regard to function-unit constrained loop DFG scheduling, an advantage of this technique is that sequential constraints (IO protocols, dependencies, etc.) and concurrency constraints (limited function units, interconnect, registers, etc.) are generalized. Exact schedules satisfying all constraints in concert, as is required in practical designs, are determined. Finally, this paper also lays a foundation upon which control-dependent loop scheduling is built [9].
MOTIVATION
In this section, we first present a simple scheduling example, described both functionally and graphically. This is a working example used throughout the entire paper. Next, we show how all valid execution sequences (given problem and model-imposed constraints) for the working example may be encapsulated by a nondeterministic finite automaton, NFA. Finally, we summarize the NFA model construction and exploration process.
Working Example
Algorithm 1 is a functional description of the working example. For each iteration of the loop, the subsystem implementing this behavior reads three input values and writes one result. Furthermore, the result of the multiplication, rv2, is required by an earlier addition and, hence, a data dependency between different iterations of the loop, an interiteration dependency, exists. Consequently, rv2 must be initialized upon entering the loop.
Algorithm 1 Example functional description
    rv2 = 0
    while (TRUE) {
        i0 = read(); i1 = read(); i2 = read();
        rv0 = i0 + i1;      # Operation v0
        rv1 = rv0 + rv2;    # Operation v1
        rv2 = rv1 * i2;     # Operation v2
        write(rv2);
    }

Fig. 1 shows a data-flow representation, DFG, of Algorithm 1. Each vertex represents an operation such as an add or multiply. Each directed edge represents a data dependency or operand communication. Read and write operations and associated data dependencies are not represented in this simple DFG. Instead, input and output operands are assumed to be unbuffered. Hence, external read and write operations occur when input and output operands are required or produced by internal function units. (A more general IO model is developed in Section 3.1.) Finally, the reverse edge from v2 to v1 represents the data dependency between successive loop iterations. (All operations or DFG vertices must be executed once per loop iteration in the class of problems presented in this paper. Behaviors which require some operations to execute only once, as in a precomputation of a coefficient, can be modeled by combining cyclic models from this paper with acyclic models from an earlier paper [7].)
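For readers who want to experiment with the working example, Algorithm 1 can be transcribed into Python as a minimal sketch. The function name `run_loop` and the use of a finite list of input triples in place of `read()` and the endless loop are assumptions made here for illustration only.

```python
def run_loop(inputs):
    """Execute the loop body of Algorithm 1 once per (i0, i1, i2) triple.

    Returns the list of values that would be passed to write().
    """
    rv2 = 0                      # initialized before entering the loop
    written = []
    for i0, i1, i2 in inputs:
        rv0 = i0 + i1            # operation v0
        rv1 = rv0 + rv2          # operation v1, uses rv2 of the previous iteration
        rv2 = rv1 * i2           # operation v2
        written.append(rv2)
    return written


results = run_loop([(1, 2, 3), (1, 1, 1)])
```

The second iteration reads the `rv2` produced by the first, which is the interiteration dependency drawn as the reverse edge from v2 to v1.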
Correctly scheduling this DFG requires assigning each operation to a time-step while observing several criteria. First, all data dependencies must be observed. Second, resource bounds, such as one available adder or maximum four simultaneous external data transfers, must be adhered to. Finally, a scheduling objective, such as minimize iteration latency, typically guides schedule selection.
If one single time-step adder, one single time-step multiplier, and four simultaneous external data transfers are allowed, then the example's only minimum iteration latency schedule is shown in Fig. 2. Although the delay, or required time-steps for a single loop iteration, is three, the iteration latency, or time-steps between successive loop iterations, is only two. (Minimizing iteration latency in this case is equivalent to maximizing throughput.) This loop winding is possible because operations from successive iterations are allowed to overlap, as seen with v2 and v0.
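The claim that an iteration latency of two is feasible while one is not can be checked with a conventional modulo-scheduling test. This is a hedged sketch and not the symbolic technique developed in this paper: the function name, the start times (read off the description of Fig. 2), and the edge distances are assumptions, and the external transfer bound is not modeled.

```python
# (producer, consumer, distance): distance 1 marks the interiteration edge.
EDGES = [("v0", "v1", 0), ("v1", "v2", 0), ("v2", "v1", 1)]
RESOURCE = {"v0": "adder", "v1": "adder", "v2": "multiplier"}
BOUND = {"adder": 1, "multiplier": 1}
START = {"v0": 0, "v1": 1, "v2": 2}   # one time-step per operation, delay of three


def valid_modulo_schedule(start, latency):
    """Check dependencies and resource bounds when iterations repeat every `latency` steps."""
    # A consumer `dist` iterations later must start after the producer finishes.
    for u, v, dist in EDGES:
        if start[v] + dist * latency < start[u] + 1:
            return False
    # Operations that share a slot modulo the latency execute concurrently.
    for slot in range(latency):
        used = {}
        for op, t in start.items():
            if t % latency == slot:
                used[RESOURCE[op]] = used.get(RESOURCE[op], 0) + 1
        if any(count > BOUND[res] for res, count in used.items()):
            return False
    return True


feasible = {lat: valid_modulo_schedule(START, lat) for lat in (1, 2, 3)}
```

With a latency of one, the interiteration edge from v2 to v1 is violated; with two, v2 shares a time-step with v0 of the next iteration, which is the overlap described above.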
Assigning operations from different loop iterations to the same time-step is necessary for loop pipelining and loop winding. Although loop pipelining and loop winding are conceptually similar, we make the distinction based on the interiteration dependency depths. Some scheduling problems contain constraining data dependencies which only permit minimal loop iteration overlap. Optimal solutions for these cases use loop winding. Other scheduling problems contain considerable interiteration data dependency freedom and permit extensive loop iteration overlap. Optimal solutions for these cases use loop pipelining.
Definition 1.
A scheduling problem is defined as $SP = (V, E, R, P)$. Each vertex $v \in V$ represents an operation producing one result operand per loop iteration. The set of vertices $V$ contains two subsets, $V_{input}, V_{output} \subseteq V$, which identify operations requiring external input operands or producing external output operands, respectively. Each directed edge $(u, v) \in E$ is an operand and represents a data dependency of successor $v$ on the result produced by predecessor $u$. The set of edges $E$ may be partitioned into two disjoint sets, $E_{inter} \cup E_{intra} = E$, where any $e \in E_{inter}$ is a data dependency between successive loop iterations and any $e \in E_{intra}$ is a data dependency within the same loop iteration. Each $(bound, V_r) \in R$ is an ordered pair where the natural number $bound$ is the maximum permitted concurrent uses of class $r$ resources and $V_r \subseteq V$ is the set of all operations requiring a class $r$ resource at some time. $P$ is a set of sequential constraints (i.e., IO protocols) described as NFA. These NFA are formally described in Section 3.1. A solution to the scheduling problem assigns each vertex $v \in V$ to a time-step such that no dependency, resource, or sequential constraints are violated.
Given this definition, a scheduling problem has a potentially infinite number of solutions. Suppose that a scheduling solution takes $n$ time-steps. It is often possible to add a delay at some point in the schedule and thus require $n + 1$ time-steps.

We desire to model schedules as instances of finite state automata. This will require constraints built into the modeling which can be relaxed in a controllable fashion at the cost of representation size. Model construction (described in Section 3.1) describes these constraints and the means for relaxing constraints as appropriate.
Automata-Based Solutions
Symbolic scheduling constructs a composite modeling automaton, a CMA, that encapsulates all solutions for a given scheduling problem, SP, subject to problem and model-imposed constraints. Once a CMA is constructed, exploration techniques are used to find particular schedules meeting some objective function. For the working example, the desired CMA is explicitly shown in Fig. 3. (In practice, a CMA is represented implicitly with Reduced Ordered Binary Decision Diagrams, ROBDDs.) Each edge in this nondeterministic state graph represents a time-step. Operations assigned to a time-step are identified through edge labeling. As discussed in Section 2.1, loop winding and pipelining are possible when operations from successive loop iterations are assigned to the same time-step. The distinction between successive iterations, or iteration sense, is made with the symbol "~". If an operation is labeled with no "~", such as v0, then v0~ represents the same operation in the successive iteration and vice versa. Operations labeled with "~" are referred to as odd iteration operations and those without "~" as even. In Fig. 3, the minimum iteration latency schedule is highlighted with dashed edges. Two iterations (one for each iteration sense) are scheduled in one complete traversal of this cyclic dashed path. Edges denote scheduled activities, while states encode in which sense operands currently exist in the design. As will be seen in Sections 3 and 4, any cyclic path through a CMA which executes all operations is a valid steady-state schedule of the loop. (The reader can verify that the schedule with latency and delay of three also exists in this CMA.)
The bulk of this paper describes how, given a scheduling problem, a corresponding CMA is constructed, refined, and explored. The result will implicitly contain every scheduling solution. Finally, a deterministic witness schedule is chosen at random with the properties of minimum iteration latency and minimum number of control steps. Time-steps in a witness schedule may be directly translated to states of a FSM controller.
COMPOSITE MODEL CONSTRUCTION
In this section, we describe, in bottom-up fashion, how a scheduling CMA is constructed through local specification, composition, and refinement. First, each operation of the DFG ($v \in V$ in the scheduling problem) is modeled by a small nondeterministic finite automaton or modeling NFA, MA, which represents local sequential behavior. Modeling NFA concepts, terminology, and applications are discussed in detail. Next, the Cartesian product of all MA forms an initial composite modeling automaton, a CMA, which is a completely connected state-transition graph. This initial CMA is iteratively refined by dependency, resource bound, and reachability constraints to prune illegal states and transitions. The final CMA only encapsulates causal, resource-bounded executions of the scheduling problem. Finally, the steps described in this section also serve as proof of correctness by construction as a CMA is specified, composed, and refined via application of formal statements.
Modeling NFA
Modeling NFA, MA, are the fundamental building blocks used in the composition of a CMA. Every operation $v \in V$ in the scheduling problem (not function unit) is modeled by its own instance of an MA. (This is a fundamental distinction from other symbolic modeling techniques as we are primarily modeling behavior rather than implementation.) An MA's sequential behavior captures the sequential behavior exhibited by the use of an appropriate function unit. In this way, an operation's local timing constraints are specified via its MA. Finally, an MA's state represents existence or nonexistence of result operands.
An MA may be thought of as a black box representing some function unit or design subsystem. Abstractly, it accepts and produces operands in some sequential fashion. Externally important events and operands are referenced using a labeling system. Table 1 lists and describes a subset of labels used in this paper. Labels consist of two grammatical parts: a subject followed by a predicate. Valid subjects include operands and resources. In the table, operands are represented by the generic name info, but are eventually replaced with unique operand names. (From a DFG perspective, DFG edges are operands.) Resources are represented by the generic name res, but are eventually replaced with appropriate resource names. Examples of resources are ALU, multiplier, bus1, etc. Valid predicates are defined in Table 1. A predicate's purpose is to describe subject properties. With this background, it should be clear that the label rv0 known identifies operand rv0 as present in the system and label ALU busy identifies that an ALU is currently occupied.

TABLE 1 Label Definitions
A simple MA might specify the use of a combinatorial ALU. This behavior requires two input operands at the beginning of a clock cycle and produces a result operand by the end of the clock cycle. Furthermore, to do this requires occupying one ALU hardware resource. Fig. 4 shows a sparsely labeled MA representing this single time-step function unit. Execution begins in state(s) labeled opc unknown. If opa is present, opb is present, and an ALU resource is available, this machine may nondeterministically choose to transit to state(s) labeled opc known. The nondeterminism allows this MA to delay execution in favor of another MA when in a composition. This enables full exploration of the solution space and, hence, exact results.
Several properties of Fig. 4's MA are important to note. First, a synchronous system is assumed with clock-period activity and duration corresponding to all MA transitions (not states). (Typically, a transition corresponds to a clock period. This may be generalized to correspond to clock period phases and thus allows some types of operation chaining.) The transition labeled alu busy requires and occupies an ALU resource for the entire clock period. On the other hand, states correspond to system knowledge present at a clock edge. Second, state encoding and labels are distinct. In fact, one state or transition may be referenced by multiple labels and one label may reference multiple states or transitions. Third, this particular model [7] represents an acyclic computation of opc. In this acyclic model, operand opc may be computed only once and, in fact, persists forever in the system.
With looping behavior, an operation must be scheduled multiple times and, hence, iteratively produces result operands. Unfortunately, keeping track of multiple coexisting results for one particular operation quickly becomes complex and costly. In contrast to the conventional models, we choose to bound this complexity and cost in the formulation of a cyclic MA. At a minimum, two successively produced result operands (odd or even) must be distinguished. This can be done with two separate acyclic MA, but will require additional state. Instead, two acyclic MA are overlaid on one set of state encodings, as shown in Fig. 5. By doing this, a single state bit can distinguish between known/unknown operands in the odd or even iteration sense. Complexity and cost are bounded because an operand may only exist in one iteration sense at any given time-step. Since states labeled info known and info known~ are mutually exclusive by construction, it is impossible to simultaneously have knowledge of an operand in both iteration senses. The "pigeon hole" analogy provides another way to think of this. A cyclic MA reserves one pigeon hole per operand and each pigeon hole has room for one operand. Relaxing this model-imposed limit (i.e., allowing several possible instances of some operand to be present in a single clock cycle) may benefit some scheduling problems. In Section 5.2.1, we show how additional MA may be used to potentially improve solution quality at the expense of greater scheduling complexity.
In practice, an MA is often more complex than what is shown in Fig. 5. Fig. 6 shows a completely labeled two time-step nonpipelined MA for cyclic behavior. Notice that the known and known~ states are now separated by two transitions. Furthermore, each state and transition has a dual: a symmetric state, transition, or path in the opposite iteration sense. Duals are required for all cyclic MA in this formulation. A subtle change in labeling of Fig. 6 converts this to a two time-step pipelined MA for cyclic behavior. If the solitary busy labels are removed, then resource use only occurs during the required-labeled transitions. In this way, a bound on operands entering a resource class is imposed, as is the case with a simple pipeline.
Because an MA is nondeterministic, it easily encapsulates alternatives. For example, some DFG operation may be implemented by a three time-step multiplier or by two executions of a single time-step adder. This freedom may be directly specified in an MA, as shown in Fig. 7. Nondeterminism is exploited to provide two (or more) paths from a known to a known~ labeled state(s) and symmetrically vice versa. The better choice, if one is better, is discovered during exploration of the composition.

The MA applications discussed so far directly model sequential behavior of a DFG operation. A more general MA application models arbitrary local sequential constraints. A scheduling problem may represent a portion of a larger design that must interface to the remainder of the design via specific IO protocols. In this case, the IO protocol constraints may be represented as several MA and are elements of $P$. For example, suppose a designer knows that his subsystem must communicate through one IO port and alternate between input and output transactions. Furthermore, an arbitrary delay is permitted between input and output transactions.

In typical applications, a designer relies on an MA library of base sequential behaviors representing available IP blocks or function units. Additional MA are specified when new local sequential behaviors are encountered. Typically, designers strive to simplify subsystem interfaces and protocols in practical design. Because of this, only a handful of explicit states and transitions are usually needed to specify any subsystem MA. A completely connected nondeterministic Büchi automaton (accepting infinite sequences) with two symmetric halves (duals) and appropriate labels is specified. The designer uses nondeterminism in two ways: to specify places where arbitrary delay may occur and to specify alternatives in sequential behavior.
When behavior becomes too complex to specify with a handful of states and transitions, the behavior can often be decomposed into interacting simpler processes represented by several MA.

For a transition $(s, s') \in \Delta$, $s'$ is a successor of $s$. $I$ is the set of all operands, $info \in I$, produced or consumed external or internal to an MA. $R$ is the set of all resource types (function units, buses, local registers, etc.), $res \in R$, which may be required in production of an MA's operands. $L$ is the set of all labels, $label \in L$, used by an MA. Labels identify state or transition sets of an MA with particular properties. A $label$ is of the form subject predicate, where subject is an element of $I$ or $R$ and predicate is an appropriate property as described in Table 1. $LS$ is a family of labeled sets of states: $S_{label} \in LS$, where $S_{label} \subseteq S$, is the set of states referenced by $label$. A single state referenced by $label$ is denoted $s_{label}$. $LT$ is a family of labeled sets of transitions: $\Delta_{label} \in LT$, where $\Delta_{label} \subseteq \Delta$, is the set of transitions referenced by $label$. A single transition referenced by $label$ is denoted $\delta_{label}$. When referring to all transitions to or from a set of states $S_{label}$, the notations $(-, S_{label})$ and $(S_{label}, -)$ are used, respectively. Finally, a path in an MA is a potentially infinite sequence of states, $s_0, s_1, s_2, \ldots$, such that each successive pair of states satisfies $(s_i, s_{i+1}) \in \Delta$.
The Composite Modeling Automaton
The behavior of each DFG operation vertex $v \in V$ belonging to the scheduling problem SP is modeled by an MA. The Cartesian product of all these MA forms an initial composite modeling automaton, a CMA. Suppose that each DFG vertex from the example in Fig. 1 is modeled by a single time-step cyclic MA from the bottom of Fig. 5. Fig. 3 explicitly shows the example's CMA. (This technique is fully symbolic and implicit ROBDD representation is used in practice.) The state of each MA is still represented by one bit in each CMA state vector ordered v2, v1, v0. To illustrate, consider the transition from state 100 to state 001. The bit representing v2 changes from 1 to 0, which identifies that v2~ is scheduled (inputs required, resource busy, etc.) on that edge. Also, the bit representing v0 changes from 0 to 1, which identifies that v0 is scheduled. Finally, notice that the CMA in Fig. 3 is …

A labeled state set $S_{label}$ is created for each $label \in L$ of a CMA. The set union of all $S_{label}$ forms the composite family of sets $LS$. Likewise, a labeled set $\Delta_{label}$ ($\in LT$) for a CMA is constructed as $\Delta_{label} = \bigcup_{m \in M} m.\Delta_{label}$ when $m.\Delta_{label}$ exists, where $M$ is the set of composite-member MA. A labeled transition set $\Delta_{label}$ is created for each $label \in L$ of a CMA. The set union of all $\Delta_{label}$ forms the composite family of sets $LT$.
All sets of states, $S \in MA$ and $S \in CMA$, are encoded with a set of binary vectors $\{True, False\}^n$. When computing the Cartesian product, binary state vectors are shifted so that state bit encodings of different MA never share variables and, hence, never overlap in the composite ROBDD structure. Hence, a complete (never empty or partial) Cartesian product always results. Since each MA is encoded with a unique set of binary variables, all functionality, including $LS$ and $LT$, of some composite-member MA is preserved and available in a CMA. Consequently, a CMA's states and transitions can still be referenced by composite-member MA labels. For example, the CMA transition set $\Delta_{opc\ required}$ still refers to required-labeled transitions of an MA requiring the operand named opc. When an MA is used to model an operation, local operand names such as info are replaced with appropriate global names. After the initial composition, a CMA is iteratively constrained by refinement and, hence, $CMA_i \neq CMA_j$ for iterations $i$, $j$ when $i \neq j$. For brevity and clarity, a CMA is often denoted without any iteration index and the particular refinement iteration must be inferred from the description.
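The composition step can be illustrated with an explicit (non-symbolic) sketch for the working example. Explicit enumeration is used here only because the example has three state bits; the paper's implementation is implicit and ROBDD-based. The helper names `compose` and `labeled`, and the convention that bit value 1 means "known in the even sense", are assumptions consistent with the description of Fig. 3.

```python
from itertools import product

# A single time-step cyclic MA has one state bit:
# 1 = result known in the even sense, 0 = known in the odd sense (known~).
MA_STATES = (0, 1)
ORDER = ("v2", "v1", "v0")          # bit order of the CMA state vector


def compose(order):
    """Cartesian product of one MA per operation: the initial, completely connected CMA."""
    states = list(product(MA_STATES, repeat=len(order)))
    transitions = {(p, n) for p in states for n in states}
    return states, transitions


def labeled(transitions, order, op, local):
    """CMA transitions whose projection onto the MA of `op` is the local transition `local`."""
    i = order.index(op)
    return {(p, n) for (p, n) in transitions if (p[i], n[i]) == local}


states, delta = compose(ORDER)
v0_even = labeled(delta, ORDER, "v0", (0, 1))    # v0 scheduled in the even sense
v2_odd = labeled(delta, ORDER, "v2", (1, 0))     # v2~ scheduled
```

Because every MA keeps its own bit position, a member MA's labels remain usable on the composite; the transition from 100 to 001 belongs to both `v0_even` and `v2_odd`.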
Dependency and Capacity Constraints
Every path in the working example traverses a v0 edge before a v1 edge. This is because v1 depends on the result produced by v0. In the initial completely connected CMA, an edge exists from state 000 to state 010. The transition from 0 to 1 in the v1 bit position indicates that operation v1 is scheduled. However, v0 is a 0 in the present state, indicating its result operand is known~. Consequently, the result operand that v1 depends upon is not available in the correct iteration sense and v1 cannot be scheduled. This edge and other acausal edges are pruned from a CMA by dependency refinements.
In … $(0, -)$ in the odd iteration sense. This too must be built for both iteration senses. Dependency implications are built for all edges $(u, v) \in E$ in the scheduling problem and intersected with a CMA's transition relation $\Delta$ to prune all acausal edges.
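A minimal explicit sketch of dependency refinement on the working example follows. It is an illustration under stated assumptions rather than the paper's ROBDD construction: transitions are enumerated as pairs of bit vectors, and the interiteration edge from v2 to v1 is assumed to require the producer's result in the opposite iteration sense. The names `EDGES` and `causal` are hypothetical.

```python
from itertools import product

ORDER = ("v2", "v1", "v0")                      # bit order of the state vector
IDX = {op: i for i, op in enumerate(ORDER)}
# (producer, consumer, kind): "intra" = same iteration, "inter" = previous iteration
EDGES = [("v0", "v1", "intra"), ("v1", "v2", "intra"), ("v2", "v1", "inter")]


def causal(present, nxt):
    """True if every operation scheduled on this edge finds its operands available."""
    for producer, consumer, kind in EDGES:
        if present[IDX[consumer]] != nxt[IDX[consumer]]:     # consumer is scheduled
            sense = nxt[IDX[consumer]]                       # 1 = even, 0 = odd
            needed = sense if kind == "intra" else 1 - sense
            if present[IDX[producer]] != needed:
                return False
    return True


STATES = list(product((0, 1), repeat=len(ORDER)))
delta = {(p, n) for p in STATES for n in STATES}              # initial CMA
delta_dep = {(p, n) for (p, n) in delta if causal(p, n)}      # dependency refinement
```

The acausal edge from 000 to 010 discussed above is removed, while the edge from 100 to 001 (v2~ and v0 scheduled together) survives.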
For a single labeled transition and single labeled state, dependency is modeled by the implication,
In an ROBDD, implications may be built as p A q pq. This construct insures that a transition labeled info required may only occur if the required information is present in the system. As before, the labeled state s info known in the present state indicates that the required information exists in the system. Another way to view this is that a present state labeled s info known enables an information accepting info required Eleled transition. For a single operand, info, (1) must be built for all appropriately labeled transitions and states as
Let (2) be built for one piece of information, info. The refinement of $\Delta$ for a CMA is
Equation (3) only shows refinement of $\Delta \in CMA$. All refinements are defined on $\Delta$, while refinements of other sets of any MA are implicit. (In practice, the state set of a CMA is never explicitly stored, but always determined from $\Delta$.)
Operation v0 has no dependencies. It is possible that v0 may be scheduled in the even iteration sense and then in the odd iteration sense before v1 has a chance to use the first even iteration result. A capacity constraint ensures that a particular operand has been consumed by all dependents before it is forgotten. The capacity constraint of v0 on v1 is built as $(0,1)_{v0} \rightarrow (-,0)_{v1}$. This implication says that if v0 forgets its result, then all dependents must have accepted the result by the next time step. This can be expressed formally as
As with the dependencies, capacity refinements in both iteration senses are built for each edge of the scheduling problem and intersected with $\Delta$ to prune invalid transitions and states.
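The capacity rule can be sketched as a filter on explicit transitions. This is a minimal illustration, assuming one bit per operation whose value names the sense of its latest result; `capacity_ok` and `refine` are hypothetical helper names, not the paper's ROBDD formulation.

```python
def capacity_ok(present, nxt, producer, consumer):
    """Capacity of `producer` on `consumer` for one CMA edge (present -> nxt)."""
    forgets = present[producer] != nxt[producer]   # producer toggles, old result is lost
    if forgets:
        # the consumer must hold the old sense in the next state,
        # i.e. it has accepted the old result by the next time step
        return nxt[consumer] == present[producer]
    return True

def refine(delta, edges):
    """Keep only transitions satisfying capacity for every (producer, consumer) edge."""
    return {(s, t) for (s, t) in delta
            if all(capacity_ok(s, t, p, c) for p, c in edges)}
```

For states written as (v0, v1), the transition from (1, 0) to (0, 0) is rejected because v0 would forget its even result before v1 has used it, whereas the transition from (1, 0) to (0, 1) is kept.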
Resource Bounds
Since both v0 and v1 require an adder resource and only one adder resource is available, it is illegal to assign these operations to the same time-step. Enforcing this corresponds to removing edges from the CMA where both v0 and v1 are busy (bits for v0 and v1 are changing simultaneously). This may be built by enumerating all combinations of 0 up to bound busy-labeled transitions for a particular resource class. At first glance, this constraint appears to be exponential, but it requires only time and nodes proportional to $bound \times |\Delta_{\text{resource busy}}|$ when built as an ROBDD. Let $A_{\text{res busy}}$ be the set of all combinations of at most bound transitions $\in \Delta_{\text{res busy}}$. Then,
This construct may be generalized to bound other types of transition and state concurrency in a CMA. In the working example, transitions requiring external data transfers may be labeled and bounded to impose an external bus bandwidth constraint. States labeled info stored may be bounded to limit the number of "live" intermediate results and, hence, local storage. In this way, register lifetime analysis is implicitly encapsulated in a CMA. Finally, concurrency constraints need not be applied in a flat, single-level manner to one type of resource. Instead, concurrency constraints can be applied in a hierarchical fashion. This allows complex sharing and overlap of resource use. When applied to interconnect, these types of hierarchical concurrency constraints have been used to guide multiplexer use.
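The claim that an "at most bound" constraint stays small can be checked with a layered counter. The sketch below counts one decision node per pair (variable index, busy count so far), which is an upper bound on the nodes of the corresponding reduced diagram; the function names are hypothetical and the construction is an illustration, not the paper's implementation.

```python
def at_most(bound, bits):
    """True if at most `bound` of the busy bits are set."""
    return sum(bits) <= bound

def diagram_nodes(num_vars, bound):
    """Internal nodes of a layered decision diagram for at_most:
    one node per (variable index, count of busy bits seen so far)."""
    nodes = set()
    frontier = {0}                       # counts possible before testing variable 0
    for i in range(num_vars):
        nxt = set()
        for used in frontier:
            nodes.add((i, used))
            nxt.add(used)                # variable i is 0
            if used + 1 <= bound:
                nxt.add(used + 1)        # variable i is 1, still within the bound
            # used + 1 > bound leads directly to the False terminal
        frontier = nxt
    return len(nodes)
```

The node count never exceeds (bound + 1) times the number of variables, so it grows linearly in each parameter even though the number of allowed combinations grows combinatorially.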
In typical applications, the labels info stored and info known are used interchangeably. This results in reduced state, but leads to implied model limitations. Once an operand is computed, it must remain in storage until all sinks have consumed it. If the possibility for recomputation of a result is desired, an MA with additional state distinguishing the scheduling status (info known) of the operand from its presence in memory (info stored) must be used.
Reachability
The state $ss \in CMA$, where every composite-member MA is $\widetilde{\text{known}}$, that is, known in exactly one sense (conceptually unknown in the other sense), is the system starting state or initial entry of the loop. As refined so far, a CMA still contains states and transitions that cannot be reached from this initial state and, hence, are acausal. Fig. 9 shows a three-operation loop DFG and acausal CMA states and transitions. It appears as if a new iteration result (v3) is produced every time step. However, the interdependency edge from v3 to v1 requires that one iteration complete before the next iteration may begin. Hence, a minimum iteration latency schedule must require at least three time steps. If we compute the set of reachable states [10] starting from causal state ss, then acausal states and transitions, such as those shown in Fig. 9, will never be reached and may be pruned from $\Delta$.
Symbolic image, preimage, and reachability computations are important for the reachability refinement as well as CMA exploration. Given the relation $\Delta \in CMA$, all next states of some labeled state set may be determined by the expression
Given SS, the set of reachable states RS is determined with (9). A CMA is refined by RS with
For the example in Fig. 9, the reachable state computation is graphically shown in Fig. 10. This demonstrates that acausal states are not reachable from starting state set SS. Since states in SS contain no knowledge of any result operands, it is impossible for some operand farther along in the dependency chain to suddenly become known in the appropriate sense, as all dependency edges must be satisfied in order during reachability.
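The reachability refinement can be sketched as a standard image fixed point over an explicit transition set. This is a minimal illustration with sets of pairs standing in for the ROBDD relation; the helper names `image`, `reachable`, and `refine_by_reachability` are hypothetical.

```python
def image(delta, states):
    """All next states of `states` under the transition relation `delta`."""
    return {t for (s, t) in delta if s in states}

def reachable(delta, start):
    """Least fixed point: every state reachable from the starting set."""
    reached = set(start)
    frontier = set(start)
    while frontier:
        frontier = image(delta, frontier) - reached   # only newly found states
        reached |= frontier
    return reached

def refine_by_reachability(delta, rs):
    """Drop transitions that touch any unreachable state."""
    return {(s, t) for (s, t) in delta if s in rs and t in rs}
```

For example, with transitions 0 to 1, 1 to 2, 2 to 0, 3 to 0, and 3 to 4 and starting set {0}, states 3 and 4 are never reached, so both transitions leaving state 3 are pruned.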
COMPOSITE MODEL EXPLORATION
Although a CMA now encapsulates all valid executions, it is desirable to find schedules within a CMA that meet an objective function. This section describes one exploration technique for finding such schedules. (As a CMA contains all valid executions of any sequential length, other exploration techniques are appropriate for other scheduling problems and circumstances.) This exploration technique's objective function jointly minimizes iteration latency (i.e., time-steps between loop iterations) and control-steps (i.e., states in a synthesized FSM). In terms of a CMA, this objective necessitates finding minimum length repeatable paths in a CMA in which all desired operands are produced (i.e., all operations scheduled). This corresponds to finding Fig. 3's dashed path. To do this, a starting state set and termination state set are determined through a loop-cutting step. Symbolic reachability is then used to find all shortest paths from start to termination. Next, the repeatability pruning step ensures only repeatable shortest paths remain. Finally, a deterministic shortest repeatable path, called a witness schedule, is extracted at random. A witness schedule represents steps of a single execution sequence which may be directly translated into an FSM controller.
Loop Cutting
The loop-cutting step determines a starting state set. Although the dashed path in Fig. 3 represents a steady-state scheduling solution, it specifies no starting state. Ideally, one would simply pick some state along this path as an exploration starting state. Unfortunately, this would require prior knowledge of the path. Instead, a loop cut provides an overestimation of this ideal starting state. Consider a valid loop iteration. During this iteration, all operands are produced at least once in the class of problems presented here. Suppose only one arbitrary operand, lc, is picked as the loop cut. By the problem definition, this operand must be produced (i.e., its producing operation is scheduled) on all valid loop iterations. Hence, if all transitions in a CMA which produce lc in the even sense are determined, successor states to these transitions create the even loop-cut starting state, LCS. A state from every optimal steady-state loop path is contained within LCS. As an example, picking operand rv2 (operation v2's result) as Fig. 3's loop cut results in $LC = \{110, 111\}$. Formally, this is
(Some scheduling problems contain noniterative operands such as precalculated coefficients. In this case, lc must be an iterative operand. Furthermore, if noniterative operands are present, a mix of cyclic and acyclic models must be used, as well as modified exploration.) The dual of $LC$, $\widetilde{LC}$, is the termination set. Because of the encoding choice used so far, a dual state is found by bitwise complementation. For the working example, the dual of $LC$ is $\widetilde{LC} = \{001, 000\}$.
Conventional scheduling techniques do not always guarantee optimal steady-state loop behavior. Instead, search typically starts from a natural system starting state, proceeds through the initial iteration, and then enters a steady-state behavior based on this initial iteration. Basing the steady-state behavior on this initial iteration may lead to suboptimal solutions. On the other hand, determining optimal steady-state behavior is challenging as no obvious end or beginning exists. The questions of where and how much loop iteration overlap should exist in an optimal steady-state solution usually lead to circular reasoning. In this respect, the technique presented here cleanly finds and guarantees an optimal (indeed every optimal) steady-state loop kernel since all possible executions are considered in the cut.
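The loop cut and its dual can be sketched with states written as bit strings. This is an illustrative sketch: the position of the lc bit (position 0 here) and the small transition set in the usage note are assumptions, and `loop_cut`, `dual`, and `dual_set` are hypothetical helper names.

```python
def dual(state):
    """Dual state by bitwise complementation of the state string."""
    return "".join("1" if b == "0" else "0" for b in state)

def dual_set(states):
    """Dual of every state in a set."""
    return {dual(s) for s in states}

def loop_cut(delta, lc_bit):
    """Successor states of all transitions on which the lc bit goes 0 -> 1,
    i.e. on which lc is produced in the even sense."""
    return {t for (s, t) in delta if s[lc_bit] == "0" and t[lc_bit] == "1"}
```

Applied to the sets quoted in the text, `dual_set({"110", "111"})` yields `{"001", "000"}`. With hypothetical transitions 010 to 110, 011 to 111, and 110 to 100, `loop_cut(delta, 0)` returns `{"110", "111"}`.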
Initial Execution Sequence Sets
Shortest paths from states in $LC$ to states in $\widetilde{LC}$ represent potential minimum iteration latency execution sequences. An implicit version of Dijkstra's algorithm is used to find shortest paths, assuming all transitions in a CMA are weighted one time-step. The search begins with a copy of $LC$ renamed as time-step set 0. An initial set of reached loop-cut termination states is also maintained. The execution sequence set contains path(s) requiring n time-steps from states in $LC$ to states in $\widetilde{LC}$. It is important to note several properties of the execution sequence set. First, paths may exist which do not lead from $LC$ to $\widetilde{LC}$. In the example, it is possible to reach state 101 by time-step 2, yet the dual, 010, is not present in time-step set 0. Second, execution sequences for one iteration, not two iterations as contained in a CMA, are represented. As will be shown, a steady-state loop kernel may still be found from just this depth of exploration. Finally, no guarantee is made that all paths actually create all required operands at least once. Additional information, described below, must be added to guarantee this.
DFGs may consist of multiple independent graphs sharing one resource set. In this case, if a loop cut is selected from independent graph g1, $LC$ will contain all possible state senses for graph g2 by virtue of its independence. This results in paths from $LC$ state(s) to $\widetilde{LC}$ state(s) which schedule the first graph but stall in both senses for the other independent graph and, consequently, never schedule a single operation from g2. To circumvent this scenario, additional ROBDD variables are added to $LC$ and $\widetilde{LC}$ which "tag" the sense of each external output operand. These do not change during traversal and, hence, provide a record of the initial sense of an output operand. Consequently, only paths producing all output operands at least once are found. In the working example, the only external output is rv2 produced by operation v2.
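With unit-weight transitions, the shortest-path search reduces to a layered breadth-first traversal that records one state set per time-step. The sketch below is a simplified illustration that omits the output tags; the function names and the transition set in the usage note are hypothetical, loosely echoing transitions mentioned in the text.

```python
def image(delta, states):
    """All next states of `states` under `delta`."""
    return {t for (s, t) in delta if s in states}

def execution_sequence_set(delta, lc, lc_dual, max_steps=64):
    """Time-step sets from the loop cut until some termination state is first reached.
    Returns a list of sets (index = time-step), or None if no termination state is found."""
    layers = [set(lc)]
    while not (layers[-1] & lc_dual):
        if len(layers) > max_steps:
            return None                  # give up beyond the step budget
        nxt = image(delta, layers[-1])
        if not nxt:
            return None                  # dead end: nothing left to explore
        layers.append(nxt)
    return layers
```

With transitions 110 to 100, 100 to 001, 100 to 000, 100 to 101, and 111 to 110, starting from {110, 111} the search stops at time-step 2. The final set still contains 101 and 100, which are not termination states, illustrating that not every recorded path leads from the cut to its dual.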
Repeatable Execution Sequence Sets
Paths exist in the execution sequence set which do not repeat or do not repeat in the required number of time-steps. To illustrate, the example's execution sequence set contains a path from 110 to 100 to 000. Unfortunately, there is no path from 000 which completes an iteration in two time-steps. (This can be determined by considering the dual. There is no path starting from 111 in the execution sequence set.) A fixed-point algorithm prunes the execution sequence set so that only repeatable shortest paths remain. Backward pruning traversals are applied to the execution sequence set until time-step set 0 equals, as duals, time-step set n. Hence, by a symmetry argument, only repeatable iteration legs of at most length n remain. After repeatable pruning is applied to the working example, the repeatable execution sequence set is $\{1110_0,\ 1100_1,\ 0101_2\}$.
Algorithm 2. Repeatable pruning fixed-point.
This guarantees that all paths originating in time-step set 0 (states of $LC$) always reach state(s) in $\widetilde{LC}$ in at most n time-steps. Furthermore, for any path terminating at some state $\tilde{s} \in \widetilde{LC}$, there exists a state $s$ in time-step set 0 such that s always initiates path(s) reaching $\widetilde{LC}$ with length n. This is true by way of a CMA's symmetry and repeatable pruning.
... to $\tilde{s}$ and, by symmetry, from $\tilde{s}$ back to $s$. This steady-state loop schedule has an average iteration latency of four but requires eight control steps. This path, where $\tilde{s}$ was first reached at time-step 3, is preserved in $\widetilde{LC}$. Note that if the best possible average minimum iteration latency schedule is desired, the execution sequence set must extend to an n such that all reachable states may be included in $\widetilde{LC}$. In this way, all repeatable paths are represented and the best average combination is guaranteed.
If the repeatable pruning fixed-point fails at time-step n, it indicates that iteration leg(s) with iteration latency of n were found (i.e., all output operands were produced once), but no compatible iteration legs with iteration latency n exist to sustain loop execution. A preserved copy of the execution sequence set can be extended by one time-step with (13) and (14) and repeatable pruning attempted again. Note that no reached $\widetilde{LC}$ termination states are propagated forward.
As shown, a repeatable execution sequence set of cardinality n is the set of all loop iteration legs with iteration latency n such that any of these iteration legs may be used as part of a sustainable loop schedule in which all other iteration legs (if required) also belong to this repeatable execution sequence set.
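One plausible reading of the repeatable pruning fixed-point can be sketched on explicit time-step sets. The sketch assumes the layered sets come from a unit-step search, that duals are bitwise complements, and that backward and forward pruning passes alternate with dual matching until nothing changes; the exact steps of the original algorithm are not recoverable here, and the names and transitions used are hypothetical.

```python
def dual(state):
    """Dual state by bitwise complementation of the state string."""
    return "".join("1" if b == "0" else "0" for b in state)

def repeatable_prune(delta, layers, lc_dual):
    """Prune time-step sets until set 0 and set n match as duals."""
    layers = [set(layer) for layer in layers]
    layers[-1] &= set(lc_dual)                     # only termination states may end a leg
    changed = True
    while changed:
        before = [frozenset(layer) for layer in layers]
        # end states must be duals of surviving start states
        layers[-1] &= {dual(s) for s in layers[0]}
        # backward pass: keep states with a successor in the next set
        for i in range(len(layers) - 2, -1, -1):
            layers[i] = {s for s in layers[i]
                         if any((s, t) in delta for t in layers[i + 1])}
        # forward pass: keep states with a predecessor in the previous set
        for i in range(1, len(layers)):
            layers[i] = {t for t in layers[i]
                         if any((s, t) in delta for s in layers[i - 1])}
        changed = before != [frozenset(layer) for layer in layers]
    return layers
```

On the hypothetical transitions 110 to 100, 100 to 001, 100 to 000, 100 to 101, and 111 to 110, the path 110, 100, 000 is discarded because the dual of 000, namely 111, cannot start a two-step leg, leaving only 110, 100, 001. An empty result would correspond to the fixed-point failing at that n.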
A Witness Schedule
Although a repeatable execution sequence set contains many schedules to choose from, it is often desirable to find a path from some state s directly to its dual $\tilde{s}$, as it represents an FSM controller with the number of control steps equal to the minimum iteration latency. (This is a joint iteration latency/control depth minimization objective since schedules meeting other objectives such as average latency also exist in a CMA.) This may be done by adding enough additional information (tags) to every state s in time-step set 0 such that the identity of a parent state s may still be determined for children states in time-step set n. In this way, any child state $\tilde{s}$ in time-step set n whose tag encoding equals its state encoding, as duals, identifies a path from some state s directly to its dual $\tilde{s}$. Unfortunately, adding this much additional tag information at once is costly. Instead, this idea is implemented iteratively. First, a small number of state bits (5-10) for all states s in time-step set 0 are tagged to record their initial value. Next, a repeatable pruning fixed-point leaves only repeatable paths from parent states in time-step set 0 to children states in time-step set n in which parent and children states match as duals for the tagged portion of the state vector. These two steps are repeated until all state bits have been tagged. This results in a directly repeatable execution sequence set.
Given a directly repeatable execution sequence set, an arbitrary state s from its time-step set 0 is picked as a witness. A repeatable pruning fixed-point is applied with only s used as a new time-step set 0 and $\tilde{s}$ as a new time-step set n. This produces a set containing all valid executions from s to $\tilde{s}$. A single execution is picked and is called a witness schedule. This single schedule is for steady-state loop behavior and may be directly translated to FSM control steps.
Loop entry and exit sequences must still be determined, as this witness schedule represents only the steady-state loop kernel. A straightforward way to determine entry/exit sequences is to simply ignore operations from previous/next iterations in the witness schedule during loop entry or exit. This produces adequate, although not necessarily minimum, length entry/exit sequences. Alternatively, minimum length entry/exit sequences may be determined by finding shortest paths from/to the natural system starting states SS and any state in the witness schedule.
EXPERIMENTS
Experimental results are reported for several traditionally referenced looping DFG benchmarks. A case study shows how a designer can use symbolic scheduling in a practical setting. Complexity issues are discussed. Results for five large synthetic benchmarks demonstrate scalability. Finally, an industrial example of meaningful scale and complexity illustrates practical application.
EWF Case Study
We use the elliptic wave filter, EWF, a common looping DFG benchmark [7], [20], [24], as a case study to demonstrate how a designer might interact with symbolic scheduling. Suppose a designer needs to implement EWF using a particular standard cell and IP-block library. Given the nature of EWF, the designer decides to explore reuse of the IP block shown in Fig. 12. Internally, this IP block contains an optimized 3-stage pipelined floating-point multiplier, a single time-step floating-point ALU, a small coefficient ROM, and one multiplexer. The timing of the multiplier's third stage and the ALU is such that they may be chained in one clock cycle. The output of the ROM is hardwired to one input of the multiplier. The multiplexer allows one external input to bypass the multiplier and directly feed the ALU. Depending on the control settings of the bypasses, this IP block may implement three functions: multiply by coefficient, multiply by coefficient and accumulate, and add.
The designer codes the EWF algorithm at an abstract level (< 100 lines) and specifies several appropriate MA (again < 100 lines). Symbolic scheduling accepts this input and determines all minimum iteration latency schedules. Table 2 summarizes results for this exploration while varying available IP blocks. At this point, the designer has the freedom to explore other IP options and configurations. Suppose he decides that a configuration with one IP block and one additional adder provides acceptable performance with a small resource contingent as the iteration latency, 18, is equivalent to using two IP blocks. Fig. 13 shows what type of local storage and interconnect the designer has in mind. A bank of registers stores intermediate results. Any of these registers connects to a function block input or output through a limited number of busses. The single IO port, which feeds bus structure 2, permits communication to and from the function blocks via the register bank.
After editing the EWF description and model files (approximately 20 edited lines), the designer now experiments with various register and bus constraints. Several fast iterations of symbolic scheduling provide the data shown in Table 3. Given the existing 1 IP-block and 1 ALU constraints, execution of EWF is impossible with fewer than nine registers and no improvement occurs for more than nine. Varying available buses does vary iteration latency. The designer has a trade-off decision and opts to reduce interconnect at the expense of iteration latency by choosing the 3/2 bus solution shown in bold. (Even with three buses, both function blocks may simultaneously begin execution as a single operand may broadcast on one bus to multiple function block inputs.) Once the designer decides on a final constraint configuration, symbolic scheduling provides an optimal loop-pipelined witness schedule (control sequence) which may be directly synthesized into an FSM. Although the final selected optimal solution has an iteration latency slightly greater than what is commonly reported as optimal for EWF, it incorporates practical and important interconnect, memory, and IO-protocol constraints necessary for a realistic design.
Table 4 summarizes results for several traditional benchmarks. All results were produced on an Intel-donated 866 MHz Xeon PC running Linux. As required computation resources are often a concern with symbolic and exact techniques, we report CPU seconds, the imposed ROBDD memory model, and peak ROBDD node usage. Reported time and memory use includes all symbolic scheduling steps from parsing the scheduling problem, model construction, composition, refinement, and exploration, to finally printing a single witness schedule. Peak ROBDD nodes indicates the maximum number of ROBDD nodes (one node requires 16 bytes) kept by the manager at any point during symbolic scheduling.
The constraint configuration column lists resource bounds, such as one two-time-step pipelined multiplier (one 2-ts piped mult), imposed on the scheduling problem. Finally, when better average iteration latency solutions exist in the CMA, a lower iteration latency bound is reported in parentheses.
Traditional Benchmarks

Elliptic Wave Filter
The first row, EWF1_IP, shows results for a highly constrained single EWF loop using an IP block from the previous case study. The next two rows, EWF1_a and EWF1_b, duplicate previously reported optimal results [20] for a single pipelined EWF. The run times are fast enough that symbolic scheduling can easily be used in an iterative design environment.
As mentioned in Section 3, allowing only one MA per vertex in SP limits solutions by permitting only one instance of any particular operand to exist at any time-step. By unrolling a DFG, this complexity bound may be directly controlled. EWF2 is one unrolling of EWF so that two MA model each vertex in SP and, hence, a maximum of two live instances of any operand are permitted. Although CPU run times increase by a factor of three to four, latency does not improve, indicating that EWF cannot benefit from more aggressive loop pipelining.
EWF contains several long interiteration data dependencies which prevent much benefit from loop pipelining. (Iteration latencies for optimal pipelined solutions are typically only one time-step improved over nonpipelined solutions.) Still, hardware resources may be underutilized. By scheduling two independent parallel copies of EWF under one set of resources, additional resource utilization may be realized. This effectively models two independent EWF streams executing on a single hardware subsystem. Two independent EWF scheduling problems are built as a single CMA with a unified set of resource bounds. The EWF1x1 row presents results for this experiment. A resource set of three single time-step adders and two two-time-step pipelined multipliers is used. This resource set is close to ideal for a single EWF loop, as reducing available resources negatively impacts scheduling results, but increasing available resources does not improve scheduling results. Furthermore, an IO protocol which limits the system to one external IO transaction per time-step and orders IO transactions between EWF copies is imposed. Although the iteration latency for a single EWF loop with this resource set is 16 time-steps, iteration latency for two parallel copies improves significantly to only nine time-steps. As there is considerable added freedom (both copies are independent except for IO ordering), additional computational resources are required. We are unaware of any other scheduling results reported for parallel EWF configurations.
Fast Discrete Cosine Transform
Rows 6-10 of Table 4 present results for a fast discrete cosine transform, FDCT. The FDCT benchmark is challenging for three reasons. First, it contains two independent loops. For cyclic behavior, this independence leads to a substantial expansion of the solution space, as solutions for every resource-compatible permutation of one loop with the other over all time-steps may be represented. FDCT1_a (acyclic) and FDCT1_c (cyclic) differ only in acyclic versus cyclic modeling and highlight this solution space expansion. Second, FDCT contains no interiteration dependencies. This freedom permits considerable pipelining, but further expands the solution space. Results for FDCT1_b, which contains no resource constraints, exhibit this freedom. Iteration latency is only 2 and is constrained only by the one live operand instance modeling state bound. (Since no resource constraints or interiteration dependencies exist, two copies of a final witness schedule may be directly translated to a single FSM with iteration latency of 1.) Finally, FDCT is highly symmetric. Each path through the DFG is similar in length and operation sequence to every other path. This too enlarges the solution space and, hence, instance representation cost. Due to the high symmetry of this problem, we imposed a partial IO ordering to eliminate representation of structurally symmetric solutions.
FDCT1_b illustrates a general scheduling complexity concept. Although this result is for a challenging benchmark, very limited computational resources are required. On the other hand, the same benchmark scheduled with resource constraints, FDCT1_c, requires the most computational resources of all reported experiments. This is expected as a scheduling problem with no resource contention such as FDCT1_b reduces to topological ordering. Hence, in the absence of resource contention, a straightforward ASAP list scheduler will always find optimal solutions and require no search. Even so, symbolic scheduling of FDCT1_b still requires some computational resources as all schedules, even those observing resource constraints, are encapsulated. In general, contention for resources makes scheduling hard. For symbolic scheduling, the most challenging cases occur when resource constraints tend to balance dependency constraints. When resource constraints either dominate or do not exist, symbolic scheduling is facile.
As FDCT1_c exhibits the greatest complexity, we use it as an example for a discussion on complexity and growth. Section 3 described how a CMA is constructed. This includes all MA construction as well as a CMA's composition, dependency/capacity constraint refinements, reachability refinements, and resource constraint refinements. Surprisingly, all this accounts for an insignificant use of computational resources. For FDCT1_c, only 4.9 seconds and 1,145,662 ROBDD nodes are needed to create a CMA with 553,670 ROBDD nodes in the transition relation, $\Delta$. The largest growth occurs during resource constraint refinement when $\Delta$ grows from 58,768 to 553,670 nodes. Note that at this point, $\Delta$ contains all valid resource constrained executions of the scheduling problem. After 123.6 seconds of ROBDD sift reordering, the size of $\Delta$ is reduced to 341,338 nodes. Most of the complexity occurs during the pruning and refinement steps described in Section 4. For FDCT1_c, finding the execution sequence set requires 200 seconds, while total ROBDD node usage peaks at 9,911,356. The largest ROBDD in the execution sequence set is 753,831 nodes. Determining a repeatable execution sequence set is not as costly in terms of node usage, but does require time. This fixed-point requires 476 seconds. The largest ROBDD in this set is only 36,109 nodes and peak node usage remains at 9,911,356 as no garbage collection is performed. Finally, finding all schedules with control steps equal to iteration latency requires 1570.5 seconds. The largest ROBDD in this set is 4,439 nodes and peak node usage does not increase. As a reward for this hard work, symbolic scheduling finds a 16 time-step solution for FDCT1_c, bettering the best previously reported result [23] of 17 time-steps for this benchmark with the same arithmetic resources. Table 5 shows a 16 time-step iteration latency solution for FDCT1_c.
Each time-step lists the operations that begin during that time-step as well as the result operands in local storage and available at the beginning of that time-step. For example, operation m36 begins in time-step 4. Since all multipliers are two time-step pipelined, it produces a result at the end of time-step 5. This is latched into local storage and ready to be used at the beginning of time-step 6. Operations and result operands are shown in both the even and odd ("~") iteration senses. Iteration sense toggles when looping back from time-step 16 to time-step 1. For instance, the result of $\widetilde{s11}$ is in storage at time-step 16, but appears as s11 in time-step 1. Although iteration latency is 16 time-steps, delay for one complete FDCT execution is 27 time-steps. This is the delay from the first odd operations, $\widetilde{4}$ and $\widetilde{s5}$ in time-step 3, through flipping sense when looping from time-step 16 to time-step 1, to the final, now even, operation, a40, in time-step 13+16.
Additional rows of Table 4 show more practical and useful configurations of FDCT than are typically reported. Two ALU resources rather than separate add and subtract resources are used. Also, one set of buses, where each bus is occupied in one way for the duration of a clock cycle, is used. FDCT1_g shows results for a minimal register contingent. FDCT1_h adds a two bidirectional IO port constraint. Reads and writes are assumed unbuffered. FDCT1_i adds the protocol constraint described in Fig. 8. Even with forced alternating reads and writes, schedules with excellent iteration latencies are found. FDCT1_j restricts IO ports to 1. Now, IO reads must be buffered and are modeled with several additional MA. All optimal solutions are readily found, even with this tight set of constraints. FDCT1_k forces a strict ordering of input and output operands communicated through one IO port. This models what might happen in a real design where samples cannot be reordered, but must be accepted and produced in sequence.
Although more registers are required, iteration latencies similar to loosely constrained configurations are still achieved. As these examples demonstrate, automata-based symbolic scheduling handles real-world design constraints, produces optimal results, yet requires acceptable computational resources. Table 6 compares Automata-Based Symbolic Scheduling, ABSS, with existing work. SST is a symbolic scheduler, while the others are heuristic. Where possible, register use is reported in parentheses. As can be seen, existing techniques produce optimal or near optimal results for these well-studied examples. For the heuristics, required computational time is typically a few seconds. For ease of comparison, only limited additional constraints are applied. In fact, only TCLP bounds register usage and only SST and Theda.Fold bound bus usage.
Other Work
Synthetic Benchmarks
Five larger synthetic benchmarks demonstrate the scalability of symbolic scheduling. A set of guidelines generated realistically shaped synthetic benchmarks. These guidelines ensure that the synthesized DFG is fairly connected, contains several interiteration dependencies, yet is reasonably random. All synthetic benchmarks contain 100 operations assigned to one of two resource classes. Resource class A consists of two single time-step units and resource class B consists of one two time-step pipelined unit. Each unit accepts two input operands and produces one output operand, as would be the case for an ALU and pipelined multiplier. A large number of synthetic benchmarks were produced and, from this pool, five finalists meeting the following two criteria were selected: First, a minimum iteration latency schedule requires loop winding, i.e., a nonpipelined schedule cannot be an optimum throughput schedule. Second, both dependencies and resource bounds must impact schedule solutions. Resource bounds are not made meaningless by tight dependency constraints and vice versa. Table 7 shows results for the five final synthetic benchmarks. A 128 MB memory model is used for all cases and computation times range from 15 to 154 seconds. These benchmarks, all academic benchmarks, as well as the ABSS source code are available on the web [11] for comparison with other scheduling techniques.
Industrial Example
Recently, symbolic scheduling was applied to a substantial industrial example. The assembly code for computing f x y on Intel's new Itanium architecture was statically scheduled. This 127-operation, nonlooping, controlless example is interesting because of the real-world peculiarities of the Itanium VLIW architecture. The Itanium contains six pipelined processors (m0, m1, i0, i1, f0, and f1), each of which may compute some subset of Itanium instructions. Depending on the type and complexity of the instruction, some instructions may only be assigned to one of processors m0, m1, i0, or i1, some may only be assigned to f0 or f1, some may only be assigned to i0 or i1, and so on. In the worst case, some instructions must be assigned to two processors, i0 and f0, concurrently. A hierarchical resource bound (Section 3.4) was constructed to enforce these complex constraints. Fig. 15 illustrates the six hierarchical resource bounds applied to model these constraints. They are hierarchical as an instruction which must be assigned to i0 and f0 concurrently must belong to five resource constraint groupings: ((i0), (f0), (i0, i1), (f0, f1), (m0, m1, i0, i1)). On the other hand, an instruction which may be assigned to i0, i1, m0, or m1 need only belong to one resource constraint grouping: (m0, m1, i0, i1). Through use of these six hierarchical constraints, all Itanium resource bounds were correctly modeled.
For this Itanium example, it was also necessary to model a communication penalty. This penalty varied depending on which child processor consumes a result operand. For example, one operation may compute a result which may be communicated to a second operation after a delay of > 5, but may only be communicated to a third operation after a delay of > 9. This type of sequential constraint was naturally modeled during specification of all MA. For instance, Fig. 16 shows an MA that requires one time-step to reach states labeled early known, yet requires two time-steps to reach states labeled late known.
Depending on the required communication delay, a child's input would be enabled by either early known or late known labeled states. This approach to modeling both computation of a result and communication delays within one MA was generalized to pipelined delays of up to nine time-steps. In total, computing all schedules, plus an optimal witness schedule with a latency of 81 time-steps, required 131 seconds and 9,910,334 nodes (256 MB memory model).
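The membership rule behind the hierarchical resource groupings of the Itanium example can be illustrated with a minimal Python sketch. It is not the symbolic formulation of Section 3.4: it is an explicit, per-time-step check written only to make the grouping rule concrete. The function names `groups_for` and `step_is_feasible` are hypothetical, only the five groupings named in the text are included (the sixth grouping of Fig. 15 is not listed in the prose and is not guessed here), and the capacity of a grouping is assumed to equal the number of processors it contains.

```python
from collections import Counter

# The five groupings named in the text. The sixth grouping shown in Fig. 15
# is not listed in the prose, so it is deliberately left out of this sketch.
GROUPS = [
    frozenset({"i0"}),
    frozenset({"f0"}),
    frozenset({"i0", "i1"}),
    frozenset({"f0", "f1"}),
    frozenset({"m0", "m1", "i0", "i1"}),
]


def groups_for(demands):
    """Return the groupings an instruction belongs to.

    `demands` is a list of sets. Each set names the processors that may
    serve one unit the instruction needs; several sets mean the units are
    needed concurrently. A demand is charged to every grouping that
    contains all of its allowed processors.
    """
    return [g for g in GROUPS if any(set(d) <= g for d in demands)]


def step_is_feasible(instructions):
    """Necessary check for one time-step (assumed capacity = group size).

    Each demand of each instruction adds one unit of load to every
    grouping that contains it; no grouping may exceed its processor count.
    """
    load = Counter()
    for demands in instructions:
        for d in demands:
            for g in GROUPS:
                if set(d) <= g:
                    load[g] += 1
    return all(load[g] <= len(g) for g in GROUPS)


# Instruction needing i0 and f0 concurrently, and one that may run on any
# of m0, m1, i0, i1 (both cases are taken from the text).
dual = [{"i0"}, {"f0"}]
any_int = [{"m0", "m1", "i0", "i1"}]
```

Under these assumptions, `groups_for(dual)` returns the five groupings listed in the text and `groups_for(any_int)` returns only (m0, m1, i0, i1). A time-step holding `dual` together with a second instruction restricted to i0 is rejected, because the (i0) grouping would carry two units of load.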
CONCLUSIONS
In this paper, exact looping DFG scheduling was performed without loop iteration counters or latency predefinition. Since the technique is based on a relatively general symbolic automata model, we inherit many techniques originally developed for model-checking-based verification. This provides substantial flexibility when compared with current scheduling approaches. For example, this model allows the cosimulation of an NFA to provide sequential constraints representing interface protocols of the system as well as its subcomponents. Further, we showed how to determine all minimal latency looping schedules for the case of minimal control steps. We applied the technique to several traditional benchmarks, a number of randomly generated design examples, a realistically constrained design example, and a substantial industrial example.
Automata-based symbolic scheduling provides a rich theoretical foundation for additional scheduling work. Parallel work [9], [11] has already addressed control-dependent scheduling, even for looping or pipelined behavior. As discussed in this paper, a CMA contains schedules with best average iteration latencies. New exploration techniques are needed to determine these. In a broader sense, other exact explorations and/or prunings of a CMA are possible and potentially beneficial. Additionally, heuristic explorations and/or prunings of a CMA may be helpful. Heuristics might be particularly beneficial when scheduling solution density is high or extremely symmetric. Currently, a single witness schedule is used to synthesize an FSM controller. It may be possible to synthesize superior controllers with dynamic behavior directly from a refined CMA or repeatable execution sequence set. Scheduling solutions are limited by the freedom specified in the original scheduling problem, SP. Implicit restructuring techniques (e.g., algebraic associativity, distribution, etc.) which explore additional freedom could be developed using nondeterministic alternatives both in an MA and in the construction of a CMA. Finally, the potential of abstraction, hierarchy, and partitioning should be explored. By design, an MA and a CMA are fundamentally equivalent structures. This enables a natural yet systematic route to scheduling extremely large problems via a hierarchy of refinement.
