This paper describes a new algorithm for generation of scheduling constraints in networks of communicating processes. Our model of communication intertwines the schedules of the machines in the network: timing constraints of a machine may affect the schedules of machines communicating with it. This model of communication facilitates the modular specification of timing constraints [1]. A feasible solution to the set of constraints generated gives a schedule for each machine in the network such that all internal constraints of each machine are satisfied and communication between machines is statically coordinated whenever possible. Static scheduling of communication saves on the cost of handshake associated with dynamic synchronization. Our algorithm can handle complex, state-dependent and cyclic timing constraints. Experimental results show that our algorithm is both effective and efficient.
Introduction
This paper describes how to generate scheduling constraints for communicating processes. The algorithm proposed in this paper generates a system of scheduling constraints that include constraints that are internal to each process as well as constraints that are necessary for coordinating communication between processes. A solution to the set of generated constraints provides a feasible schedule for each process such that communication between processes is statically coordinated. The advantages are twofold: the designer does not have to manually embed timing constraints in a process to account for timing properties of other processes, and the costs of full handshake are avoided whenever possible.
Our algorithm allows the design to include both dynamically and statically scheduled communication, and it automatically schedules the static communication. It can handle arbitrary control, including loops and gotos, as well as arbitrary path-activated constraints, which impose arbitrary control dependencies on the applicability of a timing constraint. Previous methods have been limited in the types of control they can handle (particularly cyclic control) and/or the types of scheduling constraints they can handle. Scheduling algorithms which cannot handle arbitrary control must make approximations, such as breaking cycles in the state transition graph and in the related constraints; such approximations can negate the advantages of statically scheduled communication.
Our approach schedules synchronous systems which will be implemented as traditional register-transfer automata [2]. Some previous work, described in more detail in Section 3, has studied the verification and synthesis of timing constraints in asynchronous systems. Scheduling of synchronous and asynchronous systems is very different because implementation delays are known in synchronous systems. While the arrival times of transitions in asynchronous machines are not always known, the times of events in synchronous systems are tied to the clock. The scheduling problem for synchronous systems is to choose a particular implementation by choosing a schedule; the related problem for asynchronous systems is to ensure that the implementation will work with any legal combination of delays. Static scheduling of synchronous systems allows the communication between ...

The next section describes our problem, scheduling of I/O in communicating machines, in more detail. Section 3 reviews related work. Section 4 reviews the behavior finite state machine model. Section 5 describes the main ideas and goals for scheduling of communication. Section 6 describes the different types of component constraints and their generation. Section 7 gives the algorithm for generating all the constraints. Section 8 discusses the solution of the system of constraints generated using linear programming. Section 9 describes how some allocation constraints can be included in our scheduling approach. Finally, Section 10 describes the results of experiments with our algorithm.
Problem formulation
Communication may be scheduled either statically or dynamically: communication times are static when the times of the communication events can be determined at compile time; communication scheduling is dynamic when the communication time is data-dependent and must be coordinated by handshaking during chip execution.
Consider the example of Figure 1: machine M1 wants to send a value to M2. Static scheduling chooses a time for each communication event, such that each component machine is in the right state to send or receive. Since both machines start from known states, each can predict the time to send or receive a value. However, the exact time of each event depends on the times of other input/output...
endif;
y <= 1;

We may want to specify different timing constraints from x <= 0 to y <= 1 depending on whether the then or else branch is taken. Such timing constraints will be referred to as path-activated. In addition to generating constraints to coordinate communication, we also need to incorporate those path-activated constraints whose activation condition can be exercised during network execution.
Static scheduling of communication is a key tool for high-level synthesis. Most systems of communicating machines include statically-scheduled communication. Requiring the designer to manually schedule that communication by adding and deleting states is both tedious and error-prone. Even if some communication in the system is dynamically scheduled, a complete synthesis system must provide efficient scheduling of static communication.
A feasible schedule for each process is found by formulating the system of generated constraints as an integer linear program (ILP). (A separate line of research [4] has developed a graph-based algorithm which can solve these types of ILP problems much more quickly than can standard ILP packages.) To generate the constraints in the ILP, we must search the execution trace of the communicating machines to identify communication events.
We start out only with constraints within each component machine: there is no a priori information on the communication between machines. Because the constraint generation algorithm analyzes the network to identify communication, the designer's task becomes much simpler: he or she can design each component machine relatively independently, relying on the scheduling algorithm to properly implement communication. The constraints which ensure that the input and output which form a communication occur together are similar in spirit to structural don't-cares in Boolean networks: the communication constraints are induced by the mutual behavior of the network's components and cannot be deduced by examining a component in isolation.
Static scheduling of communication is an important and useful tool precisely because the constraints required to implement the communication are properties of the system. When any one of the components is modified, the communication constraints with all the other component machines may change; finding those constraints manually each time the design is changed is tedious and error-prone. Static scheduling is useful when one component machine is fully scheduled and the other is not; in this case, scheduling constraints will be found which ensure that the unscheduled machine's I/O behavior matches that of the fully scheduled one. The same static scheduling algorithm can also be used to jointly schedule two machines which both have some scheduling freedom. In this case, rather than over-constrain the scheduling problem, our algorithm finds the minimum set of constraints required to implement their full communication, leaving freedom to schedule the machines to optimize their associated datapaths or other aspects of the design.
The constraint generation algorithm used to schedule the system must identify matching input and output events, identify events which cannot be scheduled statically, identify which constraints can be exercised along each execution path, and generate all appropriate constraints. Figure 3 shows the flow of our scheduling algorithm. The design is specified as a network of behavioral FSMs (BFSMs) [5]. The network includes the specification for each component and the connection pattern between components. Each BFSM includes internal timing constraints that are given directly by the designer or derived automatically by the compiler that produced the BFSM specification (such as PUBSS [6]). We must generate communication constraints which implement statically scheduled communication. Once all the constraints have been generated, we can solve the system to produce a feasible schedule for the network; the register-transfer implementation can be generated directly from the schedule and the BFSMs. This paper concentrates on the generation of the constraints for a network of BFSM components.
A major advantage of our technique is that it preserves the design's process partitioning. We want to efficiently implement the scheduled, register-transfer system; the designer's functional decomposition often provides a good starting point for efficient sequential, logic, and layout implementation. We do not want to lose the partitioning information during high-level synthesis: if we are given a partitioned design, we want to generate a partitioned design for sequential synthesis.
Previous work
Relatively little work discusses scheduling of multiple communicating processes, though several groups [7, 8] have identified specification and scheduling of distributed control as a critical research problem. Both Nestor [9] and Hayati [10] studied interface synthesis. In both cases, the designer was required to explicitly describe the scheduling constraints on each component: if A and B were connected, A's constraints on B must be placed in B, and B's constraints on A must be placed in A. Their methods gave techniques for describing and solving scheduling constraints. Borriello [11] synthesizes the interface transducer between two given interface specifications that include timing constraints such as set up and hold times. Scheduling of the interface specifications is not considered since both interface specifications correspond to already fully scheduled hardware components.
The Princeton University Behavioral Synthesis System (PUBSS) [6] used product machines to support full scheduling of arbitrary communicating machines. Unlike Nestor's and Hayati's work, scheduling constraints described in one process were automatically imposed on the communicating processes during generation of the product. Collapsing could also be used to impose external scheduling constraints described as independent components [1].
Previous work by Camposano [12] and Wakabayashi and Yoshimura [13] developed methods for scheduling operations with branches. Both those efforts developed algorithms to maximize datapath resource sharing across mutually exclusive branches. They restrict the scheduling freedom in the problem by fixing the times of branch operations: no operations are allowed to move from one side of the branch to the other. Since cycles in the behavior are broken before scheduling, constraints that would cross those cutting points (inter-loop-iteration constraints) cannot be taken into account. Finally, they consider only the order of operations, not constraints on the minimum or maximum times between operations.
Traditionally, behavioral representations primarily addressed timing constraints between operations that occur in the same basic block. An exception is the OEgraph representation of Amon et al. [14, 15], which can express the context under which timing constraints should hold through the use of a restricted set of first-order predicate calculus. OEgraphs can describe behaviors and constraints similar to those that can be described using BFSMs. The two most important similarities in the descriptive power of these two representations are the distinction between events and event instances and path-activated constraints. However, the OEgraph does not have a state-transition representation. Amon also does not define a product of two OEgraphs, which is important for communication scheduling. OEgraphs have not been used to solve the static communication scheduling problem described in this paper.
Both cyclic behavior and numeric bounds on times of operations are very important to control-dominated systems. For example, a controller's requirements often constrain the maximum allowable time interval between successive outputs in a loop; in this case, both a numeric bound on event times and a cyclic constraint are required. Our work directly schedules branches and handles cyclic constraints; it also handles constraints which give lower or upper bounds on the relative times of operations (precedence constraints are a special case of timing constraints). Antoniazzi et al. [16] co-design hardware-software systems using a product-machine scheme. They specify processes as partially scheduled state machines and form the product to find the system schedule; this scheme was the first communication scheduling method used by PUBSS [6].
Recent work by Filo et al. [17] addresses a somewhat different problem in the synthesis of communicating machines. Their algorithm uses an interface matching algorithm to classify communication as either dynamically or statically scheduled; they use a greedy algorithm to choose communication events which can be statically scheduled; they then generate a schedule which ensures correct behavior for both the dynamically and statically scheduled communication. Their algorithm works on a hierarchical control dataflow graph and cannot handle arbitrary control structures.
Yang et al. [18] represent constraints in control by building a non-deterministic finite automaton (NFA). They can schedule systems of machines by forming the product machine, which they represent as a BDD. However, some constraints can cause the non-deterministic machine to become very large; the scheduling constraints are represented only implicitly as paths through the NFA, not explicitly as in BFSMs, requiring algorithms to traverse long paths through the NFA to identify some constraints.

... fully orders the inputs and outputs in time is equivalent to a register-transfer FSM (RTFSM) implementation: an RTFSM's inputs and outputs are fully ordered in time.
A BFSM's external behavior is defined in terms of input and output events (I/O events). An event is a value of a pin for one clock cycle. Figure 4 illustrates a trace of a BFSM's execution. The trace shows a set of input events fed to the machine and the output events generated as a result; both input and output events are shown as ovals with the pin names and values of the event. The input and output events are partially ordered: the trace also shows constraints on the relative times of input and output events, as shown by the edges between events. A constraint is a linear inequality on the relative times of the events of the form $t(e_{sink}) \ge t(e_{source}) + c$, where the function $t(e)$ returns the time of the event $e$ in the trace, relative to the start of execution. The weight on the constraint arrow gives the value of the constant $c$. The constraint between the last two input events specifies that the third input i = 1 must occur at least two clock cycles after the second input i = 0. The two constraints on the first input (i = 0) and first output (o1 = 1) specify that both events should occur at the same clock cycle. A set of I/O events and constraints is called an I/O event constraint graph. The BFSM's execution model is similar to that of a traditional finite-state machine, except that the inputs and outputs are partially ordered.
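To make the constraint form concrete, the following minimal sketch checks a candidate set of event times against an I/O event constraint graph. The event names (`i1`, `o1`, `i2`, `i3`), the two candidate schedules, and the helper `violated` are hypothetical; they only loosely mirror the trace described above, since Figure 4 is not reproduced here.

```python
from typing import Dict, List, Tuple

# A constraint (source, sink, c) encodes t(sink) >= t(source) + c.
Constraint = Tuple[str, str, int]

def violated(times: Dict[str, int], constraints: List[Constraint]) -> List[Constraint]:
    """Return the constraints that the given event times do not satisfy."""
    return [(src, snk, c) for (src, snk, c) in constraints
            if times[snk] < times[src] + c]

# Hypothetical constraint graph: the first input and first output must coincide
# (two weight-0 edges, one in each direction), and the third input must follow
# the second input by at least two clock cycles.
constraints = [
    ("i1", "o1", 0), ("o1", "i1", 0),
    ("i2", "i3", 2),
]
good = {"i1": 0, "o1": 0, "i2": 1, "i3": 3}
bad = {"i1": 0, "o1": 1, "i2": 1, "i3": 2}

print(violated(good, constraints))  # no violations
print(violated(bad, constraints))   # the o1/i1 edge and the i2/i3 edge fail
```

Encoding "same clock cycle" as a pair of weight-0 edges keeps every constraint in the single inequality form used by the trace.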
While a trace illustrates the computational model of the BFSM, a more convenient representation of a BFSM is its behavior state transition graph (BSTG). An example BSTG is shown in Figure 5. Timing constraints have been intentionally omitted to emphasize the state-machine-like qualities of the BFSM. An event includes the name of the pin on which the event occurs. An input event is marked with a '?'; an output event includes the value placed on the pin. Each event is labeled with a unique identifier, e.g., e1, e2, etc. The current state accepts a set of input events and produces a set of output events (either set may be empty). All the transitions out of a state test the input pins at the same times, but expect different values to appear at those times. Therefore, Figure 5 uses a slightly different notation than did Figure 4. The BSTG gives '?' values for the values of input events; the values which trigger a transition are given as the input condition notation for that transition. Transitions out of each state are marked by a set of values for the input events: the values applied for the input events during execution determine which transition is taken out of the current state. One of the states is denoted as the reset state.
We must differentiate between events that are part of the BFSM specification and the instances of events that are produced during its execution. If we traverse a state transition graph repeatedly to generate a trace of BFSM behavior, we may visit the same I/O event several times in the BSTG; each separate visit creates a distinct I/O event instance in the trace. Figure 6 shows an expanded version of the execution trace given in Figure 4. Instances of events are denoted as the specification event annotated with a unique superscript index to differentiate them, e.g., $e1^i$.

Before discussing how timing constraints are specified, it is instructive to look at how a BFSM specification is implemented once it has been scheduled. Figure 7 shows a register-transfer finite-state machine (RTFSM) implementation for a feasible schedule of the BFSM shown in Figure 8 (same as Figure 5 with timing constraints included):
BFSM states S0, S1 and S2 map to RTFSM states s0, s1 and s2 respectively. Each BFSM transition $T$ takes $t_T$ clock cycles to complete, where $t_T$ is a non-negative integer value determined during scheduling. The implementation shown in Figure 7 is based on the following schedule: $t_{T4} = 2$, $t_{T5} = 2$, $t_{T1} = 1$, $t_{T2} = 1$ and $t_{T3} = 1$. Each BFSM event $e$ occurs $t_e$ clock cycles relative to its associated BFSM state, where $t_e$ is a non-negative integer value determined during scheduling. The implementation shown in Figure 7 is based on the following schedule: $t_{e1} = 0$, $t_{e2} = 0$, $t_{e3} = 0$, $t_{e4} = 0$, $t_{e5} = 0$. A single BFSM may have several different RTFSM implementations, corresponding to different schedules for the BFSM's events. These register-transfer implementations are not sequentially equivalent. They all, however, satisfy the behavior specified by the BFSM.
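As a rough illustration of how a schedule fixes the times of event instances, the sketch below walks a sequence of transitions, accumulating the transition delays to get the cycle at which each state is entered and adding the event offset within that state. The helper `instance_times` is hypothetical, and the attachment of e4 and e5 to the state looped by T4 is an assumption made for the example; only the schedule values come from the text.

```python
from typing import Dict, List, Tuple

def instance_times(path: List[Tuple[str, List[str]]],
                   t_T: Dict[str, int],
                   t_e: Dict[str, int]) -> List[Tuple[str, int]]:
    """Absolute times of event instances along a path.

    `path` lists (transition taken, events tied to that transition's source state).
    """
    now = 0  # clock cycle at which the current state is entered
    out = []
    for transition, events in path:
        for e in events:
            out.append((e, now + t_e[e]))  # event offset is relative to the state
        now += t_T[transition]             # the transition takes t_T cycles
    return out

# Schedule values quoted in the text.
t_T = {"T1": 1, "T2": 1, "T3": 1, "T4": 2, "T5": 2}
t_e = {"e1": 0, "e2": 0, "e3": 0, "e4": 0, "e5": 0}

# Assumed placement: e4 and e5 belong to the state on which T4 loops.
two_iterations = [("T4", ["e4", "e5"]), ("T4", ["e4", "e5"])]
print(instance_times(two_iterations, t_T, t_e))
```

Under these assumptions, successive instances of e5 are two cycles apart, which is the spacing fixed by $t_{T4} = 2$.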
More insights about the relation of the schedule and the implementation are provided in Section 6.2. The general procedure for the translation of a scheduled BFSM into an RTFSM is given elsewhere [5, 19]. Note that there is a close relation between the BFSM specification and its RTFSM implementation. Defining the schedule in terms of the $t_T$'s and $t_e$'s in effect restricts the set of possible implementations to a subset of all possible valid implementations. For instance, we could have considered an implementation where the loop given by T4 in Figure 5a is unrolled several times and scheduled differently for each iteration of the unrolled loop. In our approach we can obtain such an implementation by explicitly performing loop unrolling on the unscheduled BFSM.
Timing constraints express timing relationships between events that an implementation of the BFSM should obey during its execution. Timing constraints between instances of events are given as linear inequalities on the times of the event instances. For example, the inequality

$$t(e_2^j) \ge t(e_1^i) + W$$

constrains event instance $e_2^j$ to occur at least $W$ cycles after event instance $e_1^i$ takes place. Note that the above constraint relates the times of specific event instances and it is not meant to imply that the constraint should hold for all pairs of instances of events $e_1$ and $e_2$. To avoid any ambiguity about which instances of events are meant to be related by the constraint, constraints need to contain some information on the context that uniquely identifies the relationship between the event instances. Such a context is the execution path (sequence of transitions) that leads from $e_1^i$ to $e_2^j$, i.e., the constraint only takes effect between instances of $e_1$ and $e_2$ that obey the given path condition. Figure 8 shows the BFSM of Figure 5, but with timing constraints specified. Constraints are classified in two categories depending on whether or not they have a condition on their activation:
1. dataflow: timing constraints that are specified between events that don't cross conditional behavior. Figure 8a shows dataflow constraints between events e1 and e2 and between events e4 and e5. Dataflow constraints are unconditionally activated.
2. path-activated: constraints which cross a BFSM state and are activated by control conditions. Path-activated constraints must be annotated with a path activation condition as well as the integral constraint weight. For instance, the weighted edge $e5 \xrightarrow{-3;\,(e3;T5;T1)} e3$ translates into $t(e5^i) \le t(e3^j) + 3$, where $e5^i$ and $e3^j$ are event instances in the execution such that $e3^j$ occurs first and $e5^i$ is encountered after following transitions T5 and T1.
The path-activated constraints of the BSTG of Figure 8 are easier to visualize once the execution is unrolled as shown in Figure 9 (dataflow constraints are not shown). Superscripts are used to differentiate instances of each event produced during the unrolling process. Note that events over the same I/O port are implicitly constrained to occur in distinct clock cycles and thus such constraints are not explicitly specified.
Path-activated constraints allow us to be precise about cyclic constraints, e.g., the constraint $e5 \xrightarrow{2;\,(e5;T4)} e5$ is translated to $t(e5^j) \ge t(e5^i) + 2$ for event instances $e5^i$ and $e5^j$ that occur in contiguous iterations of the loop (transition T4).
The constraints in Figure 8 can be expressed as: ...

Figure 10: A network of BFSMs.
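The way a path condition selects which event instances a constraint relates can be sketched as follows. The trace, the helpers `activated_pairs` and `separations`, and the placement of e5 on the state looped by T4 are assumptions for illustration; only the cyclic constraint with weight 2 over path (e5; T4) and the schedule values $t_{T4} = 2$, $t_{T5} = 2$ are taken from the text.

```python
from typing import Dict, List, Tuple

# One step of an unrolled execution: (events of the current state, transition taken).
Step = Tuple[List[str], str]

def activated_pairs(trace: List[Step], first: str,
                    path: Tuple[str, ...], last: str) -> List[Tuple[int, int]]:
    """Step indices (i, j) where `first` at step i reaches `last` at step j via `path`."""
    n = len(path)
    pairs = []
    for i in range(len(trace) - n):
        taken = tuple(tr for _, tr in trace[i:i + n])
        if first in trace[i][0] and taken == path and last in trace[i + n][0]:
            pairs.append((i, i + n))
    return pairs

def separations(trace: List[Step], t_T: Dict[str, int], t_e: Dict[str, int],
                first: str, path: Tuple[str, ...], last: str) -> List[int]:
    """t(last) - t(first) for every instance pair that satisfies the path condition."""
    entry = [0]  # entry[k] is the cycle at which step k's state is entered
    for _, tr in trace:
        entry.append(entry[-1] + t_T[tr])
    return [(entry[j] + t_e[last]) - (entry[i] + t_e[first])
            for i, j in activated_pairs(trace, first, path, last)]

# Hypothetical unrolling: the loop T4 is taken twice, then T5 leaves the loop.
trace = [(["e5"], "T4"), (["e5"], "T4"), (["e5"], "T5")]
gaps = separations(trace, {"T4": 2, "T5": 2}, {"e5": 0}, "e5", ("T4",), "e5")
print(gaps, all(g >= 2 for g in gaps))
```

Only instance pairs in contiguous loop iterations are compared; instances two iterations apart are never matched because the path condition names a single T4.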
A BFSM network is an interconnection of BFSMs as shown in Figure 10. Communication between BFSMs occurs over the I/O ports that interconnect them. The execution of the BFSM network relies on the matching of output events with input events across machines. The process of matching is analogous to the way a send(x,1) and receive(x) communication occurs (as was shown in Figure 2), e.g., an output event "e1:x=1" matches an input event "e2:x?", with $e1 \in B_1$ and $e2 \in B_2$, provided a point in the network execution can be reached where BFSM $B_1$ produces an instance of e1 and $B_2$ produces an instance of e2. The matching event pair e1 and e2 will be shown as $\langle e1 \leftrightarrow e2 \rangle$.
As indicated above, BFSM I/O events are not buffered: an output value is held through exactly one clock cycle, as exemplified in the implementation of Figure 7. The values for the outputs o1, o2 and o3 are either don't cares (e.g., if they are connected to BFSM inputs) or can be set to a default value (e.g., if they drive register load inputs). I/O is buffered explicitly using BFSM registers. BFSM registers are mapped to registers in the RTFSM implementation. During execution, a BFSM register matches output events that set the register's value and maintains that value until its input matches a new value. A BFSM register matches any number of input events that request its value. Communication that occurs through a register is not implicitly synchronized, e.g., if BFSM A writes to register R which is read by BFSM B, there is no guarantee about the timing between writes by A and reads by B unless provided by synchronization through other communication transactions.
Scheduling communication in BFSM networks
Scheduling a network of communicating machines requires generating constraints to coordinate communication between components and constraints to satisfy the internal timing requirements for each component. Solving the system of inequalities gives times for all the events in each component machine, which is equivalent to a register-transfer implementation of each machine. Since the constraints are formulated in terms of the components' I/O events, the solution preserves the initial design partitioning.
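For intuition about what solving such a system involves, the sketch below handles only the special case where every inequality relates two event times, $t(e_{sink}) \ge t(e_{source}) + c$. It uses the textbook Bellman-Ford style longest-path relaxation; this is an assumption chosen for illustration, not the ILP formulation or the graph-based solver cited in the paper, and the function name `earliest_times` and the events a, b, c are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

def earliest_times(events: List[str],
                   constraints: List[Tuple[str, str, int]]
                   ) -> Optional[Dict[str, int]]:
    """Smallest non-negative times with t(sink) >= t(source) + c, or None if infeasible.

    Each constraint is (source, sink, c). An upper bound t(b) <= t(a) + k is
    written as the edge (b, a, -k).
    """
    t = {e: 0 for e in events}
    for _ in range(len(events)):
        changed = False
        for src, snk, c in constraints:
            if t[snk] < t[src] + c:
                t[snk] = t[src] + c  # push the sink later to satisfy the edge
                changed = True
        if not changed:
            return t
    # Still relaxing after len(events) passes: a positive-weight cycle exists.
    return None

# b at least 2 after a, c at least 1 after b, and c at most 3 after a.
feasible = earliest_times(["a", "b", "c"], [("a", "b", 2), ("b", "c", 1), ("c", "a", -3)])
# Tightening the upper bound to 2 contradicts the two lower bounds.
infeasible = earliest_times(["a", "b", "c"], [("a", "b", 2), ("b", "c", 1), ("c", "a", -2)])
print(feasible, infeasible)
```

Constraints that sum several $t_T$ and $t_e$ variables on each side are not of this two-variable form and would need a general ILP treatment.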
To understand how communication can be identified and formulated as a scheduling problem, consider the network shown in Figure 10. Some of the components in the network represent behavior that we would like to synthesize (M1, M2 and M3) and some of the components correspond to timing behavior models that capture the interface behavior of already existing hardware (IF1 and IF2).
Unless an I/O signal is explicitly buffered, an I/O event lasts for exactly one clock cycle; an output event and its matching input event must be scheduled to occur in the same clock cycle. Communication constraints are added to the system of constraints so that any feasible solution to that system statically synchronizes communication whenever possible. Our approach eliminates the need for manual synchronization and/or extensive use of dynamic synchronization (handshake) to resynchronize at every point of communication. Figures 11, 12 and 13 show the three distinct cases that must be considered. We have two ...

Given that events e1 and e6 are synchronized, the constraint needed to ensure that the communication $\langle e3 \leftrightarrow e8 \rangle$ is synchronized relative to $\langle e1 \leftrightarrow e6 \rangle$ is:

$$(t_{T1} - t_{e1}) + t_{T2} + t_{e3} = (t_{T6} - t_{e6}) + t_{T7} + t_{e8}.$$
The three additional equations

$$(t_{T1} - t_{e1}) + t_{T3} + t_{e4} = (t_{T6} - t_{e6}) + t_{T7} + t_{e8},$$
$$(t_{T4} - t_{e3}) + t_{e5} = (t_{T8} - t_{e8}) + t_{e9},$$
$$(t_{T5} - t_{e4}) + t_{e5} = (t_{T8} - t_{e8}) + t_{e9}$$

guarantee that the communication $\langle e4 \leftrightarrow e8 \rangle$ is synchronized relative to $\langle e1 \leftrightarrow e6 \rangle$, and that $\langle e5 \leftrightarrow e9 \rangle$ is synchronized relative to both $\langle e3 \leftrightarrow e8 \rangle$ and $\langle e4 \leftrightarrow e8 \rangle$. By transitivity, the constraints above guarantee that the communication $\langle e5 \leftrightarrow e9 \rangle$ is also synchronized relative to $\langle e1 \leftrightarrow e6 \rangle$.

Figure 12 corresponds to the case where the path between e1 and e2 in B1 contains no unbounded delays but the path between e3 and e4 may have an unbounded delay due to loops that are dependent on primary inputs. We assume that the specification for B1 has a busy wait (T3) to resynchronize with B2. We need to enforce that the busy wait starts no later than the earliest time that e4 provides the "1" for which we are waiting. The two constraints

$$t_{T3} = 1, \qquad t_{e2} = 0$$

guarantee that we check x every clock cycle to make sure the event x = 1 is not missed. In addition, we need to ensure that the busy wait is initiated no later than the earliest time that e4 could occur, i.e., for every acyclic path (no state is visited more than once) between the two states shown for B2 we need a constraint

$$(t_{T1} - t_{e1}) \le t(P) - t_{e3} + t_{e4},$$

where $P$ is the acyclic path (set of transitions) and $t(P) = \sum_{T \in P} t_T$. In all three cases we go from any synchronization point between B1 and B2 to the next synchronization point along any possible path of the execution that B1 and B2 follow.
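Each side of the synchronization equations is the number of cycles one machine spends between two consecutive communication events. The sketch below evaluates that quantity for a given schedule; the helper `offset_between` and all numeric schedule values are hypothetical, while the event and transition names follow the equations for Figure 11.

```python
from typing import Dict, List

def offset_between(schedule: Dict[str, int], first_event: str,
                   path: List[str], next_event: str) -> int:
    """Cycles from `first_event` to `next_event` along `path`.

    path[0] leaves the state holding `first_event`; `next_event` belongs to
    the state reached by the last transition in `path`.
    """
    head, *rest = path
    return ((schedule[head] - schedule[first_event])
            + sum(schedule[t] for t in rest)
            + schedule[next_event])

# Hypothetical schedule values for the two machines.
b1 = {"T1": 2, "T2": 1, "T4": 1, "e1": 1, "e3": 0, "e5": 1}
b2 = {"T6": 1, "T7": 1, "T8": 3, "e6": 0, "e8": 1, "e9": 0}

# <e3 <-> e8> relative to <e1 <-> e6>
lhs = offset_between(b1, "e1", ["T1", "T2"], "e3")
rhs = offset_between(b2, "e6", ["T6", "T7"], "e8")
print(lhs, rhs, lhs == rhs)

# <e5 <-> e9> relative to <e3 <-> e8>
print(offset_between(b1, "e3", ["T4"], "e5") == offset_between(b2, "e8", ["T8"], "e9"))
```

When both comparisons hold, each matched pair falls in the same clock cycle on both machines, so no handshake is needed between them.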
We assume that the designer has placed busy waits where they are required to perform dynamic synchronization. Our algorithm identifies the busy waits and determines which of the three cases described applies. Static synchronization saves on the cost of handshaking signals but may introduce an excess of states. The designer may choose to use dynamic synchronization instead of static synchronization when deemed more efficient.
The algorithm for collecting communication constraints can be summarized as: for every pair of component BFSMs $(B_i, B_j)$ in the network, collect all constraints that relate any two consecutive communications along every possible combination of paths for $B_i$ and $B_j$ that could occur during network execution. We will describe this process in more detail in Section 7.
It is important to note that "executing" the network to expose all possible interaction between the component BFSMs is a task that is also required when performing BFSM network collapsing. A network could be scheduled by first collapsing the components, then scheduling the resulting single machine. However, collapsing destroys the designer's partitioning information. Furthermore, full network collapsing requires the additional step of finding a BFSM which captures the network I/O behavior. Finding the equivalent BFSM description for components that are loosely coupled and have loosely specified timing is in general hard and yields a complex and large BFSM description. When there is no synchronization between two BFSMs, the joint behavior may very well be impossible to represent as a single BFSM, e.g., consider two concurrent, unsynchronized loops represented by transitions T1 and T2 and timing constraints that bound $t_{T1}$ by 2 and $t_{T2}$ by 5. The technique described here finds a feasible network schedule while maintaining the original partition intact.

Figure 14 shows how internal constraints are generated. Dataflow constraints are added with no changes to the set of generated constraints. Path-activated constraints are tested for their activation during network execution and those which are activated are added to the set of generated constraints. Structural constraints enforce certain properties of the schedule that ensure that the RTFSM implementation has a close structure to the BFSM. Structural constraints are automatically generated to relieve the designer from the task of specifying them explicitly.
Path-activated constraints
As explained in Section 4, path-activated constraints are easily translated into inequalities in terms of relative schedule times of events ($t_e$'s) and of transitions ($t_T$'s). However, some behavior in a component machine may never be exercised when operating with the remaining machines. Constraints that are conditioned on behavior that cannot be exercised are ignored since they would unnecessarily restrict the solution space of feasible schedules.
Structural Constraints
Transitions in the BFSM model give a flow of execution and do not necessarily require time to complete in an implementation of the behavior. A sequence of n transitions with no events should not require n clock cycles. For instance, in Figure 15, even though input events e1 and e2 are separated by a behavioral state, they could still occur at the same time. Such a BFSM may be generated by syntax-directed translation of nested if statements. The RTFSM realization is shown in the same figure. Note that behavioral states S0 and S1 are coalesced into state s0:s1 in the RTFSM implementation since $t_{T1} = 0$.
In this case event e1 spilled to the next state S1 because $t_{e1} \ge t_{T1}$. Whenever such a condition is true, overlapping between transitions occurs. A more complicated situation arises if events are scheduled such that they spill over into a join state. Such a situation is depicted in Figure 16, where $t_{e1} = t_{T1}$ and states S0 and S1 coalesce into state s0:s1. To preserve the intended behavior, state S1 is split into state s1a and the one that is part of s0:s1.
Consider the example in Figure 17, where we have a BSTG for a loop that has events a, b and c. If we relax the requirement that all the events should finish before we go to the next state, we can schedule this BFSM so that a new iteration of the loop is initiated before the previous iteration has completed. Figure 17 shows the unrolled implementation for the indicated schedule. The RTFSM implementation is structurally different from the BFSM. The complexity of the RTFSM implementation increases as the number of overlapped loop iterations increases.
The last two examples are special cases that involved some sort of transformation such as state splitting or loop unrolling. In this paper, we will assume that the designer would like to preserve the structure of the BFSM and we add structural constraints that prevent events from moving across join states. The added constraints are: ... where $T_{Join}$ is the set of transitions whose sinks are join states (states with 2 or more incoming transitions). These structural constraints are implicitly inferred from the original BFSM description and are added during the scheduling phase.
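The set of transitions entering join states can be read directly off the BSTG. The following minimal sketch computes it for a small, hypothetical state graph; the function name `join_transitions`, the dictionary encoding, and the states S0 to S3 are assumptions, and the structural constraints themselves are not modelled.

```python
from collections import Counter
from typing import Dict, Set, Tuple

def join_transitions(transitions: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Transitions whose sink state has two or more incoming transitions.

    `transitions` maps a transition name to its (source state, sink state).
    """
    incoming = Counter(dst for _, dst in transitions.values())
    return {name for name, (_, dst) in transitions.items() if incoming[dst] >= 2}

# Hypothetical BSTG: a branch at S0 whose two arms rejoin at S3.
bstg = {
    "T1": ("S0", "S1"),
    "T2": ("S0", "S2"),
    "T3": ("S1", "S3"),
    "T4": ("S2", "S3"),
}
print(sorted(join_transitions(bstg)))
```

A self-loop counts as an incoming transition in this sketch, so a loop state with one external entry is also treated as a join state.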
Constraint generation
As mentioned in Section 5, communication constraints are generated by walking through the combined execution paths of the communicating components. In order to generate communication constraints, we need to find the following information about how each pair of BFSMs $(B_i, B_j)$ interacts while executing in the network:
Find the set of communication (matching) pairs $M(B_i, B_j) = \{\langle e_n \leftrightarrow e_k \rangle \mid e_n \in B_i,\ e_k \in B_j\}$.
For each communication pair $\langle e_n \leftrightarrow e_k \rangle \in M(B_i, B_j)$, find the set $N(\langle e_n \leftrightarrow e_k \rangle)$ of communication pairs that immediately follow during network execution and are not scheduled dynamically. For example, in Figure 11, $N(\langle e1 \leftrightarrow e6 \rangle) = \{\langle e3 \leftrightarrow e8 \rangle, \langle e4 \leftrightarrow e8 \rangle\}$; in Figure 12, $N(\langle e1 \leftrightarrow e3 \rangle) = \{\langle e2 \leftrightarrow e4 \rangle\}$; and in Figure 13, $N(\langle e1 \leftrightarrow e4 \rangle) = \emptyset$.
For all $\langle e_n \leftrightarrow e_k \rangle, \langle e_m \leftrightarrow e_l \rangle \in M(B_i, B_j)$ such that $\langle e_m \leftrightarrow e_l \rangle \in N(\langle e_n \leftrightarrow e_k \rangle)$, find the set $P(\langle e_n \leftrightarrow e_k \rangle, \langle e_m \leftrightarrow e_l \rangle)$ of all possible path pairs $(p_i, p_j)$, where $p_i$ and $p_j$ are acyclic execution paths (no state is visited more than once) for $B_i$ and $B_j$, respectively, that lead from the input-output pair $\langle e_n \leftrightarrow e_k \rangle$ to $\langle e_m \leftrightarrow e_l \rangle$.
Figure 20: Jointly executing the two machines to collect constraints.
Figure 18 shows the paths that lead from the input-output pair $\langle a \leftrightarrow x \rangle$ to its consecutive communication pair $\langle b \leftrightarrow y \rangle$. Transitions that are shown in dashed (solid) lines in B1 are consistent in the network execution with transitions shown in dashed (solid) lines for B2. Not all combinations of paths need be consistent, since machines may influence each other or may be influenced by common inputs. For this example, $P(\langle a \leftrightarrow x \rangle, \langle b \leftrightarrow y \rangle) = \{(T1\,T2\,T3,\ T9\,T10\,T11),\ (T4\,T5,\ T6\,T7\,T11),\ (T4\,T5,\ T8\,T11)\}$.
Finding the sets described above requires an exhaustive execution of the network. While generating the information above, the path-activation condition of each constraint is also tested as described in Section 6.1. Such an exhaustive execution would also be used for validating the correctness of the design, e.g., checking for deadlocks. We first present an example to illustrate the constraint generation algorithm and then present a more general description of the algorithm.
An example
Assume we have two communicating BFSMs as shown in Figure 19. Signal a connects both machines. Events are annotated with unique names e0, e1, e2, e3, e4 to refer to them in a concise and unambiguous manner. Machine B1 has conditional behavior that depends on the primary input i.
Given that both machines are in their reset state, we start to jointly execute both machines as shown in Figure 20. Superscripts are used to differentiate instances of events. Initially, there are ... Since in each path of execution we have encountered a matching that was already seen before (a matching of e1 or e2 with e3), we terminate the search for communication constraints. When we encounter an instance of e2, its path-activated constraint is passed down the execution tree for testing. The path-activated constraint that originates from $e2^0$ is carried down the path T2, T1, T2, T1 until it reaches $e2^6$. It is then activated and its constraint $2 t_{T1} + 2 t_{T2} \geq 6$ is added to the system of constraints. Since all the path-activated constraints are activated, the testing for them is terminated along all paths. Figure 21 shows a feasible solution for the set of constraints generated.
Figure 21: A feasible schedule.
Machine   Scheduled time
B1        $t_{e0} = t_{e1} = t_{e2} = 0$;  $t_{T0} = t_{T1} = 0$;  $t_{T2} = t_{T3} = 3$
B2        $t_{e3} = t_{e4} = 0$;  $t_{T4} = 6$
Figure 22 shows the general flow of the algorithm for generating constraints. Each component is first traversed to collect its data-flow constraints and generate its structural constraints as described in Section 6.2. The network is then executed to test which path-activated constraints are to be included in the system of constraints, and all communication constraints are also generated. Execution of the network starts at the reset state of the network and proceeds with the computation of all the possible next states of the network: match events, feed primary inputs, fire machines that can fire (I/O events consumed, transition condition satisfied). Each computed next state is placed in states_to_traverse as long as the function terminate_path() returns FALSE, indicating that we should still search along the current path to generate communication constraints or to test path-activated constraints. Termination conditions are described below.
In the current implementation of our algorithm, states_to_traverse is a queue and the function next(states_to_traverse) returns the network state at the head of the queue. In this case we are doing a breadth-first execution of the network. Depth-first execution could have been chosen instead. The user may choose to write a BFSM model that feeds the original network and exercises the network behavior known to be of interest.
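The queue-driven execution just described can be sketched in a few lines. This is a minimal illustration, not the PUBSS implementation: the callbacks `next_states` and `terminate_path` are hypothetical stand-ins for the next-state computation and the termination test, and the toy graph at the end is invented purely to exercise the loop.

```python
from collections import deque


def explore_network(reset_state, next_states, terminate_path):
    """Breadth-first execution of a network, starting at its reset state.

    next_states(state)    -> iterable of successor network states (hypothetical callback)
    terminate_path(state) -> True when the search along this path may stop (hypothetical)
    Returns the network states in the order in which they were expanded.
    """
    states_to_traverse = deque([reset_state])  # FIFO queue gives breadth-first order
    expanded = []
    while states_to_traverse:
        state = states_to_traverse.popleft()   # head of the queue
        expanded.append(state)
        for successor in next_states(state):
            # A successor is queued only while the search should continue.
            if not terminate_path(successor):
                states_to_traverse.append(successor)
    return expanded


# Toy stand-in for a network: integer states connected in a small cyclic graph.
toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
seen = {0}


def toy_terminate(state):
    # Stop when the state was visited before; otherwise remember it.
    if state in seen:
        return True
    seen.add(state)
    return False


order = explore_network(0, lambda s: toy_graph[s], toy_terminate)
print(order)
```

Replacing `popleft()` with `pop()` turns the queue into a stack and yields a depth-first execution instead.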
The next "state" of the network, $S^+$, is a concatenation of the information in the "state" of each machine in the network. The state of each machine consists of the behavioral state, the status of each event (matched or not matched), and the value of each matched input. The next state of the network is captured only when one or more machines change behavioral states. Note that we do not model datapath components such as registers using BFSMs (except to model their timing). Datapath components can be interconnected with a network of BFSMs, but during constraint generation we treat inputs that come from datapath components (such as an input that comes from a functional unit that compares the values in two registers) as primary inputs. A subset of the network states is remembered (recorded) for the purpose of identifying whether the network has already been in a given state. Network states only need to be recorded when a new matching occurs; e.g., in Figure 18 we are only concerned with the network states corresponding to $\langle a \leftrightarrow x \rangle$ and $\langle b \leftrightarrow y \rangle$, and with capturing the paths traversed to construct the set $P(\langle a \leftrightarrow x \rangle, \langle b \leftrightarrow y \rangle)$.
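One way to make such network states comparable, so that a revisit can be detected, is to store them as immutable, hashable values. The sketch below assumes that representation; the class and field names (`MachineState`, `event_matched`, `input_values`) and the sample state names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MachineState:
    """State of one BFSM during network execution (field names are illustrative)."""
    behavioral_state: str
    event_matched: Tuple[Tuple[str, bool], ...]  # (event name, matched?) pairs
    input_values: Tuple[Tuple[str, int], ...]    # (input name, value) for matched inputs


def network_state(*machines: MachineState) -> Tuple[MachineState, ...]:
    """Concatenate the per-machine states into one hashable network state."""
    return tuple(machines)


recorded = set()
first_visit = network_state(
    MachineState("S0", (("e1", True),), (("i", 1),)),
    MachineState("S4", (("e3", True),), ()),
)
recorded.add(first_visit)

# Reaching an identical configuration later is recognized as a revisit.
second_visit = network_state(
    MachineState("S0", (("e1", True),), (("i", 1),)),
    MachineState("S4", (("e3", True),), ()),
)
already_seen = second_visit in recorded
```

Because the dataclass is frozen, equality and hashing are derived from the field values, so two independently built but identical states compare equal.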
Network execution is somewhat similar to the simulation of communicating VHDL processes in that the processes take turns being simulated. In VHDL, a process is executed until a wait statement is encountered, at which time the process is suspended and another process is reactivated. In our case, component BFSMs also take turns being executed until a communication I/O event instance that does not have a corresponding matching I/O event is encountered. At that point, execution is transferred to the BFSM that can supply the matching event instance.
Consecutive communication pairs (consecutive in some execution path) are related as described in Section 5. A necessary and sufficient condition for finding all the pairs in $M(B_i, B_j)$ is that all the behavior of the network be explored (executed) at least once. For each communication pair $\langle e_n \leftrightarrow e_k \rangle$, we need to explore all joint execution paths $(P_i, P_j)$ that lead to the next communication pairs. Some execution paths may not have a next communication pair, e.g., when they enter a cycle in the BFSM that has no communication events. Such cycles are easily detected by saving the execution path of each BFSM since the last communication event. Whenever such a cycle is encountered, the number of iterations that the loop will execute is data dependent and the total delay is potentially unbounded. For any pair of communicating machines $(B_i, B_j)$, the collection of communication constraints can be terminated along the current path provided the following conditions are satisfied:
The current network state has been visited before.
A matching pair for $(B_i, B_j)$ is encountered, or the current path completes a cycle without communication between $B_i$ and $B_j$.
Testing of the activation condition for each path-activated constraint entails checking whether its condition path is ever exercised during network execution. Testing is initiated whenever the event that originates the constraint is encountered during "yet unexplored" network execution. The process of testing for path-activation consists of comparing the condition path (a list of transitions) to the current transition being executed. If the comparison fails, then the test fails along the current path. Otherwise, the constraint is passed down the execution for further testing until the condition path has been exhausted, in which case we have verified that the constraint is activated. Once the activation condition of a constraint has been confirmed, further checking for that constraint is not necessary and the constraint is added to the system of generated constraints. If all the initiated checks for a path condition of a constraint fail, the constraint is not included in the system of generated constraints.
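The path-activation test described above amounts to consuming the condition path one transition at a time. The following sketch illustrates that idea for a single execution trace; the function names and the string status values are assumptions made for the example, and a non-empty condition path is assumed.

```python
def advance_test(remaining_path, transition):
    """Advance one pending path-activation test by one executed transition.

    remaining_path: tuple of transition names still to be matched (non-empty).
    Returns ("failed", ()), ("activated", ()), or ("pending", rest_of_path).
    """
    if not remaining_path or remaining_path[0] != transition:
        return ("failed", ())          # comparison failed along this path
    rest = remaining_path[1:]
    if not rest:
        return ("activated", ())       # condition path exhausted
    return ("pending", rest)           # pass the constraint further down


def is_activated(condition_path, executed_transitions):
    """Test one trace, starting right after the originating event."""
    remaining = tuple(condition_path)
    for transition in executed_transitions:
        status, remaining = advance_test(remaining, transition)
        if status != "pending":
            return status == "activated"
    return False  # trace ended before the condition path was exhausted


# Condition path taken from the example: T2, T1, T2, T1.
condition = ["T2", "T1", "T2", "T1"]
full_match = is_activated(condition, ["T2", "T1", "T2", "T1", "T2"])
mismatch = is_activated(condition, ["T2", "T1", "T3"])
too_short = is_activated(condition, ["T2", "T1"])
```

In a full network execution the `("pending", rest)` value would be carried along each branch of the execution tree, so one originating event can spawn several independent tests.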
In summary, the current execution path can be terminated provided the termination condition for collecting communication constraints is satisfied for every pair of communicating machines in the network and there are no path-activated constraints in the process of being tested in any of the machines. If the network is ill-specified (e.g., the component machines deadlock in some execution trace), we can use heuristics such as iteration counts to terminate the search.
Solving the generated constraints
The set of inequalities and equalities generated in previous sections can be expressed as an integer linear programming problem. We can express all the generated constraints in the form
$A_1 x \geq b_1$, $A_2 x = b_2$, $x \geq 0$,
where $x = [x_0\ x_1\ \ldots\ x_{n-1}]^T$; each $x_i$ is a variable denoting a time for an event or transition as described in Section 5. $A_1$ and $A_2$ are matrices of integer coefficients, and $b_1$ and $b_2$ are integer vectors. Usually matrices $A_1$ and $A_2$ will consist of mostly 0s and 1s. We may just want to find a feasible solution, or we may want to find an optimal solution based on a linear cost function, $\min \langle c, x \rangle$, where c is a vector of weights. We could choose to give a higher priority to minimizing the duration of some transitions that are on some critical path. In many cases the variables turn out to be integer when using a regular LP solver such as the simplex method [20]. However, certain types of constraints may produce non-integer solutions from a regular LP solver. For instance, the path-activated constraint $e \xrightarrow{3,\ (e;T;T)} e$ translates into the constraint $2 t_T \geq 3$ and produces the non-integer value $t_T = 1.5$ when the objective function is set to minimize $t_T$. In all the examples that we ran, the regular LP solver produced integer results.
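The non-integrality noted above can be worked through by hand for the single constraint $2 t_T \geq 3$. The following is a small illustration, assuming $t_T$ is the only variable and the objective is simply to minimize it:

```latex
\begin{align*}
\text{LP relaxation:}\quad & \min\, t_T \ \text{ s.t. } 2 t_T \geq 3,\ t_T \geq 0
  && \Rightarrow\ t_T^{*} = \tfrac{3}{2} = 1.5, \\
\text{Integer program:}\quad & \min\, t_T \ \text{ s.t. } 2 t_T \geq 3,\ t_T \in \mathbb{Z}_{\geq 0}
  && \Rightarrow\ t_T^{*} = \lceil \tfrac{3}{2} \rceil = 2.
\end{align*}
```

Rounding the LP value up recovers the integer optimum in this one-variable case. With several coupled variables, rounding is not guaranteed to preserve feasibility, which is why an ILP solver may be needed when such constraints are present.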
The parameters that can be used to characterize the complexity of solving the system of constraints are the number of variables, the number of constraints, and the average number of variables that appear in each constraint.
The total number of variables in the system of constraints for statically scheduling a network N is the number of event and transition times being scheduled, since each variable denotes the time of one event or transition. The structure of the system of generated constraints could be exploited to come up with faster and more efficient algorithms to solve it. A possible line of attack is to first find a feasible solution for each component (e.g., using an ILP solver) and then use iterative techniques to find a solution which also satisfies the communication constraints.
Adding resource constraints
So far we have not dealt with the allocation of resources: all the resources are implied by the events, so if we have three events that each require an adder unit and they are all scheduled in the same clock cycle, then we need at least three adders. If we allow 0-1 integer linear programming, we can add resource constraints in a fashion similar to that of [21]. We will briefly describe how constraints can be added to reflect the cost of functional units or a limitation on the number of functional units. The first inequality says that the number of events that require $FU_k$ scheduled in any given cycle cannot exceed $M_{FU_k}$. The second equation guarantees that event $e_i$ will be scheduled at only one step, as described by the variables $x_{e_i,j}$. The third equation relates the time of event $e_i$ to the corresponding variables $x_{e_i,j}$.
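The three relations described above are not reproduced in this text. One standard 0-1 formulation that matches the verbal description, offered as an assumption rather than as the authors' exact equations, uses a binary variable $x_{e_i,j}$ that equals 1 when event $e_i$ is scheduled at step $j$, with $E_{FU_k}$ denoting the set of events that require functional unit type $FU_k$:

```latex
\begin{align}
\sum_{e_i \in E_{FU_k}} x_{e_i,j} &\leq M_{FU_k}
  && \text{for every step } j \text{ and every unit type } FU_k, \\
\sum_{j} x_{e_i,j} &= 1
  && \text{for every event } e_i, \\
\sum_{j} j \, x_{e_i,j} &= t_{e_i}
  && \text{for every event } e_i, \qquad x_{e_i,j} \in \{0, 1\}.
\end{align}
```

In this sketch the range of the step index $j$ is left open; in practice it would be bounded by the interval over which each event is allowed to move, which is what keeps the number of 0-1 variables manageable.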
It is important to note that if we want to share functional units across machines, then we need to consider every trace of network execution where events $e_i \in E_{FU_k}$ may overlap when scheduled.
Each of the traces then generates constraints as described above.
In general, the large number of 0-1 variables required may limit the use of the proposed algorithm to problems where the number of functional units of any kind that could overlap at any time is small or where these regions of overlap are small (restricted movement of operations). We expect control-dominated designs to have these characteristics.
Results
Scheduling of communicating FSMs is useful whenever tightly-coupled machines are used to implement control. In some cases, such as an Ethernet controller, multiple controllers are buffered by queues, leaving few opportunities for static scheduling. However, many machines use tightly-coupled controllers. For example, pipelined datapaths, such as MIPS-X [22], often use distributed control which must be statically scheduled. Processing elements in neural network array processors are built from multipliers, registers, and control. The controller, since it must implement the non-linear neuron function, must be tightly coupled to the multiplier. When multipliers such as the AMD Am29C327 are used, which allow both combinational and pipelined operation, the control scheduling changes with the multiplier configuration. The tightly-coupled controllers need not be on the same chip to be scheduled by our algorithm. For example, a queue processor [3] includes two controllers, one for enqueueing and another for dequeueing, which are only loosely coupled to each other, but each of them requires tight coupling to the external environment. Constraints on the timing of data and control signals can be specified as separate BFSMs with equality constraints; when scheduled simultaneously with the queue processor controllers, those external constraints are imposed on the chip design.
We have implemented our constraint generation methods as a part of the Princeton University Behavioral Synthesis System (PUBSS). The example ex2 (Figure 17) is a small synthetic example. We also ran three additional complete examples: one for a queue machine based on a commercial ASIC [3], one for a packet receiver, and one for part of the i8251 USART. We also created three large synthetic examples to test the algorithm: ex8 has a relatively large number of states with relatively simple control flow between them; both ex9 and ex10 have more complicated control flow.
We show the total number of states, events and constraints per component machine. We also show the total number of constraints (data flow, path-activated, communication) that were collected for the ILP problem, and the CPU times on an SGI Indigo workstation for the constraint collection and solution phases of scheduling.
We used a very unsophisticated simplex LP program to solve the systems of constraints. The cost functions were set to schedule events as early as possible. Separately, Yen and Wolf [4] have developed a graph-based algorithm for the solution of constraints in BFSMs. Because this algorithm takes advantage of the sparsity of typical examples, it can solve large systems of constraints much more quickly than standard ILP packages can.
Figure 23 shows the relationship between the number of constraints generated in the ILP and the CPU time required to collect those constraints. The plot shows that while the number of constraints has some relationship to the CPU time, it is not a strict predictor. When the product state space of a network of FSMs is traversed, the size of the space depends on the structure of each machine and their interactions, so the complexity of the space is hard to predict. A quick perusal of Table 1 shows that the number of states in the BFSMs is not a reliable predictor of the constraint generation time, since ex8 has many states but could be quickly traversed due to the simple structure of the component machines. However, in many cases, the number of ILP constraints did correspond roughly to the CPU time required to generate those constraints. The CPU times for constraint generation were significant but within the range of interactive performance, and exhibited a superlinear but reasonable trend for a variety of metrics of problem size (number of states, number of constraints, etc.). Recently proposed algorithms for scheduling via BDDs [23, 24] have shown CPU times in the hundreds to thousands of CPU seconds for examples with one component machine.
The interaction between two communicating machines may in general exhibit phases of tightly coupled interaction, characterized by frequent communication, and phases of loose or no interaction, where both machines are executing almost or completely independently. The equivalent BFSM representation for loosely-coupled concurrent behavior yields large next-state functions, since it has to capture concurrent conditional behavior which is only synchronized at a few communication points. Collapsing can only deal with tightly coupled machines: the less coupled the machines are, the larger and more cumbersome the representation of the equivalent network behavior becomes. Results for collapsing have been presented elsewhere [1]. In that work, collapsing was used for propagating interface timing constraints to the machine under design. The interface descriptions in those examples were small and were characterized by tightly-coupled communication.
The scheduling algorithm presented here does not suffer from the limitation just described, since we are only concerned with the points of communication and do not need to accurately represent the state of the network or the transitions between states in the network. While we may search many product states during communication constraint generation, those states do not appear in the constraints handed to the ILP algorithm. Thus, our algorithm partially avoids state explosion problems.
Conclusions
This work introduces new methods for the generation of constraints in systems of communicating machines. The constraints which govern communication between component machines can be generated efficiently, as shown by the results presented in the last section. The most important limitation of this scheduling technique is that it relies on integer linear programming. While ILP is general, it can be very slow in the worst case. We have shown the cases for which constraints inducing non-integral solutions occur; these cases occur relatively infrequently. When such constraints are avoided, ILP packages tend to run quickly, as shown by our experiments. As mentioned previously, related work on graph-based algorithms for scheduling constraints in state transition graphs shows that industrial-strength scheduling problems can be solved efficiently.
In Section 9 we saw that resource constraints could be handled with a 0-1 ILP formulation.
Exclusivity timing constraints ($y \geq x + 2$ or $y \leq x - 2$) can also be dealt with in a similar fashion using a 0-1 ILP formulation. However, the large number of variables required would make this approach too slow if a large number of exclusivity constraints is present.
Our new approach keeps the designer's partitioning intact, avoiding the potential size explosion incurred by the previous algorithm, which involved collapsing of the network. Using this approach, we accurately model complex control and communication. By carefully generating a small number of constraints, we can speed up solution time while still satisfying the system behavior.
