Abstract: The scope of most high-level synthesis efforts to date has been at the level of a single behavioral model represented as a control/data-flow graph. The communication between concurrently executing processes and its requirements in terms of timing and resources have largely been neglected. This restriction limits the applicability of most existing approaches for complex system designs. This paper describes a methodology for the synthesis of interfaces in concurrent systems under detailed timing constraints. We model inter-process communication using blocking and non-blocking messages. We show how the relationship between messages over time can be abstracted as a constraint graph that can be extracted and used during synthesis. We describe a novel technique called interface matching that minimizes the interface cost by scheduling each process with respect to timing information of other processes communicating with it. By scheduling the completion of operations, some blocking communication can be converted to non-blocking while ensuring the communication remains valid. To further reduce hardware costs, we describe the synthesis of interfaces on shared physical media. We show how this sharing can be increased through rescheduling and serialization of the communication. In addition to systematically reducing the interface synchronization cost, this approach permits analysis of the timing consistency of inter-process communication.
I. Introduction
Past efforts in high-level synthesis have focused primarily on the synthesis of a single process [1, 2, 3, 4]. Under this assumption, hardware behavior can be represented as a control-flow and/or data-flow graph, and tasks such as scheduling and binding are defined with respect to operations within a single process. While this assumption is adequate for uniprocessor synthesis, it is less effective in synthesizing more complex circuits and systems that are modeled best as multiple concurrent and interacting processes.
There are many examples of designs consisting of multiple interacting processes. Consider for instance a graphics enhancement unit modeled by two processes: an edge detection process and an image enhancement process. The edge detection process takes as input a stream of data representing the incoming image and detects its edge boundaries. This boundary information is then passed to the second process to be used for enhancement of the image via shading or highlighting. Figure 1 shows a block diagram of this design. In general, the timing with which the edge detection process produces results is unknown, and therefore blocking communication is needed to coordinate its operation with its receiving process. However, it is possible in some cases to take advantage of the timing characteristics of the receiving process to reduce the amount of handshaking, and vice versa. In this case the combination of the sender and receiver allows the communication between them to be matched together, such that the result is a less general but simpler interface. The ability to exploit timing behavior of other processes during synthesis is key to obtaining an efficient implementation. Other examples include designs that interface under a given protocol (e.g., NuBus or EISA bus) and telecommunication applications. In these cases it is possible to specify generic interfaces without exact timing information using blocking communication. When such a process is synthesized in a system with a particular bus model, the timing information from the bus model is reflected back to the generic interface and the communication can be simplified by specializing it for the particular application.

Manuscript received Dec. 16, 1992; revised April 14, 1993. This work was supported by NSF/ARPA, under grant MIP-8719546, by DEC jointly with NSF, under a PYI Award program. Claudionor N. Coelho, Jr. was supported by CNPq-Brazil under contract 200212/90.7. The authors are with the Center for Integrated Systems, Stanford University, Stanford, CA 94305.
Modeling a system as a collection of concurrently executing processes poses additional challenges to a synthesis system. In particular, synthesizing one process can in general alter the way it communicates with its environment. These changes in turn affect and constrain the synthesis of other processes in the system. The correctness of a design depends not only on the correctness of its data computations, but also on the timing and synchronization requirements that define when these results are communicated to and from the external environment. Of critical importance are the analysis and synthesis of the interfaces between the processes and the protocol governing their interaction, as well as their efficient implementation on shared physical media.
With few exceptions [5, 6, 7, 8, 9], existing techniques do not adequately address the synthesis of communication for concurrent systems. This paper presents a methodology for the analysis and synthesis of interfaces for time-constrained concurrent systems. Such systems are characterized by tightly interacting processes operating under strict timing and sequencing constraints. We abstract the inter-process communication using blocking, semi-blocking, and non-blocking messages. These messages are transferred over abstract communication channels, which are mapped into a physical implementation (e.g. buses) by the synthesis process. This is in contrast to approaches where the communication is achieved structurally through the use of ports. We consider only point-to-point communication because of its determinism (i.e. each message operation communicates with exactly one other operation) and lack of arbitration, both important characteristics for time-constrained designs. We represent the timing and sequencing relationships between messages using a graph abstraction called a message dependency graph. This graph effectively captures the sequencing dependencies in the communication protocol and serves as the basis to analyze communication deadlock and cross-process timing requirements.
We present a novel technique called interface matching that minimizes the interface control logic and inter-process handshaking by scheduling each process using the communication patterns and timing behavior between concurrent processes. Our technique is guaranteed to yield the minimum amount of required explicit handshaking for a class of designs. In order to further reduce hardware costs, we describe the synthesis of communication on shared physical media. We show how this sharing can be increased through rescheduling and serialization of the communication.
The problem of reducing the amount of synchronization in concurrent processes has been studied in various forms. In the area of hardware synthesis, the approach described in [10] uses appropriately sized queues to decouple the sending and receiving processes. The use of queues allows processes to proceed with their execution without having to completely synchronize via blocking. This comes at the expense of increased implementation costs due to the added queues and increased control complexity. Our approach seeks to minimize the implementation cost by eliminating the need for queues by matching communication patterns between processes whenever possible.
In the area of concurrent software, the problem of minimizing synchronization has been explored at various levels [11, 12, 13, 14]. The most relevant work to our problem is presented in [12] and addresses the synchronization problem for barrier MIMD machines [15]. In this problem, barriers are used to synchronize instruction streams running on concurrent processors. Although their model does not apply directly to our hardware synthesis problem, the goal of reducing the amount of synchronization through static resolution is the same as ours.
A block diagram of the proposed synthesis process is shown in Figure 2. The first step is to schedule all of the processes independently using traditional scheduling techniques. This scheduling information is subsequently used by interface matching, which considers pairs of processes and attempts to simplify their communication. The result of matching may require the individual processes to be rescheduled, which can in turn lead to additional matching. Finally channel merging is applied to reduce the number of physical channels. These steps will be described in detail, and it will be shown that this iterative optimization technique is guaranteed to terminate.
The paper is organized as follows. Section II describes our model of hardware behavior and inter-process communication. The extraction of interfaces using message dependency graphs is described in Section III. Section IV describes the interface matching technique to reduce interprocess synchronization. The merging of communication channels is described in Section V. Finally, we conclude with experimental results and directions for further research.
II. Modeling Inter-process Communication
The choice of a hardware model largely impacts the scope and applicability of the synthesis algorithms. This work assumes that hardware is described using some generic hardware description language (HDL) that supports concurrency and interprocess communication. The HDL is compiled into a control/data-flow graph, which can then be optimized using the techniques described in this paper.
The sequencing graph model [16, 17] is used in order to build on existing work. This model satisfies the necessary requirements and supports the explicit representation of detailed timing constraints and synchronization. We first give a brief overview of the sequencing graph model, then describe the extensions we have added to model inter-process communication. In this paper we restrict our focus to non-pipelined, synchronous designs.
A. Modeling a single process.
We model a single process as a polar, hierarchical sequencing graph, denoted by $G_s(V_s, E_s)$, where the vertices $V_s$ represent operations to perform, and the edges $E_s$ represent the sequencing dependencies among the operations. The sequencing graph is acyclic because loops are broken through the use of hierarchy. A process starts execution at the source vertex, executes each vertex according to the sequencing dependencies, and restarts execution upon completion of the sink vertex. The execution delay of a vertex $v_i$, denoted by $\delta(v_i)$, can be fixed or data-dependent. The delay associated with a fixed delay operation depends solely on the nature of the operation, e.g. addition or register loading. In contrast, the time to execute a data-dependent delay operation may change for different input data sequences, e.g. waiting for the assertion of an external signal. We call the set of data-dependent delay vertices (including the source vertex) the anchors of $G_s(V_s, E_s)$, denoted by the set $A \subseteq V_s$.
Detailed minimum and maximum timing constraints can be specified between pairs of operations in the graph. In particular, consider two vertices $v_i$ and $v_j$ with start times $T(v_i)$ and $T(v_j)$, respectively. A minimum constraint $l_{ij} \ge 0$ between vertices $v_i$ and $v_j$ implies $T(v_j) \ge T(v_i) + l_{ij}$, and a maximum constraint $u_{ij} > 0$ implies $T(v_j) \le T(v_i) + u_{ij}$. We derive a constraint graph $G(V, E)$ from a given sequencing graph $G_s(V_s, E_s)$ as the basis for timing analysis; the vertices are identical (i.e. $V = V_s$), but the edges $E$ are now weighted. For a given edge $(v_i, v_j) \in E$, its edge weight $w_{ij}$ corresponds to a timing constraint on the activation of the two operations $v_i$ and $v_j$. Specifically, sequencing edges ($E_s$ in the original sequencing graph) and the set of minimum timing constraints $\{l_{ij}\}$ are converted into forward edges $E_f \subseteq E$, and the set of maximum timing constraints $\{u_{ij}\}$ is converted into backward edges $E_b \subseteq E$. Forward (backward) edges have positive (negative) weights and represent minimum (maximum) timing requirements.
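The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the relative scheduling algorithm of [16, 17]: it assumes that every delay is fixed, that a sequencing edge $(v_i, v_j)$ is weighted by $\delta(v_i)$, and that a set of constraints is inconsistent exactly when the weighted graph contains a positive cycle. The function names and the three-operation example are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, int]  # (tail, head, weight)


def build_constraint_edges(
    sequencing: Iterable[Tuple[str, str]],
    min_constraints: Iterable[Tuple[str, str, int]],
    max_constraints: Iterable[Tuple[str, str, int]],
    delay: Dict[str, int],
) -> List[Edge]:
    """Turn sequencing edges and timing constraints into weighted edges."""
    edges: List[Edge] = []
    for vi, vj in sequencing:
        # Assumption: vj may start only after vi completes.
        edges.append((vi, vj, delay[vi]))
    for vi, vj, l in min_constraints:
        # Forward edge: T(vj) >= T(vi) + l.
        edges.append((vi, vj, l))
    for vi, vj, u in max_constraints:
        # Backward edge: T(vj) <= T(vi) + u, i.e. T(vi) >= T(vj) - u.
        edges.append((vj, vi, -u))
    return edges


def has_positive_cycle(vertices: Iterable[str], edges: List[Edge]) -> bool:
    """Longest-path relaxation; still changing after |V| passes means a positive cycle."""
    verts = list(vertices)
    start = {v: 0 for v in verts}
    for _ in range(len(verts)):
        changed = False
        for tail, head, weight in edges:
            if start[tail] + weight > start[head]:
                start[head] = start[tail] + weight
                changed = True
        if not changed:
            return False
    return True


def feasible(u: int) -> bool:
    """Hypothetical chain a -> b -> c with a maximum constraint u from a to c."""
    edges = build_constraint_edges(
        sequencing=[("a", "b"), ("b", "c")],
        min_constraints=[],
        max_constraints=[("a", "c", u)],
        delay={"a": 2, "b": 4, "c": 1},
    )
    return not has_positive_cycle("abc", edges)
```

With these assumed delays, c cannot start earlier than 6 cycles after a, so `feasible(6)` holds while `feasible(5)` does not.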
Example 1: Consider the network packet decoder example in Figure 3. It consists of two processes: the main decoding process and a second process that assembles the data to present to the microprocessor. Figure 3(b) shows the Verilog description for the main decoder process. The corresponding constraint graph for this process is shown in Figure 4. The vertices are labeled according to their corresponding operation in the specification, and the source and sink vertices are labeled s and t respectively. In addition to the sequencing constraints, both minimum and maximum timing constraints are shown in this example. Consider the send operation labelled c. The specification requires that the operations b and c must be sequential; therefore a sequencing constraint exists from b to c. In addition, a maximum constraint between c and d requires c to begin execution no more than 2 cycles after d begins execution. The minimum constraint from c to e forces e to begin execution no less than $\delta(c)+1$ cycles after c begins execution. In other words, e must begin execution no sooner than 1 cycle after the completion of c. □

Given the constraint graph for a process, we can synthesize an implementation that meets the required timing constraints, or detect if no such implementation exists, using relative scheduling [16, 17]. We now review some relevant background on relative scheduling. Recall from above that anchors are operations with data-dependent delays.

Example 2: Figure 4 shows the constraint graph for the decoder process from Example 1 under a given profile of execution delays. The fixed delay operations have delays of $\delta(a) = 2$, $\delta(b) = \delta(d) = \delta(f) = 4$, and $\delta(h) = 1$. The bold vertices denote the anchors of the graph, and the bold edges represent forward edges weighted by a data-dependent delay. For example, the edge $(c, e)$ has weight $\delta(c)+1$, meaning e must wait at least 1 cycle after the completion of c. A valid schedule is given in Figure 5, where offsets from each anchor are given.

These offsets are simply the longest [...]

Since message operations represent points of interprocess communication, they are natural points of reference for imposing timing constraints. Minimum timing constraints between message operations imply delaying the activation of messages and do not pose any problems. On the other hand, if there exist maximum timing constraints across blocking message operations, then the original specification may be overconstrained since a blocking operation has unbounded execution delay. With interface matching, it is possible to convert some blocking message operations to non-blocking operations by taking advantage of global communication patterns. We therefore modify the relative scheduling formulation in the following way.

Initially, any maximum timing constraints that are violated due to blocking operations are removed. After performing interface matching, these constraints are applied to the results and their consistency is checked. If blocking operations exist in the result, it is possible that some constraints will still be violated. Even if all operations are made non-blocking, the delays introduced by this process can result in constraint violations. It is shown later that under certain assumptions, if a solution exists, then interface matching will always find a solution.

Example 3: In Figure 6(a) the maximum constraint from c to a is not valid because b is a blocking operation and its delay is unbounded, i.e. a positive cycle exists when $\delta(b) > 3$. In (b) the same constraint graph is shown after transforming b from blocking to non-blocking. Although the operation was made non-blocking, its delay ($\delta(b) = 4$) is too large and the graph is still overconstrained. Finally in (c), b is once again non-blocking, but its delay is within the bounds of the constraint. Therefore, a solution exists if b can be made non-blocking with $\delta(b) < 4$.
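The threshold in Example 3 can be reproduced with a short worked inequality. Figure 6 is not reproduced here, so the numbers below are assumptions chosen only to match the stated bound: a chain a, b, c with sequencing edges weighted by the predecessor's delay, a hypothetical $\delta(a) = 1$, and a hypothetical maximum constraint $u = 4$ drawn as a backward edge from c to a.

```latex
\begin{align*}
w(a,b) + w(b,c) + w(c,a) &= \delta(a) + \delta(b) - u \\
                         &= 1 + \delta(b) - 4 \\
                         &= \delta(b) - 3 .
\end{align*}
% The cycle is positive, and the graph overconstrained, exactly when
% \delta(b) > 3. For integer delays the constraint is therefore
% satisfiable if and only if \delta(b) < 4.
```

Under these assumed weights a blocking b (unbounded $\delta(b)$) can always violate the constraint, a non-blocking b with $\delta(b) = 4$ still violates it, and any fixed $\delta(b) \le 3$ satisfies it.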
B. Interface between processes
We abstract inter-process communication in terms of messages that are sent and received between processes over a set of abstract media called channels. Messages are assumed to be synchronous, taking one or more clock cycles to complete. Each message consists of a send operation and a receive operation. Returning to Figure 3, the decoder process sends three messages {A, B, C} containing the preamble, content, and parity information, respectively.
The communication can be static or dynamic. With static communication, a message has exactly one send operation and one receive operation associated with it. Multiple senders or receivers of the same message are not allowed. This means that send and receive operations can be statically matched to one another at synthesis time. In contrast, messages in dynamic communication are produced and consumed dynamically, often using queues to decouple the sending and receiving processes [10]. As stated before, we will restrict our focus to the synthesis of statically communicating processes.
A message operation can be either blocking or non-blocking. A blocking operation waits until its corresponding operation in another process is ready to execute. Once the correspondent is ready, the blocking operation executes with a fixed latency. A non-blocking operation assumes that its correspondent is ready and executes immediately without waiting. Non-blocking operations have a fixed latency, while blocking operations have a data-dependent delay. A blocking operation requires a control acknowledge from its correspondent to signal when it is ready, while a non-blocking operation does not require such handshaking signals.
A message, made up of two message operations, can be either blocking, semi-blocking, or non-blocking. These types of messages correspond to the cases where both, one, or none of the operations are blocking. The number of control signals needed for handshaking is two, one, and zero respectively. Non-blocking messages in our context are unbuffered, i.e. they are implemented as reads and writes to external ports without the use of queues and handshaking control logic. Non-blocking messages are useful when the sender and receiver are implicitly coordinated. As mentioned above, queues can be used to decouple the sender and receiver processes. This makes all messages non-blocking; however, queues introduce added cost to the implementation. Our goal is to eliminate the need for such queues through interface matching. If message operations remain blocking after matching, then queueing can be used in a complementary fashion.
When two processes are communicating via blocking message operations, it is possible that the processes will deadlock [18, 19]. This happens if both processes are waiting for each other to execute some message operation that will in fact not execute until the currently stalled operation completes. An obvious solution to avoid this type of deadlock is to make all message operations non-blocking, so that a process never has to stall while waiting for its corresponding operation in another process to execute. However, for unbuffered messages this implies the possibility of the receiver incorrectly sampling data before or after the sender is ready.
To ensure that data is properly communicated between the processes, we define a communication to be valid if two conditions are satisfied: (1) it is free of deadlocks, and (2) the send and receive operations of a message are coincident during the transfer of data for every message. Otherwise, it is an invalid communication. In other words, all messages that are transmitted are properly received. The objective of interface synthesis is twofold: to analyze the communication for deadlocks and timing constraint violations, and to exploit the degrees of freedom in timing constraints to reduce synchronization and implementation costs while still ensuring valid communication. Referring back to Figure 2, the interface matching step reduces the hardware costs by converting from blocking to non-blocking operations wherever possible. In this conversion, operations are transformed from having unbounded and unknown delays into known delays. Hence processes need to be rescheduled after such conversions to incorporate these known delays into the final schedule.
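The message taxonomy above is small enough to state as code. The following sketch is only a restatement of the text (both, one, or neither operation blocking, with two, one, or zero handshake control signals); the function name `classify_message` is hypothetical.

```python
from typing import Tuple


def classify_message(send_blocking: bool, receive_blocking: bool) -> Tuple[str, int]:
    """Return the message type and the number of handshake control signals."""
    blocking_ops = int(send_blocking) + int(receive_blocking)
    kind = {2: "blocking", 1: "semi-blocking", 0: "non-blocking"}[blocking_ops]
    # One control signal is needed per blocking operation.
    return kind, blocking_ops


def handshake_signals(messages) -> int:
    """Total control signals for an iterable of (send_blocking, receive_blocking)."""
    return sum(classify_message(s, r)[1] for s, r in messages)
```

For instance, an interface with one blocking and two non-blocking messages needs `handshake_signals([(True, True), (False, False), (False, False)])`, i.e. two control signals, instead of the six required when all three messages are blocking.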
III. Extracting the Interface
We are now ready to formally define the interface between two processes. An interface describes two types of information: the causal dependencies between messages indicating whether executing one message requires the completion of other messages, and the minimum or maximum timing relationships that must be satisfied between the messages. Obviously, any timing relationship must be compatible with the causal dependencies. Intuitively, composing two processes consists of making sure the causal dependencies are mutually compatible, as well as propagating the timing relationships between the processes.
A. Message dependency graph
Given a process and its constraint graph $G(V, E)$, the set of message operations $M = \{m_1, m_2, \ldots, m_k\} \subseteq V$ in $G$ represents the points at which the process interacts with its environment. We assume that all message operations have an unknown completion time (i.e. blocking), which implies they are also anchors in $G$, i.e. $M \subseteq A$. Each message is composed of exactly one send operation in the sending process and one receive operation in the receiving process.
For a process, represented by $G(V, E)$, communicating with other processes via message operations $M$, we define its message dependency graph as follows. In other words, $G_m$ captures the sequencing dependencies between message operations. Note that $G_m$ is not necessarily connected. Figure 7(a) shows the message dependency graph for the decoder example. It is easy to show that if the original graph $G$ is valid, then the message dependency graph is acyclic.
Since $G_m$ captures the causal relationships between message operations within a process, any valid communication between two processes must be compatible with respect to the causal relationships in the individual processes. To formalize this notion, consider two processes $G_1$ and $G_2$ communicating over a set of message operations $M$, with message dependency graphs $G_{m1}$ and $G_{m2}$, respectively. We define the composition of $G_{m1}$ and $G_{m2}$ as follows.
Definition 2: The composition of two message dependency graphs $G_{m1}$ and $G_{m2}$ is a graph $G_{m12}$. The vertex set $V_{m12} = V_{m1} \cap V_{m2}$ consists of the common messages of the two processes. An edge $(v_i, v_j)$ exists in $G_{m12}$ provided there exists a path from $v_i$ to $v_j$ in either $G_{m1}$ or $G_{m2}$.
The composition is a graph of the sequencing dependencies between common messages of two processes. As discussed previously, it is possible that the original specification results in communication deadlock. The message dependency graphs will be used to check for this condition. This initial analysis is necessary because later synthesis procedures assume that the communication is valid.
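As a rough illustration of how such a graph might be extracted, the sketch below assumes one plausible reading of "sequencing dependencies between message operations": an edge $(m_i, m_j)$ is added whenever $m_j$ is reachable from $m_i$ along forward edges of $G$. This reading is an assumption, and the function name and the example edges are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def message_dependency_graph(
    forward_edges: Iterable[Tuple[str, str]],
    messages: Set[str],
) -> Set[Tuple[str, str]]:
    """Edges (mi, mj) such that mj is reachable from mi via forward edges."""
    successors: Dict[str, Set[str]] = {}
    for tail, head in forward_edges:
        successors.setdefault(tail, set()).add(head)

    dependencies: Set[Tuple[str, str]] = set()
    for m in messages:
        # Depth-first search over everything reachable from message m.
        stack = list(successors.get(m, ()))
        seen: Set[str] = set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in messages:
                dependencies.add((m, v))
            stack.extend(successors.get(v, ()))
    return dependencies


# Hypothetical process: s -> a -> b -> c -> e -> g -> t, with c also feeding d.
EDGES = [("s", "a"), ("a", "b"), ("b", "c"), ("c", "d"),
         ("c", "e"), ("e", "g"), ("g", "t")]
GM = message_dependency_graph(EDGES, {"c", "e", "g"})
```

Here `GM` contains the dependencies (c, e), (e, g) and the transitive dependency (c, g); non-message operations such as d never appear.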
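Definition 2 and the deadlock check it enables can be sketched as follows. The code is a minimal illustration under the definition as stated (common vertices, an edge wherever a path exists in either graph); the helper names and the two small examples are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def reachable_pairs(edges: Set[Edge]) -> Set[Edge]:
    """All (a, b) such that a path from a to b exists."""
    successors: Dict[str, Set[str]] = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    pairs: Set[Edge] = set()
    for source in successors:
        stack, seen = list(successors[source]), set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            pairs.add((source, v))
            stack.extend(successors.get(v, ()))
    return pairs


def compose(v1: Set[str], e1: Set[Edge], v2: Set[str], e2: Set[Edge]):
    """Composed graph: common messages, edges from paths in either graph."""
    common = v1 & v2
    edges = {(a, b) for a, b in reachable_pairs(e1) | reachable_pairs(e2)
             if a in common and b in common}
    return common, edges


def is_acyclic(vertices: Set[str], edges: Set[Edge]) -> bool:
    """Kahn's algorithm: the graph is acyclic iff every vertex can be removed."""
    indegree = {v: 0 for v in vertices}
    for _, b in edges:
        indegree[b] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return removed == len(vertices)


MSGS = {"A", "B", "C"}
# Sender orders A, B, C; receiver accepts them in any order.
OK_V, OK_E = compose(MSGS, {("A", "B"), ("B", "C")}, MSGS, set())
# One process needs A before B, the other needs B before A.
BAD_V, BAD_E = compose(MSGS, {("A", "B")}, MSGS, {("B", "A")})
```

The first composition is acyclic, so the communication is consistent; the second contains the cycle A, B, A, which signals a possible deadlock in the specification.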
Theorem 1: Consider two processes $P_1$ and $P_2$ and their corresponding message dependency graphs $G_{m1}$ and $G_{m2}$. If the composed message dependency graph $G_{m12}$ contains a cycle, then the communication between the processes is invalid.
Proof: Consider two vertices $v_i$ and $v_j$ on the cycle, which correspond to the two messages $m_i$ and $m_j$. Let the associated send operations be denoted by $s_i$ and $s_j$ with the receive operations being $r_i$ and $r_j$. For valid communication to exist $s_i$ must be coincident with $r_i$, while $s_j$ and $r_j$ must likewise be coincident. For the sake of contradiction, assume that message $m_i$ is valid. This means that $s_i$ and $r_i$ are coincident at some time $t$. The cycle in the graph implies there is a sequencing dependency from $v_i$ to $v_j$. This means that one of the operations associated with $m_j$ cannot execute until sometime later than $t$. Furthermore, there exists a dependency from $v_j$ to $v_i$. This implies that the other message operation associated with $m_j$ must complete execution sometime earlier than $t$. Clearly, message $m_j$ is not valid since it is not possible for its message operations to be coincident in time. By contradiction, a cycle in the dependency graph implies invalid communication. □
If the composed graph is cycle-free, then we say the communication is consistent. We state the following theorem:
Theorem 2: A consistent communication can always be made valid by making all message operations blocking.
Proof: By the definition of blocking communication it is clear that the send and receive operations are coincident. It remains to be shown that the communication is deadlock free. For the sake of contradiction, assume that two messages $m_1$ and $m_2$ are in deadlock. This means that in one process $P_a$, message $m_1$ is active and must complete before executing $m_2$, and in another process $P_b$, message $m_2$ is active and must complete before executing $m_1$. Therefore, in process $P_a$ a causal path exists from $m_1$ to $m_2$, and in $P_b$ a causal path exists from $m_2$ to $m_1$. By definition of the composed message dependency graph there exists a cycle containing messages $m_1$ and $m_2$. However, the composed message dependency graph is acyclic. So by contradiction we conclude that the communication is deadlock free. □

Example 4: Figure 7 illustrates the composition of the decoder example (from Figure 4), which sends three messages {A, B, C}, with a receiving process. Redundancies have been removed in the composed graph. The receiving process, which has not been shown, has no dependencies between the messages. This simply means that it is capable of receiving the messages in any order. In this case the communication between the processes is consistent because there are no cycles in the composed graph.
B. Incorporating interface timing relationships
We have seen that consistency of the sequencing dependencies among message operations can be analyzed by composing message dependency graphs and checking for cycles. However, there are also timing relationships that are not represented in the message dependency graph abstraction, e.g. requiring a message operation to begin at least 4 cycles after the completion of another message. This timing information is needed by interface matching to ensure precise coordination of communication between the processes.
Transformation of a blocking operation into a nonblocking one is achieved through the use of timing information to schedule the completion time of the operation. In Figure 2 the start times for all operations are computed in the initial scheduling step. The matching step attempts to compute the as soon as possible completion times so that non-blocking operations can be used in place of the original blocking ones. This implies that a schedule for the message dependency graph of a process is needed.
For the purposes of scheduling we assume that all message operations have unknown delay (i.e. they are anchors). This assumption is made because we want to schedule message operations with respect to other ones, and in our scheduling formulation the scheduling is done relative to the anchors. In addition to the message operations, there are other internal anchors not visible to other processes, such as data-dependent loops. In general, the start time of an operation may depend on such internal anchors. 
From the schedule we see that operation c depends on $\delta(s)$, e depends on $\delta(s)$ and $\delta(c)$, and g depends on $\delta(e)$. Therefore, c and e are said to be uncontrollable because their start times depend on a non-message operation, i.e. s. In contrast, operation g is controllable. □
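The distinction drawn in this example can be sketched as a simple set test. The sketch assumes, as the example suggests, that a message operation is controllable exactly when every anchor its start time depends on is itself a message operation; the function name is hypothetical, and the anchor sets are those quoted in the example.

```python
from typing import Dict, Set


def controllable_messages(
    anchor_sets: Dict[str, Set[str]],
    messages: Set[str],
) -> Set[str]:
    """Message operations whose anchors are all message operations."""
    return {m for m in messages if anchor_sets[m] <= messages}


# Anchor sets from the example: c depends on s; e on s and c; g on e.
ANCHORS = {"c": {"s"}, "e": {"s", "c"}, "g": {"e"}}
MESSAGES = {"c", "e", "g"}
CONTROLLABLE = controllable_messages(ANCHORS, MESSAGES)
```

With these anchor sets only g is controllable, since c and e both depend on the source vertex s, which is not a message operation.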
In order to compute the completion times for some message operation a, we need to combine the interface schedules $I(G_{m1})$ and $I(G_{m2})$ of the two processes that use message a to communicate. The composed interface schedule $I(G_{m12})$ defines the as soon as possible completion times for the message operations and can be computed as follows. Let $A_{m1}(v)$ and $A_{m2}(v)$ be the anchor sets of a message operation $v \in V_{m12}$ in the interface schedules $I(G_{m1})$ and $I(G_{m2})$, respectively. The anchor set for $v$ in the composed interface schedule is the union of the anchor sets:

$$A_{m12}(v) = A_{m1}(v) \cup A_{m2}(v).$$

Let $\sigma^{m1}_a(v)$ and $\sigma^{m2}_a(v)$ be the offsets of an operation $v$ with respect to an anchor $a$ in the individual interface schedules. The composed offset is computed as the maximum of the individual offsets, i.e.

$$\sigma^{m12}_a(v) = \max\{\sigma^{m1}_a(v),\ \sigma^{m2}_a(v)\}.$$
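A minimal sketch of this composition is given below. It assumes each interface schedule is stored as a mapping from message operation to a mapping from anchor to offset, and that an anchor present in only one schedule contributes its single offset unchanged; both the data layout and the function name are assumptions made for illustration. The offsets in the example are hypothetical values chosen to be consistent with the scenario of Example 6, measured from the completion of message A.

```python
from typing import Dict

Schedule = Dict[str, Dict[str, int]]  # message -> anchor -> offset


def compose_interface_schedules(s1: Schedule, s2: Schedule) -> Schedule:
    """Union the anchor sets and take the maximum offset per anchor."""
    composed: Schedule = {}
    for v in s1.keys() & s2.keys():          # common messages only
        anchors = s1[v].keys() | s2[v].keys()  # union of anchor sets
        composed[v] = {
            a: max(s[v][a] for s in (s1, s2) if a in s[v])
            for a in anchors
        }
    return composed


# Hypothetical offsets relative to the completion of A.
P1 = {"B": {"A": 0}, "C": {"A": 3}}
P2 = {"B": {"A": 1}, "C": {"A": 0}}
COMPOSED = compose_interface_schedules(P1, P2)
```

Under these assumed offsets the composed schedule places the completion of B one cycle and of C three cycles after the completion of A, which is the earliest point at which both processes are ready.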
Example 6: Consider the example in Figure 8. The top part of the figure shows the message dependency graphs for two processes $P_1$ and $P_2$ with three messages {A, B, C} before and after composition. There are sequencing constraints from A and minimum timing constraints between B and C. The bottom part of the figure shows an execution scenario for processes $P_1$ and $P_2$ based on schedules that are consistent with the individual processes. For example, message C in $P_1$ is scheduled to execute three cycles after the completion of A. If messages B and C are non-blocking, then the communication is not valid because the operations associated with the two messages would not be coincident. If the messages are made blocking, then operation B in $P_1$ would wait one cycle until its corresponding operation in $P_2$ executes, and operation C in $P_2$ would wait 3 cycles to synchronize with its correspondent in $P_1$. In the composed interface schedule (c), we see that if we schedule the completion of operations B and C in both processes to be 1 and 3 cycles after the completion of A, then they can be made non-blocking while still ensuring valid communication. In this case the behavior is identical to the blocking case, but the hardware cost has been reduced.

[...] boundaries. Second, it enables each process to be synthesized individually, yet with all the requirements on its interactions with other processes fully represented as explicit timing constraints. Finally, it provides a formalism to manipulate and model inter-process interactions, e.g. we can now constrain the interface by directly applying sequencing and timing constraints on the external messages; these constraints can then be reflected to the individual processes for use during synthesis.
IV. Interface Matching
Given two processes, if their composed message graph is consistent, then Theorem 2 states that all messages can be made blocking to guarantee valid communication. However, it is often the case that the communication remains valid even if some messages are made non-blocking, as seen in Example 6.
Example 7: To further illustrate this point, consider process $P_1$ from Example 6. A possible schedule for $P_1$ is shown in Figure 9(a). Both B and C have been broken into two vertices (e.g. $B_s$ and $B_c$) to represent the start and completion of the operations. This allows the completion time to be scheduled under the constraints of the composed interface schedule shown in Figure 8(c). A similar technique can be used in $P_2$ so that the resulting schedule is that of Figure 9(b). A remains blocking while both B and C are made non-blocking. Although the operations start at different times, they complete at the same time, and a valid transfer takes place. This constitutes a significant savings in terms of synchronization logic. □

We formalize this observation by introducing the interface matching problem. Consider two processes $P_1$ and $P_2$ with common messages $M$ and a corresponding composed message constraint graph $G_{m12}$ that is acyclic. Let $M$ be partitioned into the set of blocking messages $M_{\text{block}}$ and the set of non-blocking messages $M_{\text{non-block}}$. The interface matching problem is to minimize the number of blocking messages $M_{\text{block}}$ while ensuring valid communication.

Intuitively, interface matching converts blocking messages into non-blocking ones by scheduling the completion time of a message operation when possible, as opposed to scheduling start times in conventional scheduling. This is because the completion of a message operation implies the successful transfer of information between the sender and receiver. Therefore, interprocess communication can be viewed as a set of time intervals, where the start of the interval corresponds to the start of a message operation, and the end corresponds to its completion. Successful communication requires the intervals of the sender and receiver to always overlap at some point, for all message operations. The interface matching algorithm in the next section computes the as soon as possible point of overlap between the intervals.
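The interval view of communication lends itself to a small sketch. The model below is an assumption made for illustration: each message operation is an interval of clock cycles (start, completion), a blocking operation is represented with an unbounded completion time, and the earliest point of overlap is the later of the two start times when the intervals intersect. The function name and the cycle numbers are hypothetical.

```python
import math
from typing import Optional, Tuple

Interval = Tuple[int, float]  # (start cycle, completion cycle or math.inf)


def asap_overlap(sender: Interval, receiver: Interval) -> Optional[int]:
    """Earliest cycle at which both operations are active, or None."""
    first = max(sender[0], receiver[0])   # both must have started
    last = min(sender[1], receiver[1])    # neither may have completed
    return first if first <= last else None


# Blocking sender waits from cycle 2; receiver samples at cycle 5.
BLOCKING = asap_overlap((2, math.inf), (5, 5))
# Both non-blocking and unmatched: the transfer is missed.
UNMATCHED = asap_overlap((2, 2), (5, 5))
# Matched: different start times, same completion time.
MATCHED = asap_overlap((3, 5), (5, 5))
```

In the matched case both operations complete at cycle 5 without any handshake, which is the situation interface matching tries to create by scheduling completion times.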
If a solution satisfying all constraints is found, then it is guaranteed to have minimum execution delay for all input data sequences.
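The interval view of communication can be illustrated with a minimal sketch. It assumes that each message operation is represented by a closed interval of integer cycle numbers (start, completion); this representation and the function name `earliest_overlap` are illustrative assumptions, not part of the formulation above.

```python
from typing import Optional, Tuple

# Assumed representation: (start cycle, completion cycle), both inclusive.
Interval = Tuple[int, int]


def earliest_overlap(sender: Interval, receiver: Interval) -> Optional[int]:
    """Return the earliest cycle at which both operations are active, or None."""
    start = max(sender[0], receiver[0])  # both sides must have started
    end = min(sender[1], receiver[1])    # neither side may have completed
    return start if start <= end else None


# Operations that start at different times can still overlap and transfer data.
print(earliest_overlap((2, 6), (4, 6)))  # 4
# Disjoint intervals mean no valid transfer can take place.
print(earliest_overlap((0, 1), (3, 5)))  # None
```

In this sketch the as-soon-as-possible point of overlap is simply the later of the two start times, provided neither operation has completed by then.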
Reducing the number of blocking messages leads to savings in two areas. First, blocking messages are implemented with a set of handshaking signals (e.g., request and acknowledge) to coordinate the data transfer between sender and receiver. Making a message non-blocking means these handshaking signals and the associated logic and ports can be removed. Second, a blocking message has a data-dependent execution delay. This can lead to larger controller cost because of the need to synthesize busy waits in both the sending and receiving processes. In contrast, no busy waits are necessary for non-blocking messages, which can result in a simpler control implementation [20].
A. Interface matching algorithm
The overall flow of the synthesis process was first outlined in Figure 2. Many of the details were left out and are explained in more depth here. The algorithm for interface matching is shown in Figure 10. For simplicity, the low-level details related to scheduling are not discussed here. Although we use relative scheduling, our formulation is independent of the scheduling technique, and other methods can easily be substituted.
[Figure 10: Interface matching algorithm. InterfaceMatch($P_1$, $P_2$) repeats the following steps: compute an initial schedule of each process with Schedule($P_1$) and Schedule($P_2$); construct the graphs $G_{m1}$ = ConstructMsgDependGraph($G_1$), $G_{m2}$ = ConstructMsgDependGraph($G_2$), and $G_{m12}$ = ConstructComposedMsgDependGraph($G_{m1}$, $G_{m2}$); compute the interface schedules; ...]
Given a pair of scheduled communicating processes and a common set of messages, we first extract and compose their message dependency graphs. Although not shown in the algorithm, if the resulting composed graph is cyclic, then the communication is invalid and no solution is possible. Otherwise, for each process an interface schedule is derived from the process schedule. Based on these schedules, a schedule of completion times is constructed to form a composed interface schedule. If there are no controllable messages in either process, then there are no blocking operations that can be converted to non-blocking, and the algorithm completes.
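The validity check on the composed graph can be sketched as follows. This is a minimal illustration that assumes each message dependency graph is given as a dictionary mapping a message to the set of messages that must precede it, and that composition is the union of the two edge sets over the common messages. The data format and the names `compose` and `is_valid` are assumptions made for this sketch; they are not the routines of Figure 10.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def compose(g1, g2):
    """Union two message dependency graphs given as {message: set of predecessors}."""
    composed = {}
    for graph in (g1, g2):
        for msg, preds in graph.items():
            composed.setdefault(msg, set()).update(preds)
            for p in preds:
                composed.setdefault(p, set())  # make sure every message is a vertex
    return composed


def is_valid(composed):
    """Communication is treated as valid only if the composed graph is acyclic."""
    try:
        TopologicalSorter(composed).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False


# P1 orders a, b, c; P2 only requires a before c: the orders are compatible.
print(is_valid(compose({"b": {"a"}, "c": {"b"}}, {"c": {"a"}})))  # True
# P1 expects a before b while P2 expects b before a: the composition is cyclic.
print(is_valid(compose({"b": {"a"}}, {"a": {"b"}})))  # False
```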
Each message $m$ in the composed message dependency graph has two message operations, $m_1$ and $m_2$, from $G_{m1}$ and $G_{m2}$, respectively. The corresponding vertices in the process constraint graphs $G_1$ and $G_2$ are denoted by $v_1$ and $v_2$. If the operation $m_1$ is controllable, then operation $m_2$ can be made non-blocking by scheduling its completion time $T(v_2^c)$. The completion time is set such that $m_2$ is guaranteed to complete after $m_1$ begins execution. This is possible because $m_1$ is controllable, and therefore its start time is known in $P_2$. If $m_1$ is blocking, the result is a semi-blocking message where $m_1$ is ready first and waits for $m_2$; if $m_1$ is non-blocking, the message is non-blocking and $m_1$ and $m_2$ complete at the same time. The same procedure is applied to $m_2$ and $m_1$ with their roles interchanged.
Example 8: An application of the algorithm is shown in Figure 11. In this example we are only concerned with message $c$. In process $P_1$, the start time of $c$ depends only on messages $a$ and $b$; therefore, it is controllable. However, in process $P_2$, the start time of $c$ depends on message $a$ and an internal data-dependent loop $d$ (denoted by the square vertex); therefore, it is uncontrollable. Operation $c_2$ (split into $c_2^s$ and $c_2^c$) is made non-blocking in (b) by scheduling its completion ($c_2^c$) such that it occurs after the start of $c_1$ for all input traces. It is not possible to make $c_1$ non-blocking.
After transforming all possible blocking messages, the entire process is started again by rescheduling and continuing from there. It is possible that the rescheduling will introduce new controllable operations that were previously uncontrollable. This happens when the rescheduled start time of a message operation no longer depends on an internal anchor.
Example 9: The need for iteration on the matching step is shown in Figure 12. In this example we are concerned with messages $b$ and $c$.
In (a), $b_1$ is controllable while $c_1$ is uncontrollable because of the internal loop $d$. Suppose $b_2$ (in process $P_2$, not shown) is also controllable so that $b_1$ is made non-blocking. In this case a new dependency from $d_1$ to $b_1$ is introduced in $P_1$. Because of the delay values, the start of message operation $c_1$ no longer depends on $e$; therefore it becomes controllable. This cannot be known without rescheduling the graph.
When there are no controllable operations remaining, the resulting constraint graph is rescheduled for a final time. This final scheduling is done without ignoring any of the maximum constraints. Recall that scheduling usually ignores maximum constraints across blocking operations because they may be converted to non-blocking ones. If one of these constraints is not satisfiable at the end, the original specification is overconstrained. It should be noted that in cases where some constraints remain unsatisfied, it might be possible to add serializations and constraint lengthening to introduce new controllable operations, which potentially lead to a feasible result. However, these steps have the unwanted side effect of reducing concurrency and increasing latency; therefore, they are not applied.
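The controllability rule used by the matching step can be sketched in a few lines. The sketch assumes that an operation is controllable exactly when every anchor its start time depends on is itself a message, which is how Example 8 classifies message $c$. The dictionary format and the names `is_controllable` and `non_blocking_sides` are hypothetical and serve only as illustration.

```python
def is_controllable(anchor_set, message_anchors):
    """Assumed rule: controllable when the start time depends only on message anchors."""
    return set(anchor_set) <= set(message_anchors)


def non_blocking_sides(msg, anchors_p1, anchors_p2, message_anchors):
    """Return the processes in which the operation for msg may be made non-blocking.

    anchors_p1 and anchors_p2 map a message to the anchor set of its operation
    in P1 and P2 (hypothetical input format).
    """
    sides = set()
    if is_controllable(anchors_p1[msg], message_anchors):
        sides.add("P2")  # m1 controllable: the completion of m2 can be scheduled
    if is_controllable(anchors_p2[msg], message_anchors):
        sides.add("P1")  # m2 controllable: the completion of m1 can be scheduled
    return sides


# Data read off Example 8: c depends on {a, b} in P1 and on {a, d} in P2,
# where d is an internal data-dependent loop rather than a message.
messages = {"a", "b", "c"}
print(non_blocking_sides("c", {"c": {"a", "b"}}, {"c": {"a", "d"}}, messages))  # {'P2'}
```

Because rescheduling can change anchor sets, a full implementation would recompute them and repeat this classification until no new controllable operation appears.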
Since no changes in the existing constraints are made and no new ones are added, scheduling the completion time of an operation with this algorithm does not affect the cycle-per-cycle behavior of the resulting constraint graph. The algorithm simply determines the completion time of those blocking operations that can be made non-blocking. If the operation were to be left as blocking, it would complete at the same time as the transformed non-blocking operation. So the algorithm is guaranteed not to increase the latency of the design.
Under the restriction that the latency is not to be increased, this algorithm determines the maximum number of non-blocking operations. Any operation that remains blocking is the result of an uncontrollable operation. An uncontrollable operation can be made controllable only if a new sequencing edge is added or if the value of an existing constraint is increased. Modifying or adding such a constraint will lead to an increase in latency for some input traces. Therefore, the matching algorithm presented here computes the maximum number of non-blocking messages under the constraint that the latency is not increased.
The time complexity for one iteration of the matching algorithm over all processes is $O(|M|\,|V|)$, where $|M|$ represents the total number of messages in all processes and $|V|$ is the average number of vertices per constraint graph.
The algorithm repeats itself if new controllable operations are formed. Typically only a few iterations are necessary to find all controllable operations. However, in the worst case $|M|$ iterations would be required, which results in an overall time complexity of $O(|M|^2\,|V|)$.
V. Channel Merging
Given an interface graph, the interface matching procedure reduces the number of blocking message operations as much as possible. This reduction in the number of blocking operations leads to lower communication and control costs. A further benefit is the potential for multiple separate communication channels (transferring data between sender and receiver) to be merged together and implemented on a shared physical medium. Merging is easier for non-blocking operations than for blocking ones because they have fixed, as opposed to unbounded, delay. Until now, an assumption was made that each communication channel is implemented using dedicated control signals along with dedicated data lines. We now relax this assumption to allow the merging of channels.
There are varying degrees to which channels can share the same physical hardware. In general, control signals can be shared separately from the data signals. Furthermore, depending on data widths, channels can be combined so that multiple channels can share the same physical channel at the same time. A more general scheme is to dynamically allocate the hardware to channels through the use of dynamic arbitration. Other variants are possible but we will consider only the static case where channels have the same size as the physical medium being mapped to. Furthermore, we assume that if two channels are shared, then both the data and control signals are shared.
Channel merging can be implemented at various stages of the synthesis flow. The most direct method, described in Section A, is to apply merging before scheduling by treating the physical channels as critical hardware resources and using serialization techniques [16] to share these resources. Alternatively, channel merging can be applied after scheduling has been performed, as described in Section B. In this case the results from scheduling are analyzed to determine where selective merging can take place. Finally, we describe in Section C rescheduling techniques to further increase the amount of merging that is possible.
A. Merging before scheduling
Merging before scheduling is the most direct method because communication channels are treated the same as other hardware resources, such as adders and ALUs. Therefore, traditional techniques used for resource sharing can be applied here with little or no modification. Provided the synthesis system can share critical resources under timing constraints, no special analysis is necessary to support this type of channel merging.
However, there is one important difference between communication channels and other hardware resources. Since we assume the initial message operations are blocking, the latency of these operations is not fixed. Therefore, given two candidate channels to merge, it is not enough to simply schedule the start time of the message operation of one channel before the start time of the other, because their completion times are unknown. A sequencing dependency between the operations must exist to guarantee that they are never active at the same time for all possible execution traces. Furthermore, this sequencing dependency must exist in both communicating processes for the channels to be merged while preserving valid communication.
An advantage of this method is that it can be used to solve the problem when the number of physical channels is constrained. Serialization can be used as a preprocessing step to ensure that the number of physical channels does not exceed the specified constraint. Techniques exist that find a serialization under timing constraints, provided one exists [16]. The main disadvantage of merging before scheduling is that it requires operations to be serialized. The addition of sequencing dependencies to the constraint graph reduces the degree of concurrency, which may increase the latency of the final implementation. To avoid this increase, the technique presented in the next section can identify channel merging opportunities without affecting the latency.
B. Merging after scheduling
To merge channels after scheduling, the first step is to analyze the scheduled results from the interface matching procedure. Based on the sequencing dependencies and schedules, it is possible to determine whether or not two channels can be merged. In this case merging has no effect on the synthesized result other than reducing the hardware cost. If changes in the circuit behavior (within the limits of the specification) are acceptable, further merging can be achieved by modifying the schedules. This will be discussed later.
After interface matching has been performed and the processes have been scheduled, channel merging is introduced by determining whether two given channels could possibly be active at the same time. If the message operations associated with two channels $a$ and $b$ are guaranteed to be mutually exclusive in time, both $a$ and $b$ can be merged together and implemented on the same physical channel. This analysis can be broken up into several different cases depending on whether the message operations are blocking or non-blocking.
For the following cases, two messages $a$ and $b$ that communicate between two processes $P_1$ and $P_2$ are considered. The message operations associated with message $a$ are denoted by $a_1$ and $a_2$, indicating the process to which they belong. Likewise, $b_1$ and $b_2$ are the operations associated with message $b$. The anchor set of an operation is denoted by $A(x)$, where $x$ is some message operation. As discussed previously, the anchor set represents the set of operations having data-dependent delay upon which $x$ is dependent for its activation. The two processes are analyzed separately, and later the results from both processes are combined to determine whether channels $a$ and $b$ can be merged.
The simplest case to analyze is when both $a_1$ and $b_1$ are blocking operations. As described in Section A, it is enough to check whether a sequencing dependency exists between $a_1$ and $b_1$. If such a dependency exists, the channels can be merged. Otherwise, they must remain separate.
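The both-blocking test reduces to a reachability query on the constraint graph of each process. The sketch below assumes each graph is a dictionary from a vertex to its successors and treats a sequencing dependency as the existence of a directed path in either direction; the function names are hypothetical.

```python
def reaches(graph, src, dst):
    """Depth-first search: is there a directed path src -> dst in {vertex: successors}?"""
    stack, seen = [src], set()
    while stack:
        v = stack.pop()
        if v == dst:
            return True
        if v in seen:
            continue
        seen.add(v)
        stack.extend(graph.get(v, ()))
    return False


def sequenced(graph, x, y):
    """A sequencing dependency exists if one operation must precede the other."""
    return reaches(graph, x, y) or reaches(graph, y, x)


def can_merge_blocking(g1, g2, a, b):
    """Both-blocking case: the dependency must exist in both processes."""
    return sequenced(g1, a, b) and sequenced(g2, a, b)


g1 = {"a": ["x"], "x": ["b"]}   # in P1, a precedes b through an intermediate operation x
g2 = {"a": ["b"]}               # in P2, a directly precedes b
print(can_merge_blocking(g1, g2, "a", "b"))                   # True
print(can_merge_blocking(g1, {"a": [], "b": []}, "a", "b"))   # False
```

The sketch does not check that the two processes order the operations in the same direction; opposite directions would already make the composed message dependency graph cyclic and the communication invalid.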
If one of the two operations is non-blocking, say $a_1$, then the operation has been split into a start vertex $a_1^s$ and a completion vertex $a_1^c$. Recall that $a_1^s$ represents the start of waiting, and $a_1^c$ represents the actual transfer of data. In this case the delay of $b_1$ is not known (it is blocking). Therefore, $a_1^c$ must complete before $b_1$ starts for merging to be possible. Operation $b_1$ cannot be scheduled first because its completion time is unbounded. In order to guarantee that $a_1^c$ completes before $b_1$, two conditions must hold. First, the anchor sets of the operations must satisfy the relation $A(a_1^c) \subseteq A(b_1)$. This means that every data-dependent operation that affects the completion time of $a_1^c$ must also affect the start time of $b_1$. Otherwise, it is possible that the completion of $a_1^c$ will be delayed beyond the start of $b_1$ due to some other anchor. The second condition is that for each anchor in $A(a_1^c)$, its offset to the completion of $a_1^c$ must be less than its offset to $b_1$. This ensures that for every possible execution trace, $a_1^c$ always completes execution first. If these two conditions hold, the two operations can be merged.
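The two conditions for the mixed case can be checked directly once the anchor offsets are known. The sketch below assumes the anchor-set relation is containment of the anchor set of $a_1^c$ in that of $b_1$, and that each operation is described by a dictionary from anchor to offset in cycles. Both the format and the name `completes_before` are illustrative assumptions.

```python
def completes_before(offsets_ac, offsets_b):
    """Check that the non-blocking completion a^c always precedes the blocking b.

    offsets_ac, offsets_b: {anchor: offset in cycles} (hypothetical format).
    """
    # Condition 1: every anchor of a^c must also be an anchor of b.
    if not set(offsets_ac) <= set(offsets_b):
        return False
    # Condition 2: from each shared anchor, a^c is strictly earlier than b.
    return all(offsets_ac[x] < offsets_b[x] for x in offsets_ac)


print(completes_before({"w": 2}, {"w": 5, "x": 1}))   # True
print(completes_before({"w": 2, "y": 0}, {"w": 5}))   # False: y can delay a^c but not b
print(completes_before({"w": 5}, {"w": 5}))           # False: offsets are not strictly ordered
```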
Example 10: Figure 13 illustrates the necessary conditions for the case of one blocking and one non-blocking operation.
Until this point, only a single process has been considered. However, for two channels to actually be merged, it must be the case that the operations can be merged in both processes. So for channels $a$ and $b$ to be merged, $a_1$ and $b_1$ must be mergeable in process $P_1$, as well as $a_2$ and $b_2$ in $P_2$.
Further analysis is also needed in order to merge more than two channels. Once all pairs of messages have been checked, a merge compatibility graph is formed where the vertices represent the messages, and an edge between two messages implies that the two can be merged. Two or more messages can be merged if and only if there exists a clique in the merge graph that contains the messages. Clique partitioning is performed on this merge graph to determine which messages are merged together. A physical channel is needed for each clique in the partition.
Example 11: In Figure 14(a), the message constraint graph for some process $P_1$ is shown. For simplicity, assume the constraint graph for a second process $P_2$ is identical.
Of the five messages in the graph, three of them, $\{a, b, c\}$, are uncontrollable because they depend on a non-message anchor, while the other two, $\{d, e\}$, are controllable. Therefore, the results from interface matching would yield three blocking and two non-blocking message operations in both processes. Analysis of this graph leads to the merge compatibility graph shown in (b) for both processes. Channel $a$ can be merged with $c$ and $d$, and $b$ can be merged with $d$ and $e$ because of sequencing dependencies. Channels $a$ and $b$ cannot be merged because they are both blocking and no sequencing dependency exists between them. Channel $c$ cannot be merged with either $d$ or $e$ because of incompatible anchor sets. The anchor sets of $d$ and $e$ have the relation $A(e) \subseteq A(d)$, but the offset from $b$ to $d$ is less than the offset from $b$ to $e$. The minimum clique covering of the merge graph results in three physical channels. For example, a possible merging would be $\{a, c\}$, $\{b, e\}$, $\{d\}$.
From the conditions for merging discussed in the previous section, it should be clear that there is a better chance to merge non-blocking operations than blocking operations. This means that the interface matching technique, by creating non-blocking operations, increases the ability to merge channels. Furthermore, the conditions also imply that additional steps can be taken to augment the merging process. Serializing the messages by adding sequencing dependencies, modifying the anchor sets, and altering the scheduling offsets can be used to satisfy the conditions. The modification of anchor sets is accomplished by adding sequencing edges and/or modifying offsets, so we are left with these two techniques to improve the merging step.
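The clique-partitioning step of Example 11 can be reproduced with a small exhaustive search. The edge list below is read off the prose of the example (not from Figure 14 itself), and the backtracking routine `min_clique_cover` is an illustrative stand-in that is only practical for a handful of channels; it is not claimed to be the partitioning method used in the synthesis system.

```python
def min_clique_cover(vertices, edges):
    """Exhaustive minimum clique partition of a small merge compatibility graph."""
    compatible = {frozenset(e) for e in edges}
    best = None

    def place(i, cliques):
        nonlocal best
        if best is not None and len(cliques) >= len(best):
            return  # cannot improve on the best partition found so far
        if i == len(vertices):
            best = [sorted(c) for c in cliques]
            return
        v = vertices[i]
        for c in cliques:  # try to add v to an existing clique
            if all(frozenset((v, u)) in compatible for u in c):
                c.append(v)
                place(i + 1, cliques)
                c.pop()
        cliques.append([v])  # or open a new physical channel for v
        place(i + 1, cliques)
        cliques.pop()

    place(0, [])
    return best


channels = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("a", "d"), ("b", "d"), ("b", "e")]  # as described in Example 11
print(len(min_clique_cover(channels, edges)))  # 3
# Adding compatibility between d and e, as obtained by rescheduling in Example 12:
print(min_clique_cover(channels, edges + [("d", "e")]))  # [['a', 'c'], ['b', 'd', 'e']]
```

The search may return a different three-channel partition than the one quoted in the text, since several partitions of minimum size exist for this graph.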
There are several disadvantages to adding edges and increasing offsets. Although these techniques can be used in some cases to reduce control costs [20], they can only increase the latency and possibly increase the control costs. In the case of serialization, there are drawbacks in addition to the increased latency. Adding sequencing edges has the effect of reducing the degree of concurrency. Furthermore, these new edges can potentially change the anchor sets of operations in the graph. In some cases this is tolerable, but in our case modifying the anchor set of an operation can cause it to become uncontrollable. If this occurs, then the results from the matching step would be invalidated. Some non-blocking operations would have to be changed back to blocking. Therefore, there is no attempt to introduce new serialization after scheduling has been performed, and this technique should only be used before the interface matching procedure is applied.
So in order to obtain increased merging, we are left with modifying the existing schedule by increasing the value of constraints. It was shown that for two operations $a$ and $b$ to be merged, the relation $\subseteq$ must hold between the anchor sets $A(a)$ and $A(b)$. If this relation does not exist, there is no need to reschedule the operations because they cannot be merged. Therefore the first step is to partition all the anchor sets into sets such that for each partition, an ordering among the anchor sets exists under the $\subseteq$ relation. For example, consider a case where we have the four message anchor sets $\{w, x\}$, $\{x, y\}$, $\{w, x, y\}$, and $\{w, x, z\}$. We can partition these into two sets $\{\{w, x\}, \{w, x, z\}\}$ and $\{\{x, y\}, \{w, x, y\}\}$ where the relation holds.
Example 12: The example in Figure 15 is taken from Example 11. The only difference in this example is that rescheduling has been performed on operation $d$. We saw in the previous example that the anchor sets of $d$ and $e$ had the relation $A(e) \subseteq A(d)$. However, the schedules of these operations did not permit them to be merged. Operation $d$ has been delayed by 3 cycles with respect to $b$ to ensure that $e$ will always execute before $d$. Doing this allows the two operations to be merged. The resulting merge graph is shown in (b). Now the minimum clique covering results in only two physical channels. The merged channels in this case are $\{a, c\}$ and $\{b, d, e\}$.
The second step is to reschedule the operations based on how their anchor sets have been partitioned. The algorithm to partition the anchor sets and reschedule the operations is shown in Figure 16. The algorithm is a heuristic that will find a new schedule such that maximum channel merging is achieved. However, it does not necessarily find the schedule with minimum latency. Due to the presence of unbounded delay operations, the meaning of minimum latency is not clear.
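The partitioning of anchor sets into groups ordered by inclusion can be sketched as a minimum chain partition. The code below is not the heuristic of Figure 16; it substitutes a standard bipartite-matching construction, assumes the relation in question is set inclusion, and assumes the anchor sets are distinct. The function name `min_chain_partition` is hypothetical.

```python
def min_chain_partition(anchor_sets):
    """Split distinct anchor sets into the fewest groups totally ordered by inclusion."""
    sets = [frozenset(s) for s in anchor_sets]
    n = len(sets)
    # succ[i]: indices of strict supersets of sets[i] (inclusion is transitive)
    succ = [[j for j in range(n) if sets[i] < sets[j]] for i in range(n)]
    match_of = [None] * n  # match_of[j] == i means sets[i] is directly followed by sets[j]

    def augment(i, visited):
        """Kuhn's augmenting-path step for maximum bipartite matching."""
        for j in succ[i]:
            if j in visited:
                continue
            visited.add(j)
            if match_of[j] is None or augment(match_of[j], visited):
                match_of[j] = i
                return True
        return False

    for i in range(n):
        augment(i, set())

    next_of = {i: j for j, i in enumerate(match_of) if i is not None}
    chains = []
    for start in range(n):
        if match_of[start] is None:  # nothing precedes this set, so it starts a chain
            chain, k = [], start
            while k is not None:
                chain.append(set(sets[k]))
                k = next_of.get(k)
            chains.append(chain)
    return chains


groups = min_chain_partition([{"w", "x"}, {"x", "y"}, {"w", "x", "y"}, {"w", "x", "z"}])
print(len(groups))  # 2
```

On the four anchor sets of the example this yields the same two groups as in the text; a greedy first-fit assignment would not, because placing $\{w, x, y\}$ after $\{w, x\}$ leaves $\{x, y\}$ and $\{w, x, z\}$ in groups of their own.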
The time complexity of the algorithm is $O(|A|\,|M|\,|V|^2)$, where $|A|$ is the number of anchors, $|M|$ is the number of messages, and $|V|$ is the overall number of vertices.
If processes $P_1$ and $P_2$ contain any control-flow structure such as loops and conditionals, then their corresponding constraint graph representations are hierarchical in the sequencing graph model we are using. This causes difficulty in several areas. First, in our formulation timing constraints can only be specified between vertices of the same graph, which means that if we specify constraints between messages occurring in different graphs, they must be distributed across the hierarchy. Second, since we synthesize each graph in the hierarchy separately, it is necessary to partition the message links such that messages within each partition exist solely between two graphs in the hierarchy. Interface composition is then applied to each partition in turn. Since the temporal relationship between operations across the graph hierarchy is not directly captured, there is a possible loss of accuracy in the hierarchical extraction of timing relationships.
Until this point, we have ignored hierarchy and have only considered the communication between two distinct graphs in separate processes. Our methods can be applied to any two graphs at any level of hierarchy; however, none of the procedures work across hierarchy boundaries. We can deal with these problems in two ways. First, we can restrict as much of the communication as possible to a single graph in the hierarchy. Typically this would be the top level of the process. This has the limitation that the control reduction and channel merging techniques cannot be applied between communication events that occur at different levels of control hierarchy. A second approach is to reduce the hierarchy as much as possible by flattening the control structures. This could be done for most of the control structures except for loops, which must be represented through hierarchy.
Although our techniques work on multiple processes, only simple point-to-point messages are supported. Messages with multiple senders or receivers (e.g. broadcasts) are not considered. More work is necessary to support other types of messaging and synchronization, which in some cases is highly desired. This would require the analysis of more than two processes simultaneously, which is currently not supported.
There are no restrictions on the relative repetition rates of processes for interface matching to be used. This is in contrast to some synthesis systems that assume all processes start and restart at the same time. This allows system designs that have a combination of processes iterating at varying rates. Communication has the effect of synchronizing such processes, but in general the processes remain synchronized for only a short time after completion of the communication. Interface matching takes advantage of this time when the processes are synchronized to simplify other communication. But once the processes cease to be in synchrony, no further optimizations can be achieved.
A limitation in the matching algorithm is that blocking operations are converted to non-blocking only if the latency of the result is not increased. So in the case when an increase is tolerable, the algorithm might not find the maximum number of non-blocking operations. This leads to a problem in the satisfaction of maximum timing constraints across blocking message operations. These types of constraints are allowed in the specification, but it is not always possible to find a solution satisfying all the constraints. If a solution is not found, it may be the case that the specification is simply overconstrained; however, it may also be the case that a possible solution was not found because it would result in increased latency. The solution to these problems is to allow the matching step to selectively allow increases in the latency when necessary.
The first example consists of two processes in a system that models the transmission of digital data through a lossy serial line. The encoder process prepares the data and the associated parity information. This data is sent on a lossy line to a decoder process, which uses the parity information to correct transmission errors if possible. In this example all of the communication is serialized in the original specification. Therefore, after optimization the first message remains blocking, while the rest have been converted to non-blocking. Furthermore, because of the serialization, all messages can be merged together using a single port.
The second example is the elliptic filter taken from [22], which has been partitioned into two processes using the Vulcan high-level partitioning tool. After partitioning, there are three communication messages used to transfer data between the partitions. These three messages can be simplified down to a single blocking message and two non-blocking messages, all sharing a single port.
The final example consists of two processes from a decryption system. Due to control structures, both processes contain several levels of hierarchy in their constraint graphs. The message dependency graphs for these two processes are shown in Figure 17. Graphs from the two processes that contain the same messages are paired together, and these pairs are processed separately. The receive process is responsible for handling the low-level details of reading data from an incoming serial line, extracting header and data information, checking for and correcting transmission errors, and sending the data off to a decryption process to obtain the clear text. The decrypt process receives data and proceeds to decrypt it based on the header information. While processing the incoming data, the clear text is sent back as it becomes available. Finally, some trailer information is sent back to the receive process. Due to the concurrency in this example, the final circuit still has multiple blocking messages.
VIII. Conclusion and Future Work
In this paper, we described an approach to the analysis and synthesis of interfaces for time-constrained concurrent systems. We proposed an explicit representation of the interface between processes in terms of message dependency graphs. We described the interface matching technique, which minimizes the number of blocking messages needed for valid, deadlock-free communication under detailed timing constraints. We also presented a method for sharing physical channels among multiple communication channels in order to reduce the communication hardware between processes.
We are working to extend the formulation to better support hierarchy in the model. Currently, it is necessary to partition the messages such that messages within each partition originate from a single graph in the hierarchy and terminate in a single graph in another hierarchy. For many time-critical designs where the control-flow structure of the sending and receiving processes is similar (to minimize the effect of control delays), this assumption is not a severe limitation. For other designs, there is potential loss of accuracy in extracting the timing requirements because the relationship across hierarchy may be lost. One solution is to increase the scope of analysis by transforming the description to reduce the number of partitions, e.g., by flattening or restructuring the control flow. Another approach is to extend the formalism by using automata to describe the time progression of message operations on channels. These issues are currently under investigation.
