A main advantage of control composition with modal processes [4] is the enhanced retargetability of the composed behavior over a wide variety of target architectures. Unlike previous component models that hardwire the coordination behavior either explicitly in the components or implicitly in the underlying model of computation, modal processes decouple component functionality and coordination protocols. Retargetability is achieved through the synthesis of distributed mode managers, which abstract away low-level synchronization and control communication details that would otherwise be exposed to the component designer. This paper presents an algorithm for the synthesis and optimization of distributed coordination controllers by computing an optimal projection of the global state space onto each processor. It not only minimizes interprocessor communication traffic for coordination but also reduces controller complexity by minimizing replication.
INTRODUCTION
In IP-based design, designers must be concerned with the integration of high-level components. A central issue in system integration is that components must have compatible protocols. Otherwise, a variety of "glue" mechanisms must be inserted for protocol translation. Now, the term "protocol" applies to many levels. At the low level, designers are concerned with signaling on the pins and the datalink layer. Going up the protocol stack, another layer might be concerned with packetization or session establishment. What is usually overlooked is that above the communication protocol stack, the components must also agree to another kind of a protocol, namely coordination. Coordination governs what different components must do to perform a task collectively. For example, one component may perform one computation while the other component handles communication, or a set of components may run in low-power mode, which may necessitate other components to reconfigure accordingly.
The difference with coordination is that, unlike lower-level protocols, which are concerned with the transport of communication messages or signals, coordination protocols cannot be easily separated out with an API; instead, they are deeply ingrained in the component's functionality. As a result, they are the main cause of component modification and are becoming a serious obstacle to IP reuse.
The modal process model [5] was proposed to address some of these fundamental problems in IP reuse. It defines each component's coordination behavior declaratively rather than imperatively; moreover, this representation of coordination is composable. The implication is that the coordination controllers required for system composition can be synthesized and optimized for each composition. Another key benefit is that this separation of policy and mechanism exposes many optimization opportunities for distributed target architectures. The goal of this paper is thus to explore the synthesis and optimization of distributed embedded systems modeled with modal processes.
This paper first provides a brief review of modal processes and compares them with other approaches. Next, we propose a number of strategies commonly considered in partitioning. We then present an algorithm for optimizing interprocessor control communication in distributed architectures. The results from applying this algorithm are shown and discussed.
RELATED WORK
Today's component models can be classified in many ways. They can be either platform-based or interface-based [13]. They can also be domain-specific, which normally means either data-dominated or control-dominated.
Platform-based components are designed for integration on specific implementation frameworks or infrastructures. These could be specific boards or busses; in the case of software components, the platform is usually the middleware or the operating system. Platforms are a fast way to assemble systems that can be operational shortly. In addition, platforms can be long lasting, and they may support system evolvability and incremental upgrade. They must standardize on protocols at multiple levels and often also fix architectural assumptions. While this may be desirable, the overhead may be prohibitive for small, cost-conscious or high-performance embedded systems.
Interface-based components are only known by their outside interfaces, which may be defined at several levels as well. They are not tied to specific platforms. Their interfaces are flexible or abstract, such that one or more layers of the protocol stack can be replaced without affecting the component functionality [8]. However, synthesis is required to map the abstract constructs to those in the concrete platform before the system is operational.
Permission to make digital/hardcopy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2000, Los Angeles, California (c) 2000 ACM 1-58113-188-7/00/0006..$5.00
Today's component models, whether platform-based or interface-based, commonly force designers to express coordination as an inseparable part of component functionality. This is especially true with most control-dominated, FSM-like models (StateChart and variants [15], Esterel [2], SDL [16]) or object-oriented models composed by method calls. Attempts to capture coordination using hierarchical state machines have resulted in overlapping states [10], which are difficult to understand and reuse. Another approach is to limit the components to specific classes of coordination protocols. Several abstract models exploit patterns in the behavior to enable optimizations. For example, synchronous dataflow [12] (SDF) coordinates by data dependency and has fixed input-output correlations; communicating sequential processes [11] (CSP) coordinates by rendezvous and has the property of speed independence. These properties enable the optimization of the coordination mechanism, usually in the form of a static scheduler that incurs no runtime overhead. However, these are domain-specific solutions.
Several models have been proposed to address the problems with hardwired coordination protocols. Synchronizers [9] are a way of extracting synchronization policies from the objects and enabling their substitution. Mediators [14] bind method calls to events such that coordination changes need to be reflected only in the binding, rather than in the components. Both are software frameworks and thus they are not amenable to automatic, topology-specific optimizations or real-time scheduling. As a more abstract, interface model, DCCA [1] is perhaps the closest to our approach in that components are detached from their coordination behavior, which is stated in boolean algebra terms and synthesized as distributed controllers. They support a broadcast run-time environment, although broadcast may not be suitable for all architectures. Our proposed technique does not require broadcast and should be able to readily complement their work. Our modal processes model takes an interface-based approach to high-level component modeling. It lets designers compose components by specifying correlations on the modes of operation. These correlations capture not just synchronization or invocation but also mirroring, exclusion, sequencing, and many combinations of patterns required for coordinating concurrent processes. Decoupling coordination and component functionality enhances modularity and enables synthesis and optimization of coordination controllers for any topology, and this would not be possible with platform-based approaches.
SYSTEM DESIGN FLOW
The designer creates a high-level model of the system by instantiating components and composing them. The components, modeled as modal processes, have ports and modes on their interfaces. Ports are for data composition while modes are for control composition. As in many models, ports are connected by channels; the unique feature about modal processes is that modes are related to other modes by constraints called abstract control types (ACT). These specify the rules on how one mode change can imply another set of mode changes, and they are the primitives for defining coordination protocols in a declarative way.
A mode change is called a vote, and it is generated by the process that owns the mode. ACTs constrain modes by effectively specifying a set of transfer relations on mode changes, as their purpose is to imply additional votes. Each ACT instance can be written as actName(list of modes constrained). Each ACT is sensitive to changes to a subset of its modes, and it triggers mode changes to another subset in response. One example of an ACT is unify, which is sensitive to all mode changes and propagates them to all other modes. Another example is parent(m, c[1..n]), which mimics a hierarchical state machine: when the superstate m exits (deactivates), all children c[1..n] must deactivate; activating any child c[i] activates m as well. Another example is guardian, a variation where the children are disallowed to activate when m is inactive. The reader is referred to [3] for a more detailed list of ACTs.
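The way ACTs imply additional votes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`unify`, `parent`, `propagate`), the dictionary-based configuration, and the rule that each mode is decided at most once per propagation are assumptions made for illustration, and only the unify and parent behaviors described above are modeled.

```python
from collections import deque

def unify(modes):
    """unify: a change to any constrained mode is copied to all the others."""
    def implied(mode, active):
        if mode not in modes:
            return []
        return [(other, active) for other in modes if other != mode]
    return implied

def parent(m, children):
    """parent: deactivating m deactivates every child; activating a child activates m."""
    def implied(mode, active):
        if mode == m and not active:
            return [(c, False) for c in children]
        if mode in children and active:
            return [(m, True)]
        return []
    return implied

def propagate(config, vote, acts):
    """Apply one vote plus all transitively implied votes to a copy of config."""
    config = dict(config)
    decided = set()            # assumption: a mode is decided once per propagation
    queue = deque([vote])
    while queue:
        mode, active = queue.popleft()
        if mode in decided:
            continue
        decided.add(mode)
        if config[mode] == active:
            continue           # no change, so nothing further is implied
        config[mode] = active
        for act in acts:
            queue.extend(act(mode, active))
    return config
```

With hypothetical modes `x`, `m`, `c1`, and `c2`, the ACT list `[unify(["x", "m"]), parent("m", ["c1", "c2"])]` turns a single deactivation vote on `x` into deactivations of `m` and both children.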
Once the abstract model is specified, the designer maps it onto a target architecture. We assume the designer supplies the system topology for the purpose of system optimization; other details used for communication-level optimizations are outside the scope of this paper. The separation of behavior from architecture is one way we achieve better retargetability and expose optimization options.
Once both the abstract behavioral model and its mapping to the target architecture are obtained, then the synthesis tool implements the mechanisms needed to enable the system-level integration of these components. This paper deals with the realization of those mode relationships. When one component makes a mode change that affects the modes of components on remote processors, those components need to be notified of the mode change with a message, to which they may respond with acknowledgments or by synchronization. These control messages are dependent on not only the target architecture but also on how the designer maps the components to that architecture. By automating the synthesis of these control messages, we further eliminate low-level, architecture-specific design tasks that are exposed to the designer in other component models today.
STRATEGIES FOR DISTRIBUTED CONTROL
A mode manager must have access to a projection of the system configuration, where a configuration is a bit-vector representation of the component modes. Different projections are kept coherent by means of interprocessor communication. To take full advantage of distributed architectures, a partitioning algorithm should minimize communication among mode managers and minimize projection sizes. However, in general it is not possible to satisfy both goals, and the designer must make tradeoffs. This section considers a few strategies for replication vs. communication tradeoffs.
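A configuration and its projections can be sketched as bit vectors in Python. The mode ordering, the helper names (`encode`, `project`, `coherent`), and the use of integers as bit vectors are assumptions for illustration; the paper does not fix a bit layout.

```python
# Hypothetical mode ordering; the paper does not prescribe one.
MODES = ["M", "A", "S", "F", "R", "T"]
BIT = {mode: 1 << k for k, mode in enumerate(MODES)}

def encode(active_modes):
    """Pack a set of active mode names into a configuration bit vector."""
    config = 0
    for mode in active_modes:
        config |= BIT[mode]
    return config

def project(config, visible_modes):
    """Keep only the bits for the modes that one mode manager tracks."""
    return config & encode(visible_modes)

def coherent(proj_a, proj_b, shared_modes):
    """Two projections are coherent if they agree on every shared mode."""
    return project(proj_a, shared_modes) == project(proj_b, shared_modes)
```

A smaller `visible_modes` set means a smaller projection, but every mode that is dropped from a projection and still matters locally has to be learned through a message, which is the tradeoff discussed in this section.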
We will use a robot example to illustrate the concepts. We assume that the robot has five processes that are mapped onto three processors: joystick and pilot processes are on processor P1, bumper process on P2, and sonar and wheels processes on P3. Several partitionings of the mode manager are possible for a given process partitioning. The semantics of the actual ACTs should be of no concern.
Minimal replication
To minimize replication, one can host an ACT on the processor that hosts the most modes constrained by the ACT. This approach requires no replication of ACTs. For example, in Fig. 1 [...] broken arbitrarily, and replicated modes are those constrained by the ACT but not homed locally. No ACTs are replicated, and the choice of host for each interprocessor ACT minimizes the number of replicated modes. The receiver replies with an ACCEPT or DENY message.
[Figure panel labels: voters for modes homed on processor 1; voters for modes homed on processor 2; voters for modes homed on processor 3.]
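The host-selection rule for minimal replication can be sketched in a few lines of Python. The function name `choose_host`, the dictionary `home` (mode to processor), and the lowest-processor-number tie-break are assumptions; the text only requires that ties be broken arbitrarily.

```python
from collections import Counter

def choose_host(act_modes, home):
    """Pick the processor that homes the most modes constrained by an ACT.

    act_modes: modes constrained by the ACT.
    home: dict mapping each mode to the processor it is homed on.
    Returns (host processor, modes that must be replicated on the host).
    """
    counts = Counter(home[m] for m in act_modes)
    best = max(counts.values())
    # Assumption: break ties by taking the lowest processor number.
    host = min(p for p, c in counts.items() if c == best)
    replicated = sorted(m for m in act_modes if home[m] != host)
    return host, replicated
```

For example, with hypothetical modes `a` and `b` on processor 1 and `c` on processor 2, an ACT over all three is hosted on processor 1 and only `c` is replicated there.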
Minimal replication incurs more communication than necessary and is thus expensive for distributed topologies. Consider switching to manual mode: P1 needs to communicate +M (i.e., activate mode M) to P3 and -A (deactivate mode A) to P2. On P2, uA transitively propagates -A to mode S, and the ACT gS implies the additional mode changes {-F, -R, -T}, which are in turn communicated to P3. This is slow because the messages and replies are propagated serially: P3 must accept these messages and reply to P1 and P2, and P2 in turn replies to P1. P2 also communicates {-F, -R, -T} unnecessarily, since they could have been deduced locally from -S by the gS ACT.
Maximal ACT replication
One attempt to reduce communication is to transmit only original mode changes initiated by the components, by maximizing local evaluation of transitive votes. [...] indirectly on the modes they host.
A possibly surprising result is that maximal replication does not necessarily minimize communication, either. For example, in Fig. 3 suppose the system is initially in D mode when a mode change of +E is requested. P3 is told to go to M mode; however, the communication to P2 is wasted, because it changes only the replicated modes without affecting any modes hosted on P2. In other words, it fails to exploit temporal don't-cares and results in unnecessary communication.
Optimization strategy
Optimal partitioning is somewhere between the minimal and maximal replication schemes. Standard min-cut partitioning algorithms can be applied to determine the optimal cross-section bandwidth required in the worst case; however, the actual bandwidth may be smaller, as only a subset of votes is ever needed at one time. An optimal partitioning would project just enough ACTs and modes to eliminate wasted communication. 
MODE MANAGER PARTITIONING
This section presents an algorithm for minimizing communication as the primary objective and reducing ACT replication as the secondary objective. The main idea is to start with a full projection onto each processor, and then find a minimum cut in each projected ACT graph for reducing communication. An allocation $\alpha$ is a function that maps a process $\pi \in \Pi$ to its processor number in $[1..n]$, and we say that the process $\pi$ is homed on processor $\alpha(\pi)$. Because a mode belongs to exactly one process, we overload the function $\alpha$ to map a mode to its processor ID, without ambiguity, namely $\alpha : M \to [1..n]$. The mode $m$ is then said to be homed in process $\pi$ on processor $\alpha(m) = \alpha(\pi)$. The algorithm projects the modes and ACT instances onto the individual processors according to the allocation. The output consists of the projected ACT-instance graphs for each processor and the interprocessor communication edges $E_C$.
Representation

Algorithm
The algorithm is shown in Fig. 6. It uses the max-flow min-cut algorithm to determine the boundary for control communication that incurs the least cross-section communication bandwidth. The boundary also dictates which ACTs and modes need to be replicated. To solve this problem as an instance of max-flow min-cut, the algorithm constructs a flow network for each processor. A flow network is an abstracted representation of the ACT-instance graph, with local modes and local ACTs locked down as the sinks, and a set of remote modes and ACTs as the sources. Once the cut is determined, the vertices (modes and ACTs) on the source side remain remote, while those on the sink side must be replicated if not initially local.
Flow Network Construction
The corresponding flow network can be constructed based on the ACT instance graph. The number of votes an ACT can cast is represented as the edge capacity between a pair of nodes (in, out) in the network. For example, the ACT guardian(E, [F, R, T]) (Fig. 5(a)) can be represented as a pair of vertices with an edge of capacity 3, the number of modes that this ACT can register (Fig. 5(b)). All incoming arg-links (Sec. 5.1) in the ACT instance graph correspond to incoming edges to the in-node; all outgoing arg-links correspond to outgoing edges from the out-node. One difference is that if a mode argument fans out to several ACT instances (as in Fig. 5(c)), then the out-node must first be connected to a new node with a one-capacity edge before fanning out (Fig. 5(d)). Thus, a single voter cannot generate more than one vote. As a shortcut, arg-links between the same pair of ACT instances can be grouped into a single edge with capacity equal to the number of arg-links.
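The construction rules above can be sketched as three small Python helpers that fill a capacity table. The node naming scheme (`.in`, `.out`, `fan:`), the helper names, and the choice of capacity 1 on the branches after a fan-out node are assumptions for illustration; the text specifies only the capacity of the edge into the fan-out node.

```python
from collections import defaultdict

def new_network():
    """Edge capacities keyed by (tail, head); parallel edges accumulate."""
    return defaultdict(int)

def add_act(cap, act, votes):
    """Split an ACT instance into an in-node and an out-node.
    The internal edge carries at most `votes` votes."""
    cap[(act + ".in", act + ".out")] += votes

def add_arg_links(cap, src_act, dst_act, count=1):
    """Group `count` arg-links between the same pair of ACTs into one edge."""
    cap[(src_act + ".out", dst_act + ".in")] += count

def add_fanout(cap, src_act, mode, dst_acts):
    """Route one mode argument to several ACTs through a capacity-1 edge,
    so that a single voter contributes at most one vote."""
    fan = "fan:" + mode
    cap[(src_act + ".out", fan)] += 1
    for dst in dst_acts:
        # Assumption: each branch after the fan-out node also carries one vote.
        cap[(fan, dst + ".in")] += 1
```

Calling `add_act(cap, "guardian_E", 3)` reproduces the capacity-3 pair described for guardian(E, [F, R, T]); the identifier `guardian_E` is only a label chosen here.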
Partitioning Loop
The same network is used for the partitioning of all processors, except that different nodes are designated as sources and sinks. The sources and sinks are a mechanism for the algorithm to lock in vertices that are fixed in a partition. For notation, $M_i$ is the set of modes homed on processor $i$. The set of vertices $V$ consists of all the ACT instances, and $V_i$ is the subset of those ACTs $v$ that constrain only $M_i$. The set $X = V - \left(\bigcup_{i=1}^{n} V_i\right)$ contains ACTs that constrain modes across processors. For the purpose of max-flow min-cut on processor $i$, the nodes that correspond to the vertices in $V_i$ are marked as the sinks, and they are all connected to a unique supersink by an edge with infinite capacity. The sources in the flow network are the remote modes $W(i) = \{ m_j \in M_j \mid i \neq j \}$ whose processes can indirectly vote on $M_i$. The set $W(i)$ can be determined by simply following the arg-links in $V_i$ in reverse in the ACT instance graph. A supersource node can be added to the flow network.
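The sets used in the partitioning loop can be computed as in the following Python sketch. The data layout is an assumption: `acts` maps an ACT name to the set of modes it constrains, `home` maps a mode to its processor, and `links` maps an ACT name to a pair (modes it is sensitive to, modes it can trigger), standing in for the arg-links of the ACT instance graph, whose exact representation is not reproduced here.

```python
def classify_acts(acts, home, n):
    """Return (M, V, X): homed modes, intraprocessor ACTs, interprocessor ACTs."""
    M = {i: {m for m, p in home.items() if p == i} for i in range(1, n + 1)}
    # V[i]: ACTs that constrain only modes homed on processor i.
    V = {i: {a for a, ms in acts.items() if ms <= M[i]} for i in M}
    # X: ACTs left over, i.e. those that constrain modes across processors.
    X = set(acts) - set().union(*V.values())
    return M, V, X

def remote_voters(i, links, home):
    """Approximate W(i): remote modes that can indirectly vote on modes homed on i.
    Follows the (assumed) sensitivity links in reverse, starting from M_i."""
    reached = {m for m, p in home.items() if p == i}
    frontier = list(reached)
    while frontier:
        mode = frontier.pop()
        for sensitive, triggered in links.values():
            if mode in triggered:
                for s in sensitive:
                    if s not in reached:
                        reached.add(s)
                        frontier.append(s)
    return {m for m in reached if home[m] != i}
```

For a two-processor system with a mutex over one mode from each processor and one parent ACT per processor, `classify_acts` places each parent ACT in its processor's $V_i$ and the mutex in $X$.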
To compute the max-flow min-cut, several standard algorithms can be applied, including Ford-Fulkerson or the Edmonds-Karp implementation [7], which runs in $O(VE^2)$ time. The cut $c_i = (S, T_i)$ determines the projection: $S$ is the set of vertices on remote processors, and $T_i$ is the set for the local processor $i$. The cut set represents the votes that are registered and transmitted by remote processes or ACTs to the local processor $i$. Theoretically, the max-flow min-cut algorithm would allow votes to propagate in the backward direction, but in this construction they are not possible because those communications would have been eliminated by ACT duplication. Note that $T_i$ includes nodes for $V_i$ by definition, but it can also include nodes that correspond to members of $V_{j \neq i}$ and of interprocessor ACTs $X$. While each intraprocessor ACT $v_i$ has a well-defined home processor (namely $i$), the interprocessor ACTs $x \in X$ do not have a predetermined home processor, and at least one must be assigned. If a cut $T_i$ does not include an $x \in X$, then $x$ must be implemented on at least one of the other processors $j \neq i$. The partitioning algorithm is run $n$ times for $n$ processors.
Example
Consider the simple hierarchical FSM-like example shown in Fig. 7. The system has six modes, {B, C, D, E, F, G}. For the constraints, modes B and C are constrained by the mutex ACT m1([B, C]), and they are also constrained as parents by the ACTs g1(B, [D, E]) and g2(C, [F, G]), respectively. Suppose the designer partitions the modes into two sets, {B, D, E} and {C, F, G}.
To construct the flow network, each ACT instance is turned into a pair of vertices connected by an edge with capacity equal to the number of modes. For example, each processor can vote on three modes ({B, D, E} on P1 and {C, F, G} on P2). They are shown in the flow graph with a capacity-3 edge between the two vertices that are enclosed by the corresponding dashed box. The ACTs g1 and g2 are similarly constructed. The mutex ACT m1 can register two votes, but the votes are individually fanned out. Therefore, additional fan-out nodes B and C are introduced to limit the edge capacity to one unit.
To determine the projection on P2, the sinks for the flow network are first chosen. They correspond to $V_2$'s ACTs, namely g2 and the two polar vertices for modes homed on P2. The source nodes for the flow network can be determined by following the edges in the ACT instance graph backwards from $V_2$ to all remote voters, and these are P1's source ACT. Two cuts have the same max flow of 1, as shown in Fig. 8. The cut shown on the left indicates that P1 should transmit activation of B to P2 for the evaluation of the mutex ACT m1, which is implemented on P2. The one shown on the right does not have a copy of m1; however, it assumes P1 transmits the activation of C to P2 after having evaluated m1 on P1.
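The projection onto P2 can be checked with a small Edmonds-Karp sketch in Python. The edge list is one reading of the textual description, since Figs. 7 and 8 are not reproduced here: the node names, the supersource `S`, the supersink `T`, and the exact wiring of the fan-out nodes are assumptions. The function returns the flow value together with the source side of one minimum cut, namely the one closest to the source.

```python
from collections import deque

INF = float("inf")

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp: repeatedly augment along a shortest residual path.
    Returns (flow value, nodes on the source side of a minimum cut)."""
    res = {}
    for (u, v), c in cap.items():
        res.setdefault(u, {}).setdefault(v, 0)
        res.setdefault(v, {}).setdefault(u, 0)  # reverse residual edge
        res[u][v] += c
    flow = 0
    while True:
        prev = {s: None}
        queue = deque([s])
        while queue and t not in prev:          # breadth-first search
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in prev:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            return flow, set(prev)              # sink unreachable: done
        path, v = [], t
        while prev[v] is not None:
            path.append((prev[v], v))
            v = prev[v]
        push = min(res[a][b] for a, b in path)  # bottleneck capacity
        for a, b in path:
            res[a][b] -= push
            res[b][a] += push
        flow += push

# Assumed flow network for the projection onto P2.
cap = {
    ("S", "P1.in"): INF,        # supersource feeds P1's source ACT
    ("P1.in", "P1.out"): 3,     # P1 can vote on B, D, E
    ("P1.out", "fan:B"): 1,     # fan-out node for mode B
    ("fan:B", "g1.in"): 1,
    ("g1.in", "g1.out"): 2,
    ("fan:B", "m1.in"): 1,
    ("m1.in", "m1.out"): 2,     # mutex m1 can register two votes
    ("m1.out", "fan:C"): 1,     # fan-out node for mode C
    ("fan:C", "g2.in"): 1,
    ("g2.in", "g2.out"): 2,
    ("g2.in", "T"): INF,        # g2 is local to P2, locked in as a sink
    ("g2.out", "T"): INF,
}
flow, source_side = max_flow_min_cut(cap, "S", "T")
replicate_m1_on_p2 = "m1.in" not in source_side
```

Under these assumptions the maximum flow is 1, matching the text, and the returned cut leaves m1 on the sink side, which corresponds to the alternative in which P1 transmits the activation of B and m1 is evaluated on P2.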
CONCLUSIONS
This paper presents the synthesis and optimization of distributed coordination controllers, or mode managers. Today's models require designers to sprinkle control messages for coordination, which can be architecture specific, error-prone, and difficult to change. By synthesizing and optimizing these coordination mechanisms, we enable design space exploration and enhance component reuse in control-dominated applications.
Unlike other automated partitioning tools, our technique is agnostic to functional partitioning, which may be user-guided or automated. Instead, we automate the optimization of coordination, which can be cleanly separated from component functionality, and this is made possible by the component model with modal processes.
This tool has been integrated into an embedded systems codesign framework [6], with specific support for control composition. Such a temporal approach is important for real-time constraints. Moreover, in low-power systems, components must be able to coordinate the power modes, in addition to coordinating functionality. Our work represents a first step towards enabling higher-level design in this increasingly complex problem space. Our future work includes not only addressing the issues specific to power coordination but also integrating our control-dominated coordination with formal dataflow models.
