As critical computer systems continue to grow in complexity, the task of showing that they execute correctly becomes more difficult. For this reason, research in software engineering has turned to formal methods, i.e., rigorous approaches to demonstrating the correctness of software systems. Unfortunately, the formal methods currently used in the design of concurrent systems do not provide any mechanisms for specifying and reasoning about the mapping of software to hardware. As a result, architectural constraints, even though they play an important role in the design process, are left out of the formal framework. In this paper, we show how to state architectural constraints in a formal notation, how to prove that programs are allocated correctly to the underlying architecture, and how to factor architectural considerations into a program derivation process which uses a mixture of specification and program refinements. The approach is illustrated by the derivation of two related programs that solve the same problem but are designed to work on distinct architectures.
Introduction
Increasingly, computers are used to control and monitor critical systems where failures are unacceptable. In many such systems, it is necessary to provide strong guarantees that each software component will function correctly. Because these systems are large, complex, and involve multiple computers, mere testing is not sufficient for demonstrating that the software is free of errors. Consequently, research in software engineering has been forced to consider more rigorous approaches to the specification, analysis, and construction of programs. Program derivation is a promising formal approach to constructing correct programs. The approach entails a series of mathematical manipulations of formal specifications defining the problem to be solved. An initial, abstract specification is gradually refined until it becomes sufficiently concrete as to suggest a direct realization in terms of some available programming language. The correctness of the final program is guaranteed by its construction.
Informal characterizations of the target architecture, i.e., architectural constraints, are often used to justify individual refinements and to guide the derivation process towards an efficient solution: the use of shared-memory or message-passing architectures, for instance, may lead to very different software structures and communication protocols. Nevertheless, because the architectural constraints are stated informally, they cannot be factored formally into the derivation process and cannot be used to prove that the resulting program can actually be executed correctly on the desired architecture. Current program derivation methods simply exclude considerations regarding the target architecture from the formal framework. It is true that formality does not necessarily make the design easier; in fact, when the mapping is trivial, a formal treatment may be entirely superfluous. There are systems, however, which demand a degree of dependability not achievable by informal means. Additionally, some of these high-reliability, high-performance systems are likely to involve specialized devices and novel architectures that will render informal reasoning about architectural constraints inadequate. (While the examples presented in this paper do not reach such a level of sophistication, one of them does illustrate the complications brought about by the use of specialized devices.) Finally, even for a system design where the mapping is treated in an informal manner, one may want to prove (after the fact) that the software will execute correctly on the given architecture. The techniques presented in this paper make it possible to do so. This paper shows that architectural constraints can be expressed formally using the same notational and logical system employed to specify behavior and that architectural constraints can be used to guide the process of constructing a program appropriate for the desired architecture. The methodology is two-phased.
Specification refinement is employed to construct a program which is correct but architecture-independent. Program refinement is used to transform this program into another which satisfies the behavior specification as well as formally stated architectural constraints ignored during the specification refinement phase. The compatibility between the program and the target architecture is guaranteed by construction. The program transformations are creative steps motivated by violations of the architectural constraints and often trigger limited re-verification of certain behavioral assertions.
The amount of re-verification is kept small by carefully monitoring the relation between program actions and the assertions whose validity they may impact. A technique for formal specification and verification of architectural constraints is an integral part of this approach.
This formal treatment of architectural considerations is an important first step towards a better understanding of fundamental issues dealing with the mapping from software to hardware and a prerequisite for eventual development of practical tools. It is from this perspective that we approached this work. We wanted to determine if a formal treatment of architectural constraints is feasible at all and if some of the program derivation techniques already in use could be applied to the task. Our results raise, however, a completely new set of open questions regarding which formalism is most appropriate for the task and how much of an impact a formal treatment may have on the design process. Much research remains to be done before even attempting to answer these difficult questions.
In the remainder of this paper, we elaborate on our methodology for deriving concurrent systems with an example taken from distributed simulation. Section 2 introduces the distributed simulation problem, outlines the specification method, and proposes an initial high-level specification. All specifications are given in terms of UNITY-style assertions [7] over the abstract state of the program to be developed. Section 3 traces the refinement of the initial specification up to the point where constructing a program satisfying the specification becomes a trivial task. Section 4 describes the Swarm [20] programming notation and an initial abstract program which is suggested
by the last refinement generated in Section 3. Swarm has a UNITY-like proof logic but adds the ability to deal with dynamic data and process creation and with various forms of synchrony. These features make Swarm better suited for use with our methodology than is UNITY, which favors a static program structure. (UNITY programs are a proper subclass of the class of Swarm programs, subject to simple mechanical translation.) Section 5 illustrates the process of adding architectural constraints to the specification. For this purpose, two example architectures are selected: (1) a bus-based, message-passing architecture with specialized hardware supporting a simple form of barrier synchronization, and (2) a unidirectional, message-passing ring. Section 6 discusses issues raised by this method and their relation to previous work. Section 7 provides some brief conclusions.
Initial Specification
To illustrate our methodology, we use an example inspired by previous work on distributed simulation [6]. Consider a network of sequential nodes which can exchange messages over error-free communication links. Each node executes a repeating sequence of operations involving the retrieval of pending messages from other nodes followed by some local action resulting in the updating of local data and the sending of messages to other nodes over the links. In such a physical system an operation taking place at time t may depend only on actions executed and on messages received before time t. We wish to specify and design a distributed program which will simulate this system. The main difficulty in building the program rests with the fact that simulated messages may not arrive in the same order as in the physical system and the duration of simulated actions may bear no relation to that of their physical counterparts. The obvious approach to solving these problems is to time-stamp both messages and actions and to make sure that all messages that must precede some action in fact arrive at the simulated node before the start of the action. The majority of the architecture-specific refinements are concerned with the proper ordering of messages and actions.
In the remainder of this section we provide a formal specification for the distributed simulation problem.
The notation used here is common to both UNITY [7] and Swarm [9, 21] and represents a special case of a linear-time logic. A program specification consists of safety and progress properties of the desired program. Safety assertions constrain the range of possible state transitions in which the program may engage; progress assertions define state transitions that are required to take place. The specification is concerned only with the program behavior and makes no reference to any non-functional constraints the program may have to meet (e.g., response time, reliability, cost). Furthermore, all assertions are stated in terms of a highly abstract state representation of the program. This is accomplished by replacing references to concrete data representations with predicates whose truth values are properly constrained and whose interpretation is given informally. For example, we can express the notion that some action a is to be executed at time t at node P by the predicate action(P,t,a).
Safety properties are specified using the unless relation, as in
p unless q
which asserts that if the program reaches a state in which the predicate p holds, p will continue to hold at least as long as q does not, which may be forever. Given the unless relation, one can easily introduce the notion of stability
stable p ≡ (p unless false)
which states that once p holds, it will continue to hold forever; and the concept of invariant
inv. p ≡ (INIT ⇒ p) ∧ (stable p)
which asserts that p holds initially and throughout the execution of the program. INIT characterizes the initial state of the program. Most often, the program initialization is not given explicitly but is implied by the invariant properties.
Progress properties are generally specified using the leads-to (written "↦") relation. The assertion p ↦ q simply states that if the program reaches a state in which p holds, it will eventually reach a state in which q holds.
One must note that p is not required to hold until q is established. The until relation, defined as
p until q ≡ (p unless q) ∧ (p ↦ q)
is used to describe a progress condition which requires p to hold until q is established.
Given a specific UNITY or Swarm program, a leads-to property is usually verified by relying on a more restricted property called ensures and by applying a certain repertoire of inference rules, including transitivity of the leads-to property. By definition, the relation p ensures q holds if (1) (p unless q) holds and (2) there is a specific program action which, if executed, establishes q and a guarantee (e.g., fairness) that the said action will eventually be executed.
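The proof obligations for unless, stable, and ensures can be made concrete on a small finite-state program. The following Python sketch is illustrative only: the representation of a program as a list of state-transforming functions, and the names `unless`, `stable`, `ensures`, `inc`, and `idle`, are assumptions introduced here and are not part of UNITY or Swarm.

```python
def unless(states, actions, p, q):
    """p unless q: from any state with p and not q, every action keeps p or establishes q."""
    return all(
        p(act(s)) or q(act(s))
        for s in states if p(s) and not q(s)
        for act in actions
    )

def stable(states, actions, p):
    """stable p is (p unless false)."""
    return unless(states, actions, p, lambda s: False)

def ensures(states, actions, p, q):
    """p ensures q: (p unless q) holds and a single action establishes q
    from every state satisfying p and not q. Fairness is assumed, not modelled."""
    if not unless(states, actions, p, q):
        return False
    return any(
        all(q(act(s)) for s in states if p(s) and not q(s))
        for act in actions
    )

# Toy program (hypothetical): a counter over 0..3 with two actions.
states = range(4)
inc = lambda s: min(s + 1, 3)   # increment, saturating at 3
idle = lambda s: s              # do nothing
actions = [inc, idle]

at_least_1 = lambda s: s >= 1
is_2 = lambda s: s == 2
is_3 = lambda s: s == 3

print(stable(states, actions, at_least_1))         # once >= 1, always >= 1
print(ensures(states, actions, is_2, is_3))        # inc takes 2 directly to 3
print(ensures(states, actions, at_least_1, is_3))  # no single action does this
```

In this toy program the last check fails even though the counter does eventually reach 3 from any value of at least 1; that progress claim would be established as a leads-to property by chaining ensures steps through transitivity.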
This is all the notation we need to start constructing a formal specification of the problem. Of course, reasoning about specifications and understanding proofs does require a deeper knowledge of the proof logic and of the underlying computational model. More explanations and notation will be introduced as needed.
State Representation
For specification purposes, it is convenient to assume an absolute global time. The predicate gclock(T) is used to denote the fact that the current time is T. The network's state can be characterized in terms of the actions which are to be executed, the messages which are to be delivered, and the local state of each node. The following predicates are used to represent this information:
state(P,s) The current data state of node P is s.
action(P,t,a) Action a will be executed at time t at node P. We define the special action ⊥ to be the halt action, i.e., when action(P,t,⊥) is true, node P is terminated. Messages for a terminated node must still be delivered.
message(P,Q,t,m) A message from P to Q with content m will be delivered at time t.
The predicates introduced so far define an abstract state representation for the program. Since not all states are acceptable, several invariants are introduced to constrain the state space in a reasonable way (all free variables are assumed to be universally quantified):
F1: The current time is unique
inv. ⟨Σ T : gclock(T) :: 1⟩ = 1
F2: Each node has a unique state
inv. ⟨Σ s : state(P,s) :: 1⟩ = 1
F3: Each node executes one action at a time
inv. ⟨Σ t,a : action(P,t,a) :: 1⟩ = 1
F4: Actions are never "in the past"
inv. action(P,t,a) ∧ a ≠ ⊥ ∧ gclock(T) ⇒ T ≤ t
F5: Message values are unique across the network
inv. message(P,Q,t,m) ∧ message(P',Q',t',m') ⇒ ((P,Q,t) = (P',Q',t') ⇔ m = m')
F6: Messages are never "in the past"
inv. message(P,Q,t,m) ∧ gclock(T) ⇒ T ≤ t
An alternate, but equivalent, formulation of F4 and F6 is useful:
We say that an action is enabled when the system clock has reached the action activation time and there are no messages to be delivered at the current time. A node is halted if its action is ⊥. Formally,
enabled(P,t,a) ≡ action(P,t,a) ∧ gclock(t) ∧ ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩ ∧ a ≠ ⊥
halted(P) ≡ ⟨∃ t :: action(P,t,⊥)⟩
Since many details of the network's computation are not relevant to the simulation program, we encapsulate them using several functions assumed to be available to the program:
u(P,s,m) returns the state of node P resulting when a message m is absorbed in state s.
e(P,t,s,a) returns the time when the action immediately following a will be executed given that a is executed on node P at time t in state s. The function e is strictly monotonic with respect to the argument t.
a(P,s,a) returns the name of the action which will be executed at node P upon the completion of the action a in state s.
c(P,Q,s,a) is true if node P sends a message to node Q as a result of executing the action a in state s.
l(P,Q,t,s,a) returns the delivery time for a message sent by node P to node Q as a result of executing action a in state s. Because a message is sent at the completion of the initiating action, l(P,Q,t,s,a) must exceed e(P,t,s,a).
v(P,Q,s,a) returns the contents of the message sent to node Q by node P when executing the action a in state s.
s(P,s,a) returns the new state of node P resulting from executing the action a in state s.
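To make the abstract state and the derived predicates more tangible, the following Python sketch models them with plain data structures. It is an illustration under stated assumptions, not the representation used in the derivation: the class `Network`, the constant `HALT` (standing in for the halt action), and the function `check_invariants` are hypothetical names. Storing one clock value, one state per node, and one action per node makes F1 to F3 hold by construction, so only F4 and F6 are checked explicitly.

```python
from dataclasses import dataclass, field

HALT = "halt"  # hypothetical stand-in for the special halt action

@dataclass
class Network:
    gclock: int                 # gclock(T): the current time is T
    state: dict                 # node -> data state, i.e. state(P,s)
    action: dict                # node -> (time, action), i.e. action(P,t,a)
    messages: set = field(default_factory=set)  # (sender, receiver, time, content)

def enabled(net, node):
    """The node's action is due now, is not the halt action, and no message is pending now."""
    t, a = net.action[node]
    pending_now = any(q == node and mt == net.gclock
                      for (_, q, mt, _) in net.messages)
    return a != HALT and t == net.gclock and not pending_now

def halted(net, node):
    """The node's action is the halt action."""
    return net.action[node][1] == HALT

def check_invariants(net):
    """F4 and F6: no non-halt action and no message lies in the past."""
    f4 = all(a == HALT or net.gclock <= t for (t, a) in net.action.values())
    f6 = all(net.gclock <= t for (_, _, t, _) in net.messages)
    return f4 and f6

net = Network(
    gclock=5,
    state={"P": 0, "Q": 0},
    action={"P": (5, "a1"), "Q": (7, "a2")},
    messages={("Q", "P", 5, "m1")},
)
print(enabled(net, "P"))   # P's action is due, but a message for P is pending at time 5
net.messages.clear()
print(enabled(net, "P"))   # with the message gone, P's action is enabled
```

The first call reports that P is not enabled because a message addressed to P is due at the current time; this mirrors the assumption that messages due at an action's execution time are absorbed before the action runs.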
We now turn our attention to the problem of describing the assumptions we make about the behavior exhibited by the network to be simulated. As a means of organizing ourselves, we discuss valid changes to the different components of the network state separately. Properties involving several state components are discussed when first encountered.
Time
In the real world, time advances in one-unit intervals. If we impose this constraint in our specification, we will make it impossible for the program to advance time in larger increments whenever there is no network activity at any node. To avoid this unnecessarily strong condition, we simply restrict time from moving backwards, leaving the increment unspecified:
This formulation also allows the movement of time to cease (because it is a safety property and not a liveness property) when the simulation is "finished."
Messages
Upon being created as a result of executing an action, a message travels through the network in some unspecified manner until its delivery time (F9). Messages are delivered in arbitrary order (we consider a message delivered when it is deleted by the receiving node), and the delivery of a message results in an atomic update of the state of the recipient (F10, F12). Messages can be created only by executing an action (F8, F11).
F8: Messages are only created by executing actions
action(P,t,a) ∧ ⟨set Q,t',m : message(P,Q,t',m) :: m⟩ = M unless ¬action(P,t,a) ∨ ⟨set Q,t',m : message(P,Q,t',m) :: m⟩ ⊂ M
F9: Messages exist until their delivery time
message(P,Q,t,m) unless message(P,Q,t,m) ∧ gclock(t)
F10: Messages are absorbed at the time of their delivery
message(P,Q,t,m) ∧ gclock(t) unless ¬message(P,Q,t,m) ∧ gclock(t)
F11: Messages created at the current time are to be delivered in the future
¬message(P,Q,t,m) ∧ gclock(t) unless ¬gclock(t)
F12: Messages exist until incorporated into the state of the receiving node
state(P,s) ∧ gclock(t) ∧ ⟨set Q,m : message(Q,P,t,m) :: m⟩ = M ∧ M ≠ {} unless ⟨∃ m : ⟨set Q,m' : message(Q,P,t,m') :: m'⟩ = M−{m} :: state(P,u(P,s,m))⟩
Actions
Although it is simpler to think of actions as being atomic and instantaneous, they in fact have a duration which must be modeled in the simulation. A simple way to capture this is to assign each action a starting time, with the difference in starting times between two consecutive actions being equal to the duration of the first (F13, F14); we include the time required to process incoming messages in the duration of the first action following their arrival. The effect of each action is atomic (F15). We require terminated nodes to remain so forever (F16), and we assume that messages whose delivery time coincides with the execution time of an action are absorbed before the action executes.
F13: An action continues to exist until it is time to execute it
action(P,t,a) ∧ a ≠ ⊥ unless action(P,t,a) ∧ a ≠ ⊥ ∧ gclock(t)
F14: An action continues to exist until it is enabled
action(P,t,a) ∧ a ≠ ⊥ ∧ gclock(t) unless action(P,t,a) ∧ a ≠ ⊥ ∧ gclock(t) ∧ ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩
F15: An enabled action continues to exist until executed atomically
state(P,s) ∧ enabled(P,t,a) unless action(P,e(P,s,a),a(P,s,a)) ∧ state(P,s(P,s,a)) ∧ ⟨∀ Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
F16: A terminated node remains terminated
stable halted(P)
State
The only events which can change the state of a node are the delivery of a message or the execution of an action. Note that these transitions have already been described (F12 and F15). All that remains is to forbid any other changes to the state.
F17: The state does not change in the absence of work to do
state(P,s) unless state(P,s) ∧ ⟨∃ Q,t,a,m : gclock(t) :: enabled(P,t,a) ∨ message(Q,P,t,m)⟩
Progress
The final section of the initial specification contains the liveness properties, which describe the state transitions required of the simulation program. Only two progress properties are needed: messages must be delivered, and actions must be executed.
F18: Messages must be delivered
message(P,Q,t,m) ↦ ¬message(P,Q,t,m)
F19: Actions must be executed
action(P,t,a) ∧ a ≠ ⊥ ↦ ¬action(P,t,a)
Other aspects of the program's behavior (such as moving time forward) are implied by combining these progress properties with the earlier stated safety properties. For example, if an action has an execution time in the future, then the clock must eventually move forward by properties F13 and F19.
This completes the specification of the program behavior. By carefully abstracting away irrelevant details, we are able to generate a specification which is both concise and clear.
Specification Refinement
In this section we refine the behavior specification up to the point where the generation of a Swarm program becomes trivial. Since the mechanics of how to carry out UNITY-style specification refinements may be found elsewhere in the literature [6, 8, 13, 23] we omit some of the details and all the proofs. We choose not to
skip the specification refinement altogether for the sake of completeness and because the emphasis on architecture independence leads to a refinement style different from that of other authors.
Like most contemporary authors, we view program derivation as a creative process and not as a mechanical substitute for design. Each specification refinement requires some inspiration, is motivated by design insights, and is carried out in a highly disciplined fashion. Although the syntactic form of a particular assertion may suggest a
certain type of refinement, such heuristics, which could ultimately lead to some form of automation, play only a secondary role in the refinement process today. Most often, the specification refinement is biased towards a very specific architecture at the expense of rendering all other architectures inappropriate. In our methodology, however, the emphasis is on a general, architecture-independent solution. By attempting to generate first an architecture-independent program, our methodology makes it possible to perform the specification refinement once and then to use the same initial program as the basis for deriving multiple architecture-specific programs.
A typical specification tends to include many safety properties and relatively few progress properties. The former place constraints on the solution space and, for reasons of completeness, are rather detailed and numerous.
The latter require that particular goal states must be reached but leave the details of how this is to be accomplished undefined. The purpose of specification refinement is to add sufficient detail on how progress is to be accomplished so as to make the writing of an appropriate correct program a trivial exercise. This means that relatively broad progress properties must be replaced by increasingly more specific ones. Changes in state representation often accompany these refinements and lead to more detailed formulations of the safety properties. 
enabled(P,t,a) ∧ state(P,s) ↦ action(P,e(P,s,a),a(P,s,a)) ∧ state(P,s(P,s,a)) ∧ ⟨∀ Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
Refinement 2. Decouple time movement from message and action processing.
The similarity in the forms of properties F18. 
F20:
Property F18.2 requires only that messages be delivered before time moves forward, but does not provide any hints about the delivery mechanism. Property F12 provides the necessary insight, in that it requires that messages be delivered and absorbed in a single atomic step. This suggests the obvious refinement of replacing F18.2 with a progress property having the same form as F12. As a consequence of this, F19.2 becomes unnecessary.
Refinement 6. Add upper bound on time increment.
While F20 provides a lower bound on the time increment (namely 1), it does not include the upper bound.
Such an upper bound is provided by F21', since the time may never exceed the earliest message or action time. By folding F21' into F20, the progress property expresses all the constraints on the clock movement, which is convenient when the time comes to write the abstract program. F20 is replaced by F20.1, which requires that the clock move forward whenever the earliest work to be done is in the future.
The Final Specification.
At this point, the three progress properties describe transformations which can easily be considered atomic, which was our goal in the refinement process. The complete specification is reproduced below.
inv. ⟨Σ t,a : action(P,t,a) :
state(P,s) ∧ enabled(P,t,a) unless action(P,e(P,s,a),a(P,s,a)) ∧ state(P,s(P,s,a)) ∧ ⟨∀ Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
F16:
action(P,e(P,s,a),a(P,s,a)) ∧ state(P,s(P,s,a)) ∧ ⟨∀ Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
Initial Program
The final specification can be easily translated into a correct Swarm program. In this section we introduce the Swarm notation [20] and show how to generate an initial Swarm program from the specification. Swarm is more expressive than UNITY but has a very similar proof logic; the differences between the two proof logics have to do with the definition of atomic actions and with the domain over which proofs must be carried out. While UNITY assumes a fixed set of variables and a fixed set of statements, Swarm manipulates tuples and transactions which may be created dynamically and provides an interesting form of dynamic synchrony. These features allow
Swarm to model in a simple and direct way complex systems involving multiple programming paradigms, heterogeneous architectures, and specialized devices. Swarm is an ideal vehicle for exploring architecture-directed refinement. Even though we occasionally use UNITY for problems that exhibit a static structure, we prefer working with the Swarm notation in order to ensure that our results are relevant for a broad class of design problems.
The Swarm Programming Notation
The primary means for communication among the concurrent components of a Swarm program is a common, content-addressable data structure called a shared dataspace reminiscent of the Linda [5] tuple-space.
Elements of the dataspace may be examined, inserted, or deleted by programs. The model partitions the dataspace into three subsets: TPS, the tuple space (a finite set of data tuples), TRS, the transaction space (a finite set of
To illustrate the Swarm programming notation we digress briefly from the distributed simulation problem to a toy example which, as will later become clear, has many elements in common with one of the two distributed simulation programs we will produce in the next section. In this toy program at most N timers are incremented in lockstep fashion. Each timer i is incremented modulo some overflow value ovr(i). A timer may be brought on line any time during the computation, but eventually all timers mark time in step with each other. To accomplish this, all timers are reset to zero whenever any one of them is reset to zero. As a result, all active timers count modulo
Construct a timer.
We begin by considering the case of a simple timer with identifier i. We can represent the current state of this timer as a tuple time(i,v), where v is the timer's current value. The transaction which increments and resets the timer is Timer(i,ovr(i)). A timer is activated by inserting in the dataspace (initially or during the computation) the corresponding data tuple and transaction.
Define the timer's behavior. A transaction stored in the dataspace is simply a name for an atomic transformation of the dataspace. The transaction's behavior is defined separately as the composition of one or more subtransactions. A subtransaction consists of a dataspace query, which binds some set of existentially quantified local variables whose scope extends over the entire subtransaction, followed by an action which modifies the contents of the dataspace by inserting or removing entries if the query succeeds. (Notationally, the query and action are separated by "→", we use the comma for logical and (∧), and subtransactions are separated by "||".) By definition, deletions are performed before insertions. The query can be any arbitrary predicate over the dataspace, similar to a Prolog goal, and may check for the presence (or absence) of specific entities in the dataspace.
Additionally, a query can make use of one of five special predicates, AND, NAND, OR, NOR, or TRUE, whose truth value is computed from the success or failure of all regular queries executed in parallel (a regular query is one which does not contain one of the special predicates). Thus, the special predicate AND evaluates to true when all of the regular queries present in the transaction succeed. The semantics of transaction execution are similar to those for a single subtransaction, except that the queries for all subtransactions are evaluated in parallel, followed by the deletions and then the insertions appearing in the actions of those subtransactions whose queries succeeded.
In our example, we can specify the behavior of an individual timer by introducing the following transaction type definition (the reason for dividing the first two subtransactions will be made clear later):
The first subtransaction consists of a regular query which checks whether or not the timer needs to be reset and has a null action (skip). The variable t, which is local to this subtransaction, is bound by finding in the dataspace a tuple of type time whose first component contains the constant id. The success of the first subtransaction is communicated to the other subtransactions via the special predicate. Since OR succeeds whenever any regular query executed in parallel evaluates to true, the second subtransaction resets the timer by deleting the tuple time(id,t), independently found by its own query and marked for deletion by a dagger, and by inserting the tuple time(id,0). Similarly, the third subtransaction uses NOR (i.e., not OR) to determine if the timer can be incremented by one unit. The fourth subtransaction recreates the timer (which otherwise would be implicitly deleted). The special predicate TRUE (which always succeeds) ensures that the query associated with this subtransaction becomes a special query and is therefore not considered when OR and NOR are evaluated.
Establish lockstep execution. The requirement for lockstep execution can be expressed in Swarm using the third type of dataspace entity, the synchrony relation. Two timers i and j can be made part of the same synchronic group by inserting into the dataspace the following synchrony relation entry:
Timer(i,ovr(i))~Timer(j,ovr(j))
Recall that a set of transactions present in the dataspace and closed under the reflexive transitive closure of the synchrony relation is called a synchronic group, and that whenever a transaction is selected for execution, the entire synchronic group to which it belongs is executed, and all the subtransactions for all transactions in the group are executed together as if they were part of a single larger transaction. An interesting consequence of these semantics is that the special predicates are now evaluated with respect to the regular queries of the entire synchronic group; we had this in mind when we decided to use special queries in the definition of Timer above. Consequently, the special predicate OR evaluates to true whenever the query of the first subtransaction in either Timer(i,ovr(i)) or Timer(j,ovr(j)) succeeds, indicating that both timers must be reset.
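The effect of evaluating OR and NOR over a whole synchronic group can be imitated in a few lines of Python. This is a rough sketch of the behavior described above, not Swarm itself. It assumes that a timer "needs to be reset" when its next value would reach its overflow value, and the names `step_group`, `run`, and the dictionary representation of the time tuples are hypothetical.

```python
def step_group(values, ovr):
    """Execute one synchronic group holding a Timer transaction for every key of `values`.

    values: timer id -> current value (the time(i,v) tuples)
    ovr:    timer id -> overflow value
    """
    # Regular queries of all timers are evaluated in parallel on the same dataspace.
    # Assumption: a timer needs a reset when its next value would reach ovr(i).
    needs_reset = [values[i] + 1 >= ovr[i] for i in values]
    if any(needs_reset):                           # special predicate OR over the group
        return {i: 0 for i in values}              # every timer in the group is reset
    return {i: v + 1 for i, v in values.items()}   # NOR: every timer is incremented

def run(values, ovr, steps):
    """Return the list of dataspace snapshots produced by `steps` group executions."""
    trace = [dict(values)]
    for _ in range(steps):
        values = step_group(values, ovr)
        trace.append(dict(values))
    return trace

ovr = {"i": 3, "j": 5}
for snapshot in run({"i": 0, "j": 2}, ovr, 5):
    print(snapshot)
```

In this run timer j starts out of step with timer i; the first reset triggered by i brings both to zero, after which the two values coincide at every step.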
Swarm Proof Logic
Despite fundamental notational and computational differences between UNITY and Swarm, the Swarm proof logic [9, 21] is identical to the UNITY proof logic except for the proof obligations associated with the unless and ensures properties. To prove (p unless q) one must show that:
If p is true at some point in the computation and q is not, then executing any synchronic group either maintains p or establishes q.
In the more complicated case of (p ensures q) one must show that, in addition to the condition above, the following is also true:
If p ∧ ¬q is true, there exists a transaction t such that every synchronic group containing t will establish q when executed. The fairness assumption guarantees that t will eventually be selected and the synchronic group to which it belongs at the time will establish q.
It should be clear to the reader familiar with UNITY that these conditions reduce to those of UNITY when the synchrony relation is empty and the set of instantiated transactions remains constant. In this case, each transaction corresponds exactly to a UNITY statement, except for some technical details having to do with the fact that UNITY statements are deterministic while Swarm transactions are not.
Abstract Program
Returning to the simulation problem, the three progress properties in the final specification (F18.2.1, F19.3, and F20.1) suggest an abstract program having three transaction types: one to execute actions, one to move the clock, and one to deliver messages. The remainder of this section gives these transactions. Informally, each transaction consists of two subtransactions, one to establish the required progress condition, and one to continue the computation. The transactions are given without proof. Accompanying each transaction is a list of the properties which directly constrain its form, and which are therefore likely to be affected by any refinement of the transaction.
We make use of these characteristic properties in the program refinement stages as a heuristic for reducing the amount of formal proof required. Property F18.2.1 requires that a message with a delivery time equal to the current simulation time be incorporated into the state of the destination node. We implement this as a transaction of type gdeliver, which does this work for all messages at all nodes.
gdeliver ≡ P,Q,t,m,s : gclock(t), state(P,s), message(Q,P,t,m) → message(Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || TRUE → gdeliver
The program equates the predicates state(P,s) and message(Q,P,t,m) with corresponding tuples in the Swarm dataspace. One instance of this transaction type must be included in the initial dataspace configuration. The characteristic properties for this transaction are F2, F9, F10, and F12. Property F19.3 requires that the execution of an enabled action result in the atomic update of the node's state, the creation of the next action, and (possibly) the sending of messages to other nodes. To accomplish this, we introduce a transaction type task, having the same form as action, with the obvious relationship:
F22: inv. [task(P,t,a)] ⇔ action(P,t,a)
Because this transaction affects the entire program state, it is necessary to verify that it violates none of the safety properties of the specification. A brief perusal of the specification identifies F2, F3, F13, F14, F15, and F16 as the characteristic set for this transaction.
task(P,t,a) ≡ s : a ≠ ⊥, gclock(t), ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩, state(P,s) → task(P,e(P,s,a),a(P,s,a)), state(P,s)†, state(P,s(P,s,a)), ⟨Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
    || NOR → task(P,t,a)
Finally, property F20.1 requires that time move forward when all the work to do is in the future. This is accomplished by introducing a transaction of type gtick which examines the global program state and moves the clock forward when there are no tasks or messages to be processed at the current time.
One instance of this transaction type must be included in the initial dataspace configuration. The characteristic properties for the transaction are F1, F7, F10, F12, F11, F14, F15, and F21.
Architecture-Directed Program Refinement
The abstract program definitely solves the simulation problem. However, the widespread use of global data items makes it unsuitable for implementation on any truly distributed architecture. Our refinement methodology now turns to address this problem. In the next two sections, we give formal descriptions of the constraints imposed on programs by two example architectures, and then use these constraints to derive a solution specifically tailored to each architecture.
Consensus Bus Architecture
Our first example architecture is a classic distributed memory message-passing architecture with a special-purpose device which makes it particularly well-suited for use in distributed simulation. The target machine consists of a number of identical components called sites. Each site h contains a controller k, a processor e, and two memory units, r (registers) and m (main memory). We assume that all entities in the physical system are assigned distinguishable identifiers which we can use to refer to each component, and let H denote the set of site identifiers.
The sites are connected by a data bus D and a consensus bus C. Both the controller and the processor are sequential machines capable of executing one operation at a time. At each site, the two processing units run asynchronously; that is, they do not share a common clock. The registers are used to allow the controller and processor at a single site to share a limited amount of data. Reads and writes to the registers are atomic. Figure 1 shows a stylized representation of a single site.
The data bus is the primary mechanism for sharing information between sites. This bus connects the processors and main memories from all sites, creating a limited form of distributed shared memory. Using the data bus, processors can write to, but not read from, memories at other sites. In a sense, the data bus implements a message-passing channel which allows processors to pass data amongst themselves by leaving messages in mailboxes. The local memories can be read only by the processor at the same site; no other memory accesses are permitted. In particular, the controllers cannot access the main memories, and the register units can only be accessed by the processing units at the same site. All accesses to the main memories (both reads and writes) are atomic.
The consensus bus is a specialized hardware device which allows all the controller elements to share a limited amount of information in the following manner. Logically, during the execution of a single step, the controller at each site provides a data value to the bus. The consensus bus computes a hardwired function of these values, the result of which is then provided to the controllers and can be used in the remainder of the step. The result is a synchronous, lock-step execution by the controllers.
Allocation Constraints. Now we can formally define a site h as a five-tuple consisting of the site identifier h.id and the identifiers for the individual hardware components associated with the site: a controller h.k, a bank of registers h.r, a processor h.e, and a memory h.m. We also introduce the set K of controller identifiers, the set R of register identifiers, the set E of processor identifiers, and the set M of memory identifiers. Since each hardware component belongs to one and only one site, we find it convenient to introduce a function site which, given the identifier of a component, returns the identifier of the site to which the component belongs.
Next we consider formally the constraints on allocation of programs and data to sites. For this purpose, we can introduce a predicate locus(i,j), where i is from the set of transactions or the set of tuples, and j is from the set of identifiers for hardware components. The predicate locus(i,j) is true if the dataspace element i is allocated to component j, and is initially restricted so that transactions are mapped to processing units, and data tuples to memory units:
This very general definition of the locus predicate can be further constrained to reflect the specifics of the simulator hardware. In the present case, the sequential nature of the processing units constrains the allocation of software to hardware, allowing at most one transaction to be present at any time on each of the processing units:
(The notation [t] means "the transaction instance t exists in the dataspace.") Also, because the processors run asynchronously with respect to the rest of the system, transactions which are placed on them cannot be part of any synchronic group containing transactions allocated to other processing units:
Finally, since controllers execute synchronously, transactions which are placed on them must all be part of the same synchronic group.
Access Restrictions. We turn now to the constraints on memory access by transactions. In particular, we wish to constrain the locations of reads and writes made by transactions. Our approach is to introduce auxiliary tuples which record the reading or writing of a variable by a process running on one of the processing units. We can describe the read/write constraints in terms of invariants over two auxiliary tuples, raccess(i,j) and waccess(i,j), where i ∈ K ∪ E and j ∈ K ∪ R ∪ E ∪ M (we include the possibility that a transaction is read or written). The presence of one of these tuples in the dataspace indicates that a transaction with locus i has read (or written) an entity with locus j. To prove that a transaction satisfies one of these constraints, all subtransactions are augmented to insert a raccess tuple whenever a tuple appears in the query of the subtransaction, and to insert a waccess tuple whenever a tuple appears in the action. The access constraints can now be expressed in terms of three invariants.
C C 4: Transactions on the processors can only read from memories located at the same site.
C C 5: Transactions on processors can write to any memory, to registers at the same site, and to the processor itself (to change execution state).
C C 6: Transactions on controllers can read and write only registers at the same site or the controller itself.
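The three access restrictions lend themselves to a mechanical check over recorded raccess and waccess tuples. The following sketch is one possible reading, not the formal invariants themselves: the `Component` record, the kind letters, and the function names are hypothetical, and "memories" in the read restriction for processors is interpreted as registers or main memory at the same site, which is an assumption.

```python
from typing import NamedTuple


class Component(NamedTuple):
    kind: str  # "K" controller, "R" registers, "E" processor, "M" main memory
    site: int  # identifier of the site the component belongs to


def read_allowed(src, dst):
    """May a transaction located on `src` read an entity located on `dst`?"""
    same_site = src.site == dst.site
    if src.kind == "E":  # processors read only local memories (assumed: R or M)
        return dst.kind in ("R", "M") and same_site
    if src.kind == "K":  # controllers read local registers or themselves
        return dst == src or (dst.kind == "R" and same_site)
    return False


def write_allowed(src, dst):
    """May a transaction located on `src` write an entity located on `dst`?"""
    same_site = src.site == dst.site
    if src.kind == "E":  # processors write any main memory, local registers, themselves
        return dst.kind == "M" or dst == src or (dst.kind == "R" and same_site)
    if src.kind == "K":  # controllers write local registers or themselves
        return dst == src or (dst.kind == "R" and same_site)
    return False


def violations(raccess, waccess):
    """List the recorded accesses that break one of the restrictions."""
    bad = [("read", i, j) for i, j in raccess if not read_allowed(i, j)]
    bad += [("write", i, j) for i, j in waccess if not write_allowed(i, j)]
    return bad


E0, K0 = Component("E", 0), Component("K", 0)
R0, M0, M1 = Component("R", 0), Component("M", 0), Component("M", 1)

found = violations(
    raccess=[(E0, M0), (E0, M1), (K0, R0), (K0, M0)],
    waccess=[(E0, M1), (K0, R0)],
)
```

In this example the remote read by the processor and the main-memory read by the controller are reported, while the remote write by the processor is accepted, matching the mailbox behaviour of the data bus.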
Consensus bus. The consensus bus is actually a specialized device which synchronizes the execution of all controllers. Additionally, at each step the bus accepts a boolean value from each controller and returns to all controllers the result of applying the logical and across all the boolean values received. There are two reasons for introducing the consensus bus in our illustration. First, from a practical viewpoint, such a bus is easy to construct and matches the needs of the simulator. Second, from a pedagogical perspective, the bus allows us to illustrate the formalization of a highly specialized device and the expressive power of the synchronic group construct in Swarm.
Returning to the formalization of the constraints imposed by the consensus bus, we take advantage of the built-in consensus feature associated with synchronic groups. It allows us to reduce the effect of the consensus bus to restricting transactions allocated to the controllers from using any of the special queries except for AND and NAND (and, of course, TRUE). To accomplish this, we augment each subtransaction on controller k that uses OR (or NOR), e.g., || : OR, query → action, so as to insert an auxiliary tuple that records the improper use of the consensus bus, i.e., || : OR, query → action, invalid_consensus(), and add the following proof obligation:
We rely here upon C C 3, which requires all transactions allocated to controllers to execute synchronously, allowing us to make use of the Swarm consensus mechanism. This formulation is in fact stronger than a syntactic restriction; for example, if the query associated with the OR always evaluates to false, we can prove that no improper use of the consensus bus takes place in spite of the fact that the syntax alone suggests otherwise.
The formalization of architectural constraints clearly depends upon the computational model being used.
The fact that Swarm already provides a form of built-in consensus makes the formalization trivial. This does not mean that our assertional method would be inappropriate otherwise, but it does mean that the formalization, by necessity, would be more complex. In the absence of the built-in consensus, auxiliary tuples would be needed to keep track of the booleans supplied by each controller and the values returned by the bus. The relation between these values would, of course, be constrained by an appropriate invariant.
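The relationship between the special queries and the bus can be made concrete with a small sketch. It assumes that each special predicate is evaluated over the list of regular-query outcomes of one synchronic group (all succeed for AND, at least one for OR, and their negations for NAND and NOR); the dictionary, the set of bus-compatible names, and the function names are hypothetical. The check shown is purely syntactic, and is therefore weaker than the invariant-based formulation described above.

```python
# Assumed semantics of the special predicates over the outcomes of the
# regular queries in one synchronic group (True means the query succeeded).
SPECIAL_PREDICATES = {
    "TRUE": lambda results: True,
    "AND": lambda results: all(results),
    "NAND": lambda results: not all(results),
    "OR": lambda results: any(results),
    "NOR": lambda results: not any(results),
}

# Predicates that can be derived from a bus that only computes a logical and.
BUS_COMPATIBLE = {"TRUE", "AND", "NAND"}


def consensus_bus(values):
    """Hardwired function of the bus: the logical and of one boolean per controller."""
    return all(values)


def improper_uses(controller_predicates):
    """Names of special predicates used on controllers that the bus cannot supply."""
    return sorted(set(controller_predicates) - BUS_COMPATIBLE)


local_results = [True, True, False]
and_value = SPECIAL_PREDICATES["AND"](local_results)
nand_value = SPECIAL_PREDICATES["NAND"](local_results)
flagged = improper_uses(["AND", "OR", "TRUE"])
```

AND coincides with the bus output and NAND with its negation, which is why only these two (and TRUE) are admitted on the controllers in this sketch.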
Initial Mapping. We now return to the task of mapping the abstract program onto this architecture.
We assume that there are sufficient processors to associate a distinct one with each node. Obviously, not all aspects of the program state can be allocated without violating the architectural constraints. This fact will drive the derivation process as we move to resolve those allocation decisions which cannot be made at this point.
Nevertheless, certain allocations are required by the nature of the architecture; for example, each task transaction must be allocated to a processor and must run asynchronously, each node's state should be placed into the processor's local memory, and messages, which must be sent from one processor to another, should also be allocated to
Note that this allocation satisfies C C 1 and C C 2 for the entire program state, and that C C 3, C C 6 and C C 7 are satisfied vacuously (since no transaction is allocated to the controllers). However, there is no allocation of either gdeliver or gtick that satisfies the requirements. Obviously, neither can be placed on the processors without violating C C 1, and allocating them to the controllers would violate C C 6, since each needs access to information located in the site's main memory. Further, the gclock tuple cannot be allocated to any memory or register bank, as it must be read by every processor, violating either C C 4 or C C 6. The remainder of the refinement process is guided by these failures.
Refinement 1. Allocate gdeliver.
The problem of resolving the allocation of gdeliver is most easily solved, so we address it first. Since the work done by the transaction is clearly local, the obvious solution is to distribute the processing to each site, introducing a local deliver transaction, with one copy of the transaction for each node. The gdeliver transaction type is replaced by a deliver transaction type, formed by parameterizing gdeliver with the node id, as follows:
deliver(P) ≡ Q,t,m,s : clock-free form using gclock(t), state(P,s), message(Q,P,t,m) → message(Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || TRUE → deliver(P)
We require that there be a deliver transaction on each processor, i.e.,
C C 8: inv. ⟨∃ Q,t,a,m :: (action(Q,t,a) ∧ a ≠ ⊥) ∨ message(Q,P,t,m)⟩ ⇒ deliver(P)
This formulation requires that a deliver transaction exist if there is the possibility of any new messages appearing in the future, while allowing the program to terminate when the simulation is complete. Additionally, we introduce further refinements on locus to guarantee that the transaction satisfies the access restrictions (C C 4 and C C 5):
inv. locus(deliver(P),i₁) ∧ locus(task(P,t,a),i₂) ⇒ site(i₁) = site(i₂)
Re-Verification Obligations. All the safety properties associated with gdeliver (F2, F9, F10, and F12) continue to be satisfied, since no new state transitions are introduced by this refinement. The progress property (F18.2.1) also continues to hold. This can be seen by observing that any transition which would have been made by the global transaction is performed by one of the local transactions (specifically the one assigned to the destination site); and since the local transactions are always re-created, the statement needed to make the transition always exists.
Outstanding Violations. This allocation violates C C 1 as task is already allocated to the processor, but we will deal with this violation later. Additionally, neither gclock nor gtick can yet be allocated.
Refinement 2. Allocate gclock(T).
We now turn to the problem of allocating the clock. Since task reads gclock, C C 4 applies, requiring that gclock be located on every processor which has a task, and that the tuple be located in either registers or memory.
This suggests that a local clock should be maintained at each site, with all clocks running in lockstep. We introduce a tuple of type clock, with one tuple for each node. The tuple replaces the gclock tuple type. All three transactions must be re-written to make use of the new data representation.
task(P,t,a) ≡ s : a ≠ ⊥, clock(P,t), ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩, state(P,s) → task(P,e(P,s,a),a(P,s,a)), state(P,s)†, state(P,s(P,s,a)), ⟨Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
    || NOR → task(P,t,a)
gtick ≡ P',T,T' : clock(P',T), ⟨∀ P,Q,t,a,m : message(Q,P,t,m) ∨ (task(P,t,a) ∧ a ≠ ⊥) :: t > T⟩, T+1 ≤ T' ≤ ⟨min P,Q,t,a,m : message(Q,P,t,m) ∨ (task(P,t,a) ∧ a ≠ ⊥) :: t⟩ → ⟨P : P ∈ nodes :: clock(P,T)†, clock(P,T')⟩
    || TRUE → gtick
deliver(P) ≡ Q,t,m,s : clock(P,t), state(P,s), message(Q,P,t,m) → message(Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || TRUE → deliver(P)
We introduce the obvious requirements that there be exactly one clock for each processor, and that all clocks carry the same time.
inv. ⟨Σ T : clock(P,T) :: 1⟩ = 1
C C 10: inv. gclock(T) ≡ ⟨∀ P :: clock(P,T)⟩
Additionally, to resolve the violation of C C 4 by task and deliver, we must allocate the clock tuple to either registers or memories at the same site as the two transactions. That is, we introduce the following additional restrictions on locus:
inv. locus(clock(P,T),i₁) ∧ locus(task(P,t,a),i₂) ⇒ site(i₁) = site(i₂)
Re-Verification Obligations. The transactions task and deliver are actually unaffected by this transformation since gclock(T) ≡ clock(P,T). As far as gtick is concerned, it clearly satisfies C C 10, and consequently, all other behavioral obligations. Additionally, we can now show that the task and deliver transactions access only local data.
Outstanding Violations. Both task and deliver now satisfy all architectural constraints except for C C 1. gtick cannot be allocated without violating one of C C 4-C C 6.
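The freedom that gtick has in choosing the new time can be illustrated numerically. The sketch below assumes integer time stamps and represents the pending work simply as a list of the time stamps of all outstanding messages and non-null tasks; the function name and this representation are hypothetical, and the case of no pending work at all is not modelled.

```python
def gtick_choices(current_time, pending_times):
    """Admissible new clock values T' with T+1 <= T' <= min(pending time stamps).

    Returns an empty list when the query of gtick would fail, that is, when
    some pending item is not strictly in the future. The case of an empty
    `pending_times` list is left out of this sketch.
    """
    if not pending_times:
        return []  # not modelled: no upper bound is available
    if any(t <= current_time for t in pending_times):
        return []  # work remains at (or before) the current time
    return list(range(current_time + 1, min(pending_times) + 1))


jump = gtick_choices(3, [7, 5, 9])
blocked = gtick_choices(3, [3, 8])
single = gtick_choices(3, [4, 10])
```

With the clock at 3 and pending work at times 5, 7, and 9, the clock may move to 4 or directly to 5, but never past the earliest pending item.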
Refinement 3. Allocate gtick.
As we turn to address the problem of allocating gtick, note that the "∀" in the query suggests using the consensus bus, so an allocation to the controllers is proposed. With this in mind, we introduce a distributed processing scheme, with one tick transaction at each site. The refinement is presented in two steps. First, we present a formulation with a separate subtransaction for each processor P. Each subtransaction checks locally for work to be performed. The global check is then done using an AND special predicate query, which succeeds only when each of the local subtransactions succeeds (that is, all the work at each site is in the future). Additionally, we must select a time increment that does not violate F21. Since we cannot pass any additional information between controllers, and since all sites must have the same time value, 1 is the only reasonable choice.
gtick ≡ ⟨ || P : P ∈ nodes ::
        T : clock(P,T), ⟨∀ Q,t,a,m : message(Q,P,t,m) ∨ (task(P,t,a) ∧ a ≠ ⊥) :: t > T⟩ → skip
        || T : AND, clock(P,T) → clock(P,T)†, clock(P,T+1) ⟩
    || TRUE → gtick
Now we can separate the gtick transaction into a collection of local, synchronous tick transactions, one for each processor, with the AND special query computed by the consensus bus. The parameter to the local transaction is the P from the subtransaction generator. This transformation retains the semantics of the previous transaction, since the individual tick transactions are in the same synchronic group. This gives us the following definition for the new transaction:
tick(P) ≡ T : clock(P,T), ⟨∀ Q,t,a,m : message(Q,P,t,m) ∨ (task(P,t,a) ∧ a ≠ ⊥) :: t > T⟩ → skip
    || T : AND, clock(P,T) → clock(P,T)†, clock(P,T+1)
    || TRUE → tick(P)
To maintain the invariant that all clocks have the same value, we must require that, if any tick transaction remains in the dataspace, then all tick transactions remain, i.e.,
C C 11: inv. tick(P) ⇔ tick(Q)
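The lock-step behaviour of the local tick transactions can be mimicked in a few lines. This is a simplified sketch, not the Swarm semantics: each site contributes one boolean ("all my pending work is strictly in the future"), the logical and of these booleans plays the role of the AND special query computed by the consensus bus, and all clocks advance by exactly one unit when it succeeds. The dictionaries and the function name are assumptions of the sketch.

```python
def tick_step(clocks, pending_times):
    """One synchronous step of all tick transactions.

    `clocks` maps each node to its local clock value; `pending_times` maps a
    node to the time stamps of its outstanding messages and non-null tasks.
    """
    # First subtransaction of each tick: the purely local check.
    local_ok = [
        all(t > clocks[p] for t in pending_times.get(p, []))
        for p in clocks
    ]
    # Second subtransaction: succeeds only if every local check succeeded.
    if all(local_ok):
        return {p: T + 1 for p, T in clocks.items()}
    return dict(clocks)


clocks = {0: 3, 1: 3}
pending = {0: [5], 1: [4]}
after_one = tick_step(clocks, pending)
after_two = tick_step(after_one, pending)
```

After the first step both clocks read 4; the second step leaves them unchanged because node 1 now has work at the current time, so the clocks stay equal throughout.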
Since we make use of the consensus bus, we will need to allocate the transactions to the controllers:
Outstanding Violations. We are left with a violation of C C 6 by tick, since it must determine the existence or non-existence of messages and tasks, neither of which is allocated to the registers. The violations of C C 1 by task and deliver are also outstanding.
Refinement 4. Detecting absence of work.
To remove the violation of C C 6 by tick, we introduce two new tuple types, allocated to the registers, which contain sufficient information to allow tick to detect that there is no remaining work to be done at a site at the current time. We introduce the tuple types event(P,t) and no_msg(P), having the meanings "the next task to execute for node P is at time t," and "there are no remaining unabsorbed messages for node P at the present time," respectively.
A C 13: inv. locus(event(P,t),i) ⇒ i ∈ R
A C 14: inv. locus(event(P,t),i₁) ∧ locus(task(P,t,a),i₂) ⇒ site(i₁) = site(i₂)
A C 15: inv. locus(no_msg(P),i) ⇒ i ∈ R
A C 16: inv. locus(no_msg(P),i₁) ∧ locus(task(P,t,a),i₂) ⇒ site(i₁) = site(i₂)
The task transaction maintains the event tuple, while deliver and tick cooperate to update no_msg.
task(P,t,a) ≡ s : a ≠ ⊥, clock(P,t), no_msg(P), state(P,s) → event(P,t)†, task(P,e(P,s,a),a(P,s,a)), ⟨ : a(P,s,a) ≠ ⊥ :: event(P,e(P,s,a))⟩, state(P,s)†, state(P,s(P,s,a)), ⟨Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
    || NOR → task(P,t,a)
tick(P) ≡ T : clock(P,T), no_msg(P), ¬event(P,T) → skip
    || T : AND, clock(P,T) → clock(P,T)†, clock(P,T+1), no_msg(P)†
    || TRUE → tick(P)
deliver(P) ≡ Q,t,m,s : clock(P,t), state(P,s), message(Q,P,t,m) → message(Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || t : clock(P,t), ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩ → no_msg(P)
    || TRUE → deliver(P)
Re-Verification Obligations. We must show that the new forms of each transaction satisfy the corresponding characteristic properties. We can also show that there are now no access violations by tick.
Proof Outline. To prove the functional correctness of this refinement, we must show that the new representation carries the same semantics as the original. In particular, we must show that tick cannot incorrectly detect the absence of work. This can be accomplished if the refined transactions maintain the following invariants, which equate the two data representations:
C C 13: inv. event(P,t) ≡ ⟨∃ a : task(P,t,a) :: a ≠ ⊥⟩
C C 14: inv. no_msg(P) ∧ clock(P,t) ⇒ ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩
The truth of these invariants is clear from the text of the transactions. Since the no_msg tuple is removed by tick when the clock is advanced (to maintain C C 14), we require a progress property that guarantees that the removal of the last message for a node results in the re-insertion of no_msg. That is, we require:
C C 15: no_msg(P) detects clock(P,t) ∧ ⟨∀ Q,m :: ¬message(Q,P,t,m)⟩
This property follows immediately from the invariance of C C 14 and the text of deliver.
Outstanding Violations. Only the violation of C C 1 by task and deliver is outstanding.
Refinement 5. Satisfy uniprogramming requirements for processors.
Since the deliver and task transactions require access to both the registers and memory at a site, both must be allocated to the processors. However, at most one transaction can be present on any processor. To satisfy C C 1, we must combine these two transactions in some way. This can be done either by combining the two transactions into a single transaction that does both, or by alternating them. We opt for the latter, since this more closely reflects the approach that might be used in a traditional programming language. We modify task and deliver as follows:
task(P,t,a) ≡ s : a ≠ ⊥, clock(P,t), no_msg(P), state(P,s) → event(P,t)†, deliver(P,a(P,s,a)), ⟨ : a(P,s,a) ≠ ⊥ :: event(P,e(P,s,a))⟩, state(P,s)†, state(P,s(P,s,a)), ⟨Q : c(P,Q,s,a) :: message(P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
    || NOR → task(P,t,a)
deliver(P,a) ≡ Q,t,m,s : clock(P,t), state(P,s), message(Q,P,t,m) → message(Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m)), deliver(P,a)
    || NOR → no_msg(P)
    || t : NOR, clock(P,t), event(P,t) → task(P,t,a)
    || T,t : NOR, clock(P,t), ((event(P,T) ∧ T ≠ t) ∨ a = ⊥) → deliver(P,a)
with the additional requirement that only one of the two transactions is present at any given time:
C C 16: inv. ¬(task(P,t,a) ∧ deliver(P,a'))
We have added a parameter to deliver to keep track of the action which should be performed when all messages have been delivered. There is no need to include the time of the next action, since that is already available from the event tuple.
Proof Obligations. This transformation does not affect any of the safety properties to be satisfied by task and deliver. We must, however, show that the transactions satisfy F18.2.1, the progress property which was previously satisfied by deliver, and F19.3, the progress property for task. Finally, since the new transactions satisfy C C 1, all architectural constraints are met at this point.
Proof Outline. To prove that messages are eventually delivered (F18.2.1), we must show that if there is a message to be delivered at the current time, then there is a deliver transaction to process it, that is,
inv. clock(P,t) ∧ message(Q,P,t,m) ⇒ ⟨∃ a :: deliver(P,a)⟩
This invariant can be verified by examining the program text: deliver does not move the time and does not create messages; tick does not create messages, and only advances the time when there are no messages and no task; and once a task is executed, it creates the next deliver transaction, with all messages being in the future. Since the new form of deliver clearly performs the necessary state transitions, we have F18.2.1.
Additionally, we must show that if there is an event scheduled at any given time, there will eventually be a task transaction to perform it, e.g.,
The unless is clearly maintained by the entire program, and the leads-to is established by the third subtransaction of deliver.
This completes the derivation of the simulation program for execution on the consensus bus architecture.
To simplify the presentation, we did not consider the problem of termination, although it should be clear that the issue could have been addressed at the expense of slightly more complex formulations of some properties. We now turn our consideration to solving the problem on a very different architecture, to show the flexibility of this approach.
Ring Architecture
Our second target architecture is a traditional ring. The nodes in the ring are multiprogrammable, general purpose processors containing a local memory. Each processor is assumed to have a unique identifier from the set {0..N-1}, where N is the size of the ring and equals the number of nodes in the simulated network. Identifiers are assigned sequentially around the ring. The processors in the ring are connected by one-way, asynchronous communication channels (communication is clockwise around the ring). This is a true distributed memory architecture, so a processor can only directly read from its local memory. We model message passing as writes by one processor to another processor's memory; because communication proceeds clockwise around the ring, a processor can write both to its local memory and to the memory of the processor to its immediate right in the ring.
We assume that all memory accesses are atomic. To simplify discussions about the ring, we introduce two definitions.
Allocation Constraints. As with the consensus bus, we express the mapping of program elements to hardware using the predicate locus(i,j), having the meaning that the transaction or tuple i is to be allocated to the processor with identifier j.
Access Restrictions. We introduce two auxiliary tuple types raccess(i,j) and waccess(i,j) to record reads and writes. Both i and j are integers in the range 0..N-1, and have the meaning that a transaction resident on node i of the ring has read (or written) a tuple or transaction on node j. The informal access constraints can be formally expressed using two predicates over these auxiliary tuples.
Additional Restrictions. The only additional constraint introduced by the ring architecture is the requirement that the nodes execute asynchronously. This can be succinctly stated within the Swarm notation as:
Initial Mapping. Very little of the program's state can be allocated without further refinement. While we can be certain that we want each node's state to be allocated to the same location as the corresponding task transaction (A R 1), no other decisions are possible.
Several problems are immediately obvious. First, gclock cannot be allocated to any node in the ring, since its value is read by every task. Further, neither of the global transactions (gdeliver and gtick) can be placed until gclock is dealt with. Finally, the restriction on writes (C R 2) makes it impossible for a transaction to send a message to any transaction allocated to a processor other than the one to its right. Until the allocation decisions are made, it is not possible to prove that any of the transactions satisfy the access constraints (C R 1 and C R 2), since each transaction reads or writes the clock.
Refinement 1. Allocate gdeliver.
The problem of allocating gdeliver is the most easily solved. By C R 2, any transaction which updates a state tuple must be located on the same node as the state tuple. As in the consensus bus architecture, this implies a distributed version of gdeliver, with one transaction for each P. The refinement is identical to that in the previous example, so we give it without further comment.
deliver(P) ≡ Q,t,m,s : gclock(t), state(P,s), message(Q,P,t,m) → message(Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || TRUE → deliver(P)
with
C R 4: inv. ⟨∃ Q,t,a,m :: (action(Q,t,a) ∧ a ≠ ⊥) ∨ message(Q,P,t,m)⟩ ⇒ deliver(P)
A R 2: inv. locus(deliver(P),i₁) ∧ locus(task(P,t,a),i₂) ⇒ i₁ = i₂
Outstanding Violations. We still cannot allocate the gclock tuple, nor can we allocate the gtick transaction.
The violation of the write restrictions (C R 2) with regard to messages also remains, and is considered next.
Refinement 2. Add a "current location" field to messages.
Obviously, messages must be sent around the ring, since we cannot send messages directly between nonadjacent processors. As a first step, we modify the form of messages to accommodate the routing process. We begin with a simple data refinement by adding to each message a processor identifier which gives the current location of the message, e.g., msg(P,Q,R,t,m) with the meaning, "Q has sent a message with content m to R; the message is currently at node P and must be delivered at time t." Our coupling invariant (C R 5) states that a message exists if there is an analogous msg tuple located somewhere between the source and destination processors.
C R 5: inv. message(Q,R,t,m) ≡ ⟨∃ P : P twixt(Q,R) :: msg(P,Q,R,t,m)⟩
Additionally, we require that there be at most one msg tuple for any message (C R 6), and that it be allocated to the processor simulating node P (A R 3):
C R 6: inv. ⟨Σ P : msg(P,Q,R,t,m) :: 1⟩ ≤ 1
A R 3: inv. locus(deliver(P),i₁) ∧ locus(msg(P,Q,R,t,m),i₂) ⇒ i₁ = i₂
We update the program to use the new representation. Note that the sending of a msg by task is performed by writing the tuple directly to the destination processor.
task(P,t,a) ≡ s : a ≠ ⊥, gclock(t), ⟨∀ Q,R,m :: ¬msg(Q,R,P,t,m)⟩, state(P,s) → task(P,e(P,s,a),a(P,s,a)), state(P,s)†, state(P,s(P,s,a)), ⟨Q : c(P,Q,s,a) :: msg(Q,P,Q,l(P,Q,t,s,a),v(P,Q,s,a))⟩
    || NOR → task(P,t,a)
deliver(P) ≡ Q,t,m,s : gclock(t), state(P,s), msg(P,Q,P,t,m) → msg(P,Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || TRUE → deliver(P)
gtick ≡ T,T' : gclock(T), ⟨∀ P,Q,R,t,a,m : msg(R,Q,P,t,m) ∨ (task(P,t,a) ∧ a ≠ ⊥) :: t > T⟩, T+1 ≤ T' ≤ ⟨min P,Q,R,t,a,m : msg(R,Q,P,t,m) ∨ (task(P,t,a) ∧ a ≠ ⊥) :: t⟩ → gclock(T)†, gclock(T')
    || TRUE → gtick
Re-Verification Obligations. Obviously, the transactions satisfy the new architectural constraints (C R 5, C R 6). Since this is simply a data refinement, no new transitions are added to the program, so the program continues to satisfy the safety properties of the original specification. It is necessary to verify that the progress properties of the original program remain satisfied. In particular, we need to show that deliver continues to establish F18.2.
Outstanding Violations. Neither gtick nor gclock has been allocated. In addition, task creates the new messages at the target processor, which is a violation of C R 2.
Refinement 3. Route messages around the ring.
The write access violation by task can be resolved by adding a transaction type which routes messages around the ring, and then modifying task to create new messages locally. We use a distributed solution, allocating one routing transaction to each processor. We add a new transaction type router(P) which simply forwards messages which are not yet at their destination. Its behavior is formally described as:
Messages eventually reach their destination: msg(P,P,Q,t,m) ↦ msg(Q,P,Q,t,m)
Messages remain at their destination until they are absorbed: msg(P,Q,P,t,m) unless ⟨∀ R :: ¬msg(R,Q,P,t,m)⟩
The router transaction actually implements these new properties incrementally, by moving the messages one processor at a time, clockwise around the ring.
router(P) ≡
    Q,R,t,m : msg(P,Q,R,t,m), P ≠ R
        → msg(P,Q,R,t,m)†, msg(right(P),Q,R,t,m)
    || TRUE → router(P)

This transaction should be allocated to the same location as the analogous deliver transaction, and we must guarantee that the transaction continues to exist as long as there is the possibility that any messages may arrive needing to be routed, e.g.,

inv. locus(deliver(P),i₁) ∧ locus(router(P),i₂) ⇒ i₁ = i₂
C R 9: inv. ⟨ ∃ Q,t,a,m :: (action(Q,t,a) ∧ a ≠ ⊥) ∨ message(Q,P,t,m) ⟩ ⇒ router(P)
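As an informal illustration of the routing behavior described above (not part of the derivation), the following Python sketch forwards messages one hop at a time. It assumes processors are numbered 0 to N-1 and that right(P) is (P+1) mod N; the tuple layout and the helper names router_step and run_routers are hypothetical.

```python
from typing import List, Tuple

# msg(loc, src, dst, t, m): held at processor `loc`, sent by `src`,
# destined for `dst`, with time stamp `t` and payload `m`.
Msg = Tuple[int, int, int, int, str]


def right(p: int, n: int) -> int:
    """Clockwise neighbour of p (assumption: right(P) = (P + 1) mod N)."""
    return (p + 1) % n


def router_step(p: int, msgs: List[Msg], n: int) -> List[Msg]:
    """One firing of router(p): move one message held at p that is not
    destined for p to right(p). Returns the dataspace unchanged if no
    such message exists."""
    for i, (loc, src, dst, t, m) in enumerate(msgs):
        if loc == p and dst != p:
            moved = (right(p, n), src, dst, t, m)
            return msgs[:i] + [moved] + msgs[i + 1:]
    return msgs


def run_routers(msgs: List[Msg], n: int) -> List[Msg]:
    """Fire the routers round-robin until no message can be forwarded."""
    changed = True
    while changed:
        changed = False
        for p in range(n):
            new = router_step(p, msgs, n)
            if new != msgs:
                msgs, changed = new, True
    return msgs
```

For example, on a ring of four processors, run_routers([(0, 0, 2, 5, "a"), (3, 3, 1, 7, "b")], 4) leaves both messages at their destinations, processors 2 and 1. The round-robin schedule is only one of the fair schedules the model permits.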
Re-Verification Obligations. We must show that this new transaction does not violate any of the safety properties from the original specification. In particular, we can identify the properties affected by the transaction as F5, F9, F10, F11, and F12. The proofs are straightforward. Additionally, we must show that router establishes C R 7, and that this in turn allows us to conclude that F18.2.1 is satisfied by the new program.
Proof Outline. To prove that router establishes C R 7, we can show that the sum of the distances between the current locations of all messages in the system and their destinations never increases, and in fact will decrease.
Informally, this means that eventually, all messages will arrive at their destinations. Formally, router establishes the following progress property:
L2: gclock(t) ∧ ⟨ Σ P,Q,R,t,m : msg(P,Q,R,t,m) :: dist(P,R) ⟩ = M ∧ M > 0 ↦ gclock(t) ∧ ⟨ Σ P,Q,R,t,m : msg(P,Q,R,t,m) :: dist(P,R) ⟩ < M

where dist(P,R) is the number of processors between P and R moving clockwise around the ring. Since by F11 messages are not created with a time stamp equal to the current time, the metric M cannot increase. Using L2, we can prove by induction⁵ the following progress property:

L3: gclock(t) ↦ ⟨ Σ P,Q,R,t,m : msg(P,Q,R,t,m) :: dist(P,R) ⟩ = 0

which states that eventually there are no messages with a current time stamp that are not at their destinations. Since by F20.1, the clock must eventually assume a value equal to the time stamp on any message, L3 gives us C R 7.
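The variant-function argument behind L2 and L3 can be checked on small cases with a Python sketch. It assumes dist(P,R) = (R - P) mod N, i.e., the number of clockwise hops from P to R, and it ignores the clock, so every message is treated as having a current time stamp. The names metric, forward_one, and metric_trace are hypothetical.

```python
from typing import List, Tuple

# (location, sender, destination, time stamp, payload)
Msg = Tuple[int, int, int, int, str]


def dist(p: int, r: int, n: int) -> int:
    """Clockwise hops from p to r (assumed reading of dist(P,R))."""
    return (r - p) % n


def metric(msgs: List[Msg], n: int) -> int:
    """The metric M: sum of remaining distances over all messages."""
    return sum(dist(loc, dst, n) for (loc, _src, dst, _t, _m) in msgs)


def forward_one(msgs: List[Msg], n: int) -> List[Msg]:
    """Forward the first message not yet at its destination by one hop."""
    for i, (loc, src, dst, t, m) in enumerate(msgs):
        if loc != dst:
            return msgs[:i] + [((loc + 1) % n, src, dst, t, m)] + msgs[i + 1:]
    return msgs


def metric_trace(msgs: List[Msg], n: int) -> List[int]:
    """Record M after each forwarding step until it reaches zero."""
    trace = [metric(msgs, n)]
    while trace[-1] > 0:
        msgs = forward_one(msgs, n)
        trace.append(metric(msgs, n))
    return trace
```

Each hop reduces the distance of exactly one message by one, so the trace is strictly decreasing and ends at zero, which mirrors the induction from L2 to L3.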
Additionally, it is clear that deliver establishes the following variant of F18.2.1, which states that messages which have arrived at their destination will eventually be absorbed, while allowing for the arrival of new messages (via router):
L4: gclock(t) ∧ ⟨ set Q,m : msg(P,Q,P,t,m) :: m ⟩ = M ∧ M ≠ {} ∧ state(P,s) ↦ gclock(t) ∧ (⟨ ∃ m : ⟨ set Q,m' : msg(P,Q,P,t,m') :: m' ⟩ = M-{m} :: state(P,u(P,s,m)) ⟩ ∨ state(P,s) ∧ M ⊂ ⟨ set Q,m' : msg(P,Q,P,t,m') :: m' ⟩)
Since the number of messages is bounded above (by F11), eventually the set of undelivered messages with current time stamps will reach a maximum, from which time it can only decrease. C R 7 guarantees that all messages eventually reach their destinations, and L4 requires that they eventually be delivered, which gives us F18.2.1.
Outstanding Violations. The allocation of gtick and gclock has not yet been resolved, and the violation of C R 1 by the query for outstanding messages in task remains.
Refinement 4. Refine time increment into search and update phases.
As in the consensus bus example, it is clear that gtick and gclock must be distributed. However, the absence of either a global control mechanism or global memory requires the approach to be truly distributed.
As a first refinement, we can break the work of gtick into two phases, a search phase which finds the earliest work, and an update phase which changes the current time. We find the following definitions useful at this point:
no_msg(R,t) ≡ ⟨ ∀ P,Q,m :: ¬msg(P,Q,R,t,m) ⟩
work(P,t) ≡ ⟨ ∃ a,Q,R,m :: (action(P,t,a) ∧ a ≠ ⊥) ∨ msg(P,Q,R,t,m) ⟩
The initial program refinement is simply to split gtick into two global transactions which match the two phases, satisfying the following requirements: The gsearch transaction is taken from property C R 13, and gupdate from C R 14 and C R 15.

gsearch ≡
    t : t = ⟨ min P,T : work(P,T) :: T ⟩ → gupdate(t)
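To make the search phase concrete, the work predicate and the minimum computed by gsearch can be sketched in Python. This is an illustration under assumed encodings, not the authors' program: actions are (processor, time, action) triples with None standing for ⊥, messages are (location, sender, destination, time stamp, payload) tuples, and the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

Action = Tuple[int, int, Optional[str]]  # (processor, time, action); None plays the role of ⊥
Msg = Tuple[int, int, int, int, str]     # (location, sender, destination, time stamp, payload)


def work(p: int, t: int, actions: List[Action], msgs: List[Msg]) -> bool:
    """work(P,t): P has a non-null action at time t, or holds a message
    stamped t."""
    has_action = any(q == p and at == t and a is not None
                     for (q, at, a) in actions)
    has_msg = any(loc == p and mt == t for (loc, _s, _d, mt, _m) in msgs)
    return has_action or has_msg


def earliest_work(actions: List[Action], msgs: List[Msg]) -> Optional[int]:
    """The value gsearch hands to gupdate: the minimum T over all P with
    work(P,T), or None if there is no work at all."""
    times = [t for (_p, t, a) in actions if a is not None]
    times += [t for (_l, _s, _d, t, _m) in msgs]
    return min(times) if times else None
```

With actions [(0, 9, "x"), (1, 4, None)] and a single message (2, 0, 1, 6, "m"), earliest_work yields 6: the null action at time 4 does not count as work.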
Re-Verification Obligations. We need to prove that this pair of transactions satisfies the safety properties from the original specification. This is trivial for gsearch since it introduces no transitions within the original state (as a result, its characteristic property set is empty). Since gupdate performs the time change, it must be shown to satisfy the entire characteristic set for gtick, namely, F1, F7, F10, F11, F12, F14, F15, and F21. Additionally, we must prove that the new transactions satisfy F20.1, the progress property which motivated gtick.
Proof Outline. The proof that the refinement continues to satisfy F20.1 involves simply proving that C R 10-C R 15 constitute a refinement of F20.1. As a review, F20.1 states:
Informally, C R 11 requires that if there is work to be done, then the system is in a state which, by C R 13 and C R 14, will eventually cause the clock to be moved (C R 15). More formally, C R 11 implies the following lemma, which describes the legal states of this sub-system at any given time:
L5: inv. T = ⟨ min P,T' : work(P,T') ::
The proof follows from the general disjunction rule for leads-to⁶, as follows. L5 can be used to prove L6, which says that if T is the next time at which there is work, then the system will eventually set the clock to time T, i.e.,

L6: T = ⟨ min P,T : work(P,T) :: T ⟩ ↦ gclock(T)

L6 follows from L5, C R 13, C R 14, C R 15, and the transitivity of leads-to. Since L6 implies F20.1, the refinement is proven.
Outstanding Violations. We have not yet allocated either of the new transactions or gclock. The read violation by task remains.
Refinement 5. Distribute the clock.
As a first step towards allocating gsearch, gupdate, and gclock, we distribute the clock, giving each processor a local copy which will be updated during the update phase. This refinement will allow us to prove that several of the transactions satisfy the access constraints, which has not been possible until now. In the consensus bus example, we were able to maintain an invariant which basically required that all local clocks have the same value. Since there is no global control mechanism available in this architecture, we elect to require a slightly weaker coupling invariant. Naturally, if a solution were to present itself which enabled us to maintain all the clocks in synchrony, it would satisfy the coupling invariant. Specifically, we will define the global time to be the minimum local time value, with the added restriction that there can only be at most two different local time values.
C R 16: inv. gclock(T) ≡ T = ⟨ min P,T' : clock(P,T') :: T' ⟩
C R 17: inv. 1 ≤ ⟨ Σ T : ⟨ ∃ P :: clock(P,T) ⟩ :: 1 ⟩ ≤ 2
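The two coupling invariants can be read as simple checks on a snapshot of the local clocks. The Python sketch below assumes the clocks are given as a mapping from processor number to local time; global_time and satisfies_cr17 are hypothetical names.

```python
from typing import Dict


def global_time(clocks: Dict[int, int]) -> int:
    """C R 16: the global time is the minimum local clock value."""
    return min(clocks.values())


def satisfies_cr17(clocks: Dict[int, int]) -> bool:
    """C R 17: at least one and at most two distinct local clock values."""
    return 1 <= len(set(clocks.values())) <= 2
```

For instance, the snapshot {0: 5, 1: 3, 2: 5} has global time 3 and satisfies C R 17, whereas {0: 5, 1: 4, 2: 3} holds three distinct values and violates it.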
The clock tuples are allocated to the same processor as the task transaction for the same node:
inv. locus(clock(P,T),i₁) ∧ locus(task(P,t,a),i₂) ⇒ i₁ = i₂
We re-write the transactions to use the new representation (router is not affected):
task(P,t,a) ≡
    s : a ≠ ⊥, clock(P,t), ⟨ ∀ Q,R,m :: ¬msg(R,Q,P,t,m) ⟩, state(P,s)
        → task(P,e(P,s,a),a(P,s,a)), state(P,s)†, state(P,s(P,s,a)), ⟨ Q : c(P,Q,s,a) :: msg(P,P,Q,l(P,Q,t,s,a),v(P,Q,s,a)) ⟩
    || NOR → task(P,t,a)

deliver(P) ≡
    Q,t,m,s : clock(P,t), state(P,s), msg(P,Q,P,t,m)
        → msg(P,Q,P,t,m)†, state(P,s)†, state(P,u(P,s,m))
    || TRUE → deliver(P)

gsearch ≡
    t : t = ⟨ min P,T : work(P,T) :: T ⟩ → gupdate(t)

gupdate(t) ≡
    clock(P,T) → ⟨ P :: clock(P,T)† ⟩, ⟨ P :: clock(P,t) ⟩, gsearch

Proof Obligations. Since the refinement does not change the semantics of any of the transactions, it is not necessary to re-verify adherence to the original specification. This refinement allows us to prove that deliver only accesses local data (C R 1-C R 2), and so it satisfies the entire architectural specification.

Outstanding Violations. The allocation of gsearch and gupdate remains to be decided, and the read violation by task remains unresolved.
Refinement 6. Distribute gupdate.
We now wish to distribute the update process to eliminate its violations of the specification. The obvious solution is to distribute the transaction, and pass control around the ring, allowing each processor to update its clock locally. We replace the gupdate transaction type with update. This transaction moves around the ring, updating the clock at each processor. We consider that the system is in the update phase whenever there is an update transaction at any site: Additionally, we allocate the update transaction to the same processor as the local clock:
A R 6: inv. locus(update(P,t),i₁) ∧ locus(clock(P,T),i₂) ⇒ i₁ = i₂

We can now write the update transaction definition. The first subtransaction is simply a distributed version of the subtransaction from gupdate; the second subtransaction is from C R 21; and the third is from C R 22 and C R 14. This transaction now satisfies the entire specification.

update(P,t) ≡
    clock(P,T) → clock(P,T)†, clock(P,t)
    || P < N-1 → update(right(P),t)
    || P = N-1 → gsearch

We also re-write the gsearch transaction to satisfy C R 20 and C R 13:

gsearch ≡
    t : t = ⟨ min P,T : work(P,T) :: T ⟩ → update(0,t)
Proof Obligations. We must show that the collection of update transactions satisfies the progress properties (C R 14 and C R 15) that were originally satisfied by gupdate, and that there is no violation of the characteristic safety properties for gupdate. The former follows directly from the transitivity of leads-to, and the latter is clear from examining the text of the transaction. Additionally, we can show that update performs only local reads and right writes, satisfying C R 1 and C R 2.
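The effect of passing update around the ring can be illustrated with a short Python sketch. It assumes that all local clocks agree when the update phase begins and that processors are visited in the order 0, 1, ..., N-1; the function name update_pass is hypothetical.

```python
from typing import List


def update_pass(clocks: List[int], t: int) -> List[List[int]]:
    """Simulate update(0,t), update(1,t), ..., update(N-1,t).
    Returns the vector of local clocks after each hop."""
    clocks = list(clocks)  # work on a copy
    snapshots = []
    for p in range(len(clocks)):
        clocks[p] = t      # clock(P,T) is replaced by clock(P,t)
        snapshots.append(list(clocks))
    return snapshots
```

Starting from clocks [3, 3, 3] and t = 7, the snapshots are [7, 3, 3], [7, 7, 3], and [7, 7, 7]. At most two distinct values are ever present, and the minimum (the global time under the weaker coupling invariant) stays at 3 until the last processor is updated.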
Outstanding Violations. gsearch has not yet been allocated. The only access violations left to resolve are those of C R 1 by gsearch and task.
Refinement 7. Distribute search.
Our solution is to distribute the search process in such a way as to guarantee that messages are delivered to their destinations before the local clock is updated. This will allow us to replace the global message query in task with a local query, eliminating the access violation. We propose the creation of a search transaction which is passed from processor to processor. When received by a processor, the time stamp carried on the transaction should be compared against the earliest work which the processor knows about, and if the processor knows of earlier work to be done, it changes the time stamp before passing the transaction on. If the search process makes two passes around the ring, then we can guarantee that the delivery times for all messages are considered. We propose the following form for search:
search(P,n,T)
having the meaning: the search process is at processor P, and at this point, the earliest work found is at time T. The transaction makes two passes around the ring; n contains the pass number.
As with the refinement to gupdate, we allow at most one search transaction to exist at any given time:
C R 23: ⟨ Σ P,n,T : search(P,n,T) :: 1 ⟩ ≤ 1
We introduce the obvious coupling invariant to map gsearch to search:
C R 24: inv. gsearch ≡ ⟨ ∃ P,n,T :: search(P,n,T) ⟩
and we do not allow the time stamp on the search transaction to increase:
C R 25: search(P,n,T) unless ⟨ ∃ P',n',T' : search(P',n',T') :: T' ≤ T ⟩
The allocation is the obvious one:
A R 7: locus(clock(P,t),i₁) ∧ locus(search(P,n,T),i₂) ⇒ i₁ = i₂
To guarantee that messages are delivered before the local clock is updated, we will want the search transaction to "push" messages around the ring. That is, messages should be forwarded before the transaction advances. This idea is captured formally in the following invariant, which states that on the second pass, all unreceived messages located at nodes "ahead" of the search transaction are in fact destined for processors "ahead" of the transaction:
C R 26: inv. search(P,2,t) ⇒ (msg(Q,R,S,t,m) ∧ Q twixt(P,N-1) ⇒ S twixt(P,N-1))
Since this invariant must hold as soon as the second pass begins, it in fact constrains the first pass, requiring that it be performed in a manner guaranteed to establish C R 26. The main significance of C R 26 is that it guarantees that the second pass will encounter every message. We introduce the function min_work(P), which simply computes the minimum time value for which work is known at processor P, i.e.,

min_work(P) ≡ ⟨ min T' : work(P,T') :: T' ⟩

Additionally, we require the transaction to move clockwise around the ring, updating its time stamp to reflect the minimum time for which work has been encountered. The actual properties differ slightly depending on where the transaction currently resides: This gives us the following form for the search transaction:

search(P,n,T) ≡
    Q,R,t,m : msg(P,Q,R,t,m), R ≠ P, t ≤ T → search(P,n,T)
    || NOR, P ≠ N-1 → search(right(P),n,min(T,min_work(P)))
    || NOR, P = N-1, n = 1 → search(0,2,min(T,min_work(P)))
    || NOR, P = N-1, n = 2 → update(0,min(T,min_work(P)))

The first subtransaction serves to maintain C R 26 by making the NOR of the other three subtransactions false as long as there are still messages to be forwarded (by router); the second, third, and fourth subtransactions are from C R 27, C R 28, and C R 29, respectively. We must also revise update to reflect C R 22.1.

update(P,t) ≡
    clock(P,T) → clock(P,T)†, clock(P,t)
    || P < N-1 → update(right(P),t)
    || P = N-1 → search(0,1,∞)

Finally, C R 26 implies that when the update transaction arrives at a node, all messages having an earlier time stamp have already arrived. The following invariant is maintained by the program:

C R 30: inv. clock(P,t) ∧ message(Q,R,P,t,m) ⇒ P = Q

This allows us to modify task to check for messages locally, eliminating its read violation as well:

task(P,t,a) ≡
    s : a ≠ ⊥, clock(P,t), ⟨ ∀ Q,m :: ¬msg(P,Q,P,t,m) ⟩, state(P,s)
        → task(P,e(P,s,a),a(P,s,a)), state(P,s)†, state(P,s(P,s,a)), ⟨ Q : c(P,Q,s,a) :: msg(P,P,Q,l(P,Q,t,s,a),v(P,Q,s,a)) ⟩
    || NOR → task(P,t,a)

Proof Obligations. We will need to show that the search and update transactions, as modified, continue to satisfy the constraints which initially motivated their form, e.g., C R 10-C R 15, and C R 19-C R 22. Additionally, we can now show that search does not violate the read access restrictions (C R 1), so the program now satisfies the entire architectural specification.
Proof Outline. The proof that the update transaction is correct is straightforward. Similarly, the invariance of C R 30 is obvious from the program text. More interesting is the proof that search satisfies the previous specification, in particular the progress properties C R 13-C R 15. As with the refinement of gupdate in the previous section, the approach is to show that C R 25-C R 29 amount to a refinement of C R 13, and that search satisfies the refined properties.
The proof amounts to showing that C R 26 is adequate to guarantee that the search process encounters every message before the end of the second pass. That is, if we can prove the following invariant, then the refinement is correct:
L7: inv. search(P,2,t) ⇒ ⟨ ∀ Q : 0 ≤ Q < P :: t ≤ min_work(Q) ⟩
Consider the implications of C R 26 when the second pass of the search process is at some processor P, which is also the destination for some message which must be processed before time can move forward. In order to have L7, it must be the case that the message under consideration has been folded into the computation of min_work, a result which in fact follows from C R 26. Since the location of the search process and the message destination are the same, we have P = S, so C R 26 allows us to conclude Q twixt(P,N-1) and Q twixt(R,P), so Q = S, that is, the message has arrived at its destination, so each of the min computations in C R 27-C R 29 is guaranteed to compute the minimum for all nodes up to P, which gives us L7. C R 13 follows immediately from L7 and the transitivity of leads-to.
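The need for the second pass can be seen in a small Python simulation. It is a deliberately simplified sketch, not the derived program: messages are reduced to (location, destination, time stamp) triples, pending actions are given as a list of times per processor, right(P) is taken to be (P+1) mod N, and routing happens only at the processor currently holding the search transaction (in the program, router runs asynchronously everywhere). The names min_work and run_search are hypothetical.

```python
from typing import List, Tuple

INF = float("inf")
Msg = Tuple[int, int, int]  # (location, destination, time stamp)


def min_work(p: int, action_times: List[List[int]], msgs: List[Msg]) -> float:
    """Earliest time for which work is known at p: a pending action at p
    or a message currently held at p."""
    times = list(action_times[p]) + [t for (loc, _d, t) in msgs if loc == p]
    return min(times, default=INF)


def run_search(action_times: List[List[int]], msgs: List[Msg],
               passes: int = 2) -> Tuple[float, List[Msg]]:
    """Pass the search around the ring `passes` times, starting with the
    time stamp at infinity. Returns the final stamp and the messages."""
    n = len(action_times)
    msgs = list(msgs)
    T = INF
    for _ in range(passes):
        for p in range(n):
            # First subtransaction: the search waits at p until every
            # message held at p with stamp <= T that is destined elsewhere
            # has been pushed one hop clockwise.
            msgs = [((loc + 1) % n, dst, t)
                    if (loc == p and dst != p and t <= T) else (loc, dst, t)
                    for (loc, dst, t) in msgs]
            # Remaining subtransactions: fold in the local minimum, move on.
            T = min(T, min_work(p, action_times, msgs))
    return T, msgs
```

With action_times = [[9], [], [8], []] and one message (2, 1, 5) held at processor 2 but destined for processor 1, a single pass ends with time stamp 8, because the message is pushed ahead of the search and never counted. The second pass finds it at its destination and ends with 5, the true minimum.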
Outstanding Violations. All the architectural constraints have been satisfied for all elements of the program.
The derivation process is complete.
Remarks
Common to all research on program refinement is the desire to create a repertoire of standard transformations which may be applied in a systematic manner, almost mechanically. Although we may be less willing to believe that program refinement is likely to succumb to such attempts at mechanization, we too discovered the existence of several broad classes of program refinements which, far from being mechanical in nature, entail only minimal re-verification of the program. This is a significant result since the re-verification of each new program could, in general, be a very time-consuming activity.
The first class of refinements could be termed straight data refinement. One example is provided by the distribution of the gclock tuple in the consensus bus architecture. This refinement has two characteristics which allow us to forgo most re-verification in the resulting program. First, the coupling invariant that ties the old and new data representations defines an equivalence relation and, thus, eliminates any need to re-write the specification to use the new representation. Second, the state transitions taking place in the old and new programs are also in a one-to-one relation. The resulting program thus satisfies the same safety and progress properties as the original one.
The second class of refinements are the synchronous process distribution refinements such as the distribution of gtick in the consensus bus example. No re-verification was required for this refinement because the semantics of the resulting synchronic group are identical to that of the original transaction. This refinement allowed us to convert one global transaction into a set of transactions which perform local actions but are synchronized so as to accomplish the same total effect. The resulting structure has great potential for parallel execution in a distributed architecture which provides the needed synchronization mechanisms. The refinement as performed in the example was aided by the fact that both the original query and actions were easily partitioned among the available processors.
A third class of refinements is the asynchronous process distribution. The distribution of the gdeliver transaction (in both examples) is illustrative of this class. One global transaction was replaced by a number of asynchronous local transactions. This was possible because the processing performed by the original transaction was itself "asynchronous" in the sense that each time the original was executed, the changes it made to the global system state were in fact local and non-interfering.
The fourth class of refinements is the serialization, as illustrated by the final refinement in the consensus bus architecture which combines two non-interfering transactions (task and deliver) into a single transaction. This refinement is likely to be useful any time multiple activities may need to be grouped together for architectural or other practical reasons, e.g., access to shared data. The transformation described above takes advantage of the fact that the two transactions perform actions which cannot interfere with one another: task can only execute when there are no messages, and deliver has nothing to do when all messages have been delivered. Had the two transactions interfered in some way, it would have been necessary to refine one to eliminate the interference before they could be combined.
Discussion
Initial research on program derivation dealt exclusively with sequential programs and relied upon the weakest-precondition calculus [10]. Two general program construction strategies emerged from this work: algorithm refinement [2, 12, 17, 18, 24], which is concerned with procedural abstractions, and data refinement [16, 19], which is concerned with data representations. As interest shifted into the realm of concurrent systems, it became necessary to consider new program derivation strategies. Chandy and Misra, in their work on UNITY [7], build on the legacy of algorithm and data refinement. They advocate working largely within the realm of specifications and deferring the writing of a program until the very end of the refinement process. To date there has been a great deal of program derivation work within the UNITY framework, e.g., [8, 13, 23]. Lamport's work on TLA (Temporal Logic of Actions) [14] provides an alternate perspective on specification refinement, one in which a program need not be viewed as being distinct from specifications but merely an executable specification. The refinement proceeds until the specification becomes executable on the desired architecture.
While specification refinement starts with a high level abstract view of the problem and gradually confers on it more concreteness, program refinement starts with a correct program early in the design process. Back and Kurki-Suonio [3], for instance, show that standard program transformations can be used to transform centralized systems (which may be easy to describe, but inefficient to execute) into decentralized systems, which can then be mechanically converted into CSP processes for execution. In related work, Back and Sere [4] apply a series of correctness-preserving transformations for the purpose of changing an initially large-grained (possibly sequential) program into a fine-grained, highly-concurrent one: they manipulate the program to reduce the interference among its statements and then use a "stock" refinement to distribute the computation. Similar program-transformation ideas have been explored by many other researchers, e.g., [11, 15], mainly in the context of the CSP model.
The notion of combining specification and program refinement (mixed refinement) seems to be a natural next step to consider in program derivation research. Our first experience with mixed refinement took place when we began using Swarm to investigate a formal framework for deriving parallel production systems [22]. We revisited the idea when we started looking at the notion of formalizing the specification of architectural constraints. In both instances we relied on specification refinement to achieve generality and on program refinement to seek compliance with a specific target architecture or computing environment. This strategy is based on the notion that specifications are abstract while programs are concrete and, therefore, more readily related to an architecture. Yet, specification refinement is usually targeted to a specific architecture. The question thus becomes: At what point is it appropriate to bring in architectural considerations? Although we do not have a definitive answer, our experience argues in favor of seeking a high degree of architecture independence before beginning to target the design to a specific architecture.
There are practical considerations that reinforce this view, e.g., early locking into the wrong architecture could invalidate significant amounts of work.
Conclusions
In this paper, we have shown that architectural constraints can be expressed using assertions about programs in the style of UNITY and Swarm logics, the same notation we use to write formal behavior specifications. By unifying the notations, we are able to factor the architectural constraints directly into the program derivation process.
Our current approach requires a program to be generated through specification refinement before the architectural constraints can be considered, because the architectural constraints involve assertions over auxiliary tuples (variables) whose introduction requires the presence of a program. To the best of our knowledge, this is the first time that architectural constraints have been specified formally and compliance of the program to the architecture for which it is intended has been verified formally. While our specification refinement method is not novel, building as it does on prior work with UNITY, the program refinement strategy includes many original ideas. The novelty stems from the reliance on Swarm and from the involvement of architectural constraints.
