The imposition of real-time constraints on a parallel computing environment (specifically, high-performance cluster-computing systems) introduces a variety of challenges with respect to the formal verification of the system's timing properties. In this paper, we briefly motivate the need for such a system, and we introduce an automaton-based method for performing such formal verification. We define the concept of a consistent parallel timing system: a hybrid system consisting of a set of timed automata (specifically, timed Büchi automata as well as a timed variant of standard finite automata), intended to model the timing properties of a well-behaved real-time parallel system. Finally, we give a brief case study to demonstrate the concepts in the paper: a parallel matrix multiplication kernel which operates within provable upper time bounds. We give the algorithm used, a corresponding consistent parallel timing system, and empirical results showing that the system operates under the specified timing constraints.
Introduction
Real-time computing has traditionally been considered largely in the context of single-processor and embedded systems, and indeed, the terms real-time computing, embedded systems, and control systems are often mentioned in closely related contexts. However, real-time computing in the context of multinode systems, specifically high-performance cluster-computing systems, remains relatively unexplored. It can be argued that one reason for the relative dearth of work in this area is the lack of scenarios to date which would require such a system. Previously [11, 12], we have motivated the emerging need for such an infrastructure, giving a specific scenario related to the next-generation North American electrical grid. In that work, we described the changes and challenges in the power grid driving the need for much higher levels of computational resources for power grid operations. To briefly summarize (and to provide some motivational context for the current work), many of these computations, particularly floating-point-intensive simulations and optimization calculations [2, 3, 8, 9, 10], can be more effectively done in a centralized manner. The amount and scale of such data is estimated by some [11, 12] to be on the order of terabytes per day of streaming sensor data (e.g., from Phasor Measurement Units (PMUs)), with the need to analyze the data within a strict cyclical window (every 30 ms), presumably with the aid of high-performance parallel computing infrastructures. With this in mind, the current work is part of a larger research effort at Pacific Northwest National Laboratory aimed at developing the necessary infrastructure to support an HPC cluster environment capable of processing vast amounts of streaming sensor data under hard real-time constraints.
While verifying the timing properties of a more traditional (e.g., embedded) real-time system poses complex questions in its own right, imposing real-time constraints on a parallel (cluster) computing environment introduces an entirely new set of challenges not seen in these more traditional environments. For example, in addition to standard real-time concepts such as worst-case execution time (WCET), real-time parallel computation introduces the necessity of considering worst-case transmission time when communicating over the network between nodes, as well as the need to ensure that the timing properties of one process do not invalidate those of the parallel process as a whole.
These are but two examples of the many questions which must be addressed in a real-time parallel computing system; certainly there are more questions than can be addressed in a single paper. To this end, we introduce a simple, event-driven, automata-based model of computation intended to model the timing properties of a specific class of parallel programs. Namely, we consider SPMD (Single Program, Multiple Data), parent-child type programs, in part because in practice, many parallel programs (including many prototypical MPI-based [13, 15] programs) fall into this category. We give an example of such a program in Section 3. This model is typified by the existence of a cyclic master or parent process, and a set of noncyclic child or slave processes amongst which work is divided. With this characterization, a very natural correspondence emerges between the processes and the automata which model them: the cyclic parent process is modeled by an ω-automaton, and the child processes by standard finite automata. The main contribution of this paper, then, is twofold: first, a formal method of modeling the respective processes in this manner and combining them into a single hybrid system of parallel automata, and second, a simple case study demonstrating a practical application of this system. We should note that the notion of parallel finite automata is not a new one; variants have been studied before (e.g., [6, 14]). We take the novel approach of combining timed variants [1, 5] of finite automata into a single hybrid model which captures the timing properties of the various component processes of a parallel system.
The rest of the paper proceeds as follows: Section 2 defines the automaton models used by our system: Timed Finite Automata in Section 2.1, Timed Büchi Automata in Section 2.2, and a hybrid system combining these two models in Section 2.3. Section 3 gives a case study in the form of an example real-time matrix multiplication kernel, running on a small, four-node real-time parallel cluster. Section 4 concludes.
Formalisms
In this section, we give formal definitions for the machinery used in our hybrid system of automata. The definitions given in Sections 2.1 and 2.2 are not new [1] . However, it is still important that we state their definitions here, as they are used later on, in Section 2.3.
Timed Finite Automata
In this section, we define a simple timed extension of traditional finite state automata and the words they accept. We will use these in later sections to model the timing properties of child processes in a real-time cluster system.
Timed strings take the form (σ̄, τ̄), where σ̄ is a string of symbols, and τ̄ is a monotonically increasing sequence of reals (timestamps). τ_x denotes the timestamp at which symbol σ_x occurs. We also use the notation (σ_x, τ_x) to denote a particular symbol/timestamp pair. For instance, the timed string ((abc), (1, 10, 11)) is equivalent to the sequence (a, 1)(b, 10)(c, 11), and both represent the case where 'a' occurs at time 1, 'b' at time 10, and 'c' at time 11.
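To make the notation concrete, the following is a minimal Python sketch of a timed string as a list of symbol/timestamp pairs. The names `make_timed_string` and `duration` are hypothetical, and the sketch assumes that "monotonically increasing" means strictly increasing; `duration` simply returns the elapsed time between the first and last symbols.

```python
from typing import List, Tuple

TimedString = List[Tuple[str, float]]  # sequence of (symbol, timestamp) pairs


def make_timed_string(symbols: str, timestamps: List[float]) -> TimedString:
    """Pair each symbol with its timestamp, checking that timestamps increase."""
    if len(symbols) != len(timestamps):
        raise ValueError("one timestamp is required per symbol")
    # Assumption: timestamps must be strictly increasing.
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later <= earlier:
            raise ValueError("timestamps must be strictly increasing")
    return list(zip(symbols, timestamps))


def duration(word: TimedString) -> float:
    """Elapsed time between the first and last symbols (0 for an empty word)."""
    if not word:
        return 0.0
    return word[-1][1] - word[0][1]


word = make_timed_string("abc", [1, 10, 11])
print(word)            # [('a', 1), ('b', 10), ('c', 11)]
print(duration(word))  # 10
```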
Correspondingly, we extend traditional finite automata to include a set of timers, which impose temporal restrictions along state transitions. A timer can be initialized along a transition, setting its value to 0 when the transition is taken, and it can be used along a transition, indicating that the transition can only be taken if the value of the timer satisfies the specified constraint. Formally, we associate with each automaton a set of timer variables T̄, and following the nomenclature of [1], an interpretation ν for this set of timers is an assignment of a real value to each of the timers in T̄. We write ν[T → 0] to denote the interpretation ν with the value of timer T reset to 0. Clock constraints consist of conjunctions of upper bounds:

Definition 1. For a set T̄ of clock variables, the set X(T̄) of clock constraints χ is defined inductively as

χ := (T < c)? | χ_1 ∧ χ_2

where T is a clock in T̄ and c is a constant in ℝ⁺.
While this definition may seem overly restrictive compared to some other treatments (e.g. [1] ), we believe it to be acceptable in this early work for a couple of reasons. First, while simple, this sole syntactic form remains expressive enough to capture an interesting, non-trivial set of use cases (e.g. Section 3). Secondly, the timing analysis in subsequent sections of the paper becomes rather complex, even when timers are limited to this single form. Restricting the syntax in this manner simplifies this analysis to a more manageable level. We leave more complex formulations and the corresponding analysis for future work.
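Definition 2 (Timed Finite Automaton (TFA)). A Timed Finite Automaton (TFA) is a tuple ⟨Σ, Q, q_0, q_f, T̄, δ, γ, η⟩, where
• Σ is a finite alphabet,
• Q is a finite set of states,
• q_0 ∈ Q is the start state,
• q_f ∈ Q is the accepting state,
• T̄ is a set of clocks,
• δ ⊆ Q × Q × Σ is the state transition relation,
• γ ⊆ δ × 2^T̄ is the clock initialization relation, and
• η ⊆ δ × X(T̄) is the clock constraint relation.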
A tuple ⟨q_i, q_j, σ⟩ ∈ δ indicates that a symbol σ yields a transition from state q_i to state q_j, subject to the restrictions specified by the timer constraints in η. A tuple ⟨q_i, q_j, σ, {T_1, ..., T_n}⟩ ∈ γ indicates that on the transition on symbol σ from q_i to q_j, all of the specified timers are to be initialized to 0. Finally, a tuple ⟨q_i, q_j, σ, χ⟩ ∈ η, with χ ∈ X(T̄), indicates that the transition on σ from q_i to q_j can only be taken if the constraint χ evaluates to true under the current timer interpretation.
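The roles of δ, γ, and η can be illustrated with a small simulator. The following Python sketch is not the paper's implementation: the class `TFA`, its method `accepts`, and the state names are hypothetical, and it assumes that a constraint is checked before any reset on the same transition and that a timer which has never been reset has been running since time 0.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]            # (source state, target state, symbol)
Bounds = Dict[str, float]              # conjunction of upper bounds: timer < bound
TimedWord = List[Tuple[str, float]]    # (symbol, timestamp) pairs


class TFA:
    def __init__(self, start: str, accept: str, delta: Set[Edge],
                 gamma: Dict[Edge, Set[str]], eta: Dict[Edge, Bounds]) -> None:
        self.start, self.accept = start, accept
        self.delta, self.gamma, self.eta = delta, gamma, eta

    def accepts(self, word: TimedWord) -> bool:
        """Return True if some run over the word ends in the accepting state."""
        # A configuration is (state, reset_times); reset_times[T] is the
        # timestamp at which timer T was last set to 0.
        configs: List[Tuple[str, Dict[str, float]]] = [(self.start, {})]
        for symbol, now in word:
            successors = []
            for state, reset_times in configs:
                for edge in self.delta:
                    src, dst, sym = edge
                    if src != state or sym != symbol:
                        continue
                    # Assumption: an unreset timer has been running since time 0.
                    bounds = self.eta.get(edge, {})
                    if all(now - reset_times.get(t, 0.0) < c for t, c in bounds.items()):
                        updated = dict(reset_times)
                        for timer in self.gamma.get(edge, set()):
                            updated[timer] = now   # the timer's value becomes 0 now
                        successors.append((dst, updated))
            configs = successors
        return any(state == self.accept for state, _ in configs)


# Hypothetical encoding of strings ab*c whose 'c' occurs less than 10 after the 'a'.
delta = {("q0", "q1", "a"), ("q1", "q1", "b"), ("q1", "qf", "c")}
gamma = {("q0", "q1", "a"): {"T"}}         # T is reset when 'a' is read
eta = {("q1", "qf", "c"): {"T": 10.0}}     # (T < 10)? guards the 'c' transition
example = TFA("q0", "qf", delta, gamma, eta)

print(example.accepts([("a", 1), ("b", 5), ("c", 10.5)]))  # True
print(example.accepts([("a", 0), ("c", 10)]))              # False
```

Tracking reset timestamps instead of timer values avoids advancing every timer on each step; the value of a timer is recovered as the current timestamp minus its last reset.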
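Example 3. The following TFA accepts the timed language {(ab*c, τ_1...τ_n) | τ_n − τ_1 < 10} (i.e., strings of the form ab*c in which the last symbol occurs less than 10 time units after the first).

Definition 4 (Path). Let A be a TFA with state set Q and transition relation δ. Then (q_1...q_n) is a path over A if for all 1 ≤ i < n, ∃σ. ⟨q_i, q_{i+1}, σ⟩ ∈ δ.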
Definition 5 (Run). A run r of a TFA ⟨Σ, Q, q_0, q_f, T̄, δ, γ, η⟩ over a timed word (σ̄, τ̄) is a sequence of the form r : (q_0, ν_0)
satisfying the following requirements:
A TFA A accepts a timed string s = (σ̄, τ_1...τ_n) if there is an accepting run of s over A, and τ_n − τ_1 is called the duration of the string.
Note (Well-Formedness
Figure 1: Malformed TFAs. Start states are denoted with a dashed circle, and accepting states with a double line. The intent of A_1 is to allow strings of the form a, followed by arbitrarily many bs, as long as they all occur less than 10 units after the a, followed by a c. The intent of A_2 is to allow strings of the form abc, where the elapsed time between the a and b is less than 10, and that between the a and c is less than 20. Both of these can be rewritten using conforming automata, as shown in Figure 2.
Bounding Maximum Delay
An important notion throughout the remainder of the paper is that of computing bounds on the allowable delays along all possible paths through a TFA. Specifically, we are interested in doing so to be able to reason formally about the maximum execution time for a child process, with the end goal of being able to bound the execution time of the system-parent and all child processes-as a whole.
The idea is that we will ultimately use TFAs to represent the timing properties of a child process. Paths through the automaton from its start state to an accepting state correspond to possible execution paths of the child process's code. Certainly, proving a tight upper bound on the delay between two arbitrary points along an execution path remains a very difficult problem, but to be clear, this is not our goal. Rather, our approach involves modeling an execution path through a child process (and, by extension, its corresponding timed automaton) using an event-based model, in which selected system events are modeled by transitions in the automaton, and we rely on timing properties of the process to be guaranteed by the underlying RTOS process scheduler. The problem of computing the worst-case delay through the automaton equates to that of computing the maximum delay over all possible paths through the automaton from its start state to its accepting state: ∆_A = max{∆(p) | p ∈ paths(A)}, where
• paths(A) denotes the set of all paths in A from its start state q 0 to accepting state q f , and
• ∆(p), for path p = (q_0, ..., q_f), denotes the maximum delay from q_0 to q_f. That is, the maximum duration of any timed string (σ̄, τ̄) such that (q_0...q_f, ν̄) is a run of the string over A (for some ν̄).
This problem can thus be formulated in the following manner: given a timed finite automaton A and an integer n, is there a timed word of duration d ≥ n that is accepted by A? While simple cases, such as those presented in this paper, can be computed by observation and enumeration, the complexity of the general problem remains an open question, although we strongly suspect it to be intractable: Courcoubetis and Yannakakis give exponential-time algorithms for this and related problems, and have shown a strictly more difficult variant of the problem to be PSPACE-complete [4]. Furthermore, expanding the timer constraint syntax to a more expressive variant (cf. [1]) can only complicate matters in terms of complexity. We must be cautious, then, to ensure that we do not impose an inordinately large number of timers on a child process.
Timed Büchi Automata
Whereas we model the timing properties of the child processes of a cluster system using the timed finite automata of the previous section, we model these properties of the parent using a timed variant of ω-automata, specifically Timed Büchi Automata. We assume a basic familiarity with these; due to space constraints, we give only a brief overview here. ω-automata, like standard finite automata, consist of a finite number of states, but instead operate over words of infinite length. Classes of ω-automata are distinguished by their acceptance criteria. Büchi automata, which we consider in this paper, are defined to accept their input if and only if a run over the input string visits an accepting state infinitely often. Other classes of ω-automata exist as well. For example, Muller automata are more stringent, specifying their acceptance criteria as a set of acceptance sets; a Muller automaton accepts its input if and only if the set of states visited infinitely often is specified as an acceptance set. More detailed specifics can be found elsewhere (for example, [1]).
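The two acceptance criteria can be compared on runs that are ultimately periodic, i.e., a finite prefix followed by a cycle repeated forever. The following Python sketch makes that restriction as an assumption, ignores timing constraints entirely, and uses hypothetical function names.

```python
from typing import List, Sequence, Set


def states_visited_infinitely_often(cycle: Sequence[str]) -> Set[str]:
    """For a run of the form prefix . cycle^omega, exactly the states on the
    cycle recur forever; prefix states are visited only finitely often."""
    if not cycle:
        raise ValueError("the repeated part of an infinite run must be non-empty")
    return set(cycle)


def buchi_accepts(cycle: Sequence[str], accepting: Set[str]) -> bool:
    """Buchi: some accepting state is visited infinitely often."""
    return bool(states_visited_infinitely_often(cycle) & accepting)


def muller_accepts(cycle: Sequence[str], acceptance_sets: List[Set[str]]) -> bool:
    """Muller: the set of infinitely visited states is itself an acceptance set."""
    return states_visited_infinitely_often(cycle) in acceptance_sets


# Run q0 (q1 q2)^omega: q0 is only in the prefix, so only q1 and q2 recur.
print(buchi_accepts(["q1", "q2"], {"q2"}))               # True
print(buchi_accepts(["q1", "q2"], {"q0"}))               # False
print(muller_accepts(["q1", "q2"], [{"q2"}]))            # False
print(muller_accepts(["q1", "q2"], [{"q1", "q2"}]))      # True
```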
A Timed Büchi Automaton (TBA) is a tuple ⟨Σ, Q, q_0, F, T̄, δ, γ, η⟩, where
• q_0 ∈ Q is the start state,
• F ⊆ Q is a set of accepting states, and
• Σ, Q, T̄, δ, γ, and η are as in Definition 2.
A tuple ⟨q_i, q_j, σ⟩ ∈ δ indicates that a symbol σ yields a transition from state q_i to state q_j, subject to the restrictions specified by the clock constraints in η. A tuple ⟨q_i, q_j, σ, T̄⟩ ∈ γ indicates that on the transition on symbol σ from q_i to q_j, all clocks in T̄ are to be initialized to 0. Finally, a tuple ⟨q_i, q_j, σ, χ⟩ ∈ η, with χ ∈ X(T̄), indicates that the transition on σ from q_i to q_j can only be taken if the constraint χ evaluates to true under the values of the current timer interpretation.
We define paths, runs, and subruns over a TBA analogously to those over a TFA:
Definition 6 (Path (TBA)). Let A be a TBA with state set Q and transition relation δ .
...
satisfying the same requirements as given in Definition 5.
For a run r, the set inf(r) denotes the set of states which are visited infinitely many times. A TBA A with final states F accepts a timed word w = (σ̄, τ̄) if inf(r) ∩ F ≠ ∅, where r is the run of w on A. That is, a TBA accepts its input if any of the states from F repeats an infinite number of times in r. This TBA accepts the ω-language L_1 = {((ab*c)^ω, τ̄) | ∀x.∃i, j.∀k.φ} where φ is the boolean formula
Lastly, we take the concept of maximum delay, introduced in the previous section with respect to Timed Finite Automata, and extend it to apply to Timed Büchi Automata. Doing so first requires the following definition, which allows us to restrict the timing analysis for TBAs to finite subwords:

Definition 9 (Subword over q̄). Let A be a TBA, and let q̄ = (q_m...q_n) be a finite path over A. A finite

Definition 9 is a technicality which is necessary to support the following definition of the maximum delay between states of a TBA:

Definition 10. Let A be a TBA, and let q̄ be a finite path over A. Then ∆_A(q̄) is the maximum duration of any subword over q̄.
Algorithmically computing ∆ A (q) for a TBA A is analogous to the case for TFAs; in small cases (i.e., relatively few timers with small time constraints), the analysis is relatively simple, while we conjecture the problem for more complex cases to be intractable; we leave more detailed analysis for future work.
Parallel Timing Systems
Next, we model the timing properties of an SPMD-type parallel system as a whole by combining the two models of Sections 2.1 and 2.2 into a single parallel timing system. A parallel timing system (PTS) is a tuple ⟨P, Ā, ψ, ϕ⟩, where
• P = ⟨Σ, Q, q_0, F, T̄, δ, γ, η⟩ is a TBA (used to model the timing properties of the parent process)
• Ā is a set {A_1, ..., A_n} of TFAs (used to model the timing properties of the child processes)
• ψ ⊆ δ ×Ā is a fork relation (used to model the spawning of child processes)
• ϕ ⊆ δ ×Ā is a join relation (used to model barriers (joins))
A tuple ⟨q_i, q_j, σ, A⟩ in ψ, with A ∈ Ā, indicates that an instance of A is to be "forked" on the transition from q_i to q_j on symbol σ. This "fork" is denoted graphically as q_i -Ψ(A)-> q_j, modeling the spawning of a child process along the transition. Similarly, a tuple ⟨q_i, q_j, σ, A⟩ in ϕ indicates that a previously forked instance of A is to be "joined" on the transition from q_i to q_j on symbol σ. This "join" is denoted graphically as q_i -Ω(A)-> q_j, modeling the joining along the transition with a previously spawned child process.¹
Example 12. Consider the following timing system S_1 = ⟨P, {A}, ψ, ϕ⟩:

¹ In theory, child processes could spawn children of their own (e.g., recursion). For now, however, we disallow this possibility, as it somewhat complicates the analysis in the following section without adding significantly to the expressive power of the model. The model can be expanded later to allow for arbitrarily nested children of children with the appropriate modifications; specifically, TBAs would need to be extended to include their own ψ and ϕ relations, as would the definition of ∆ for TBAs.
Before proceeding, it is important to note that a PTS S = ⟨P, Ā, ψ, ϕ⟩ is not itself interpreted as an automaton. In particular, we never define a language accepted by S. Indeed, it is not entirely clear what such a language would be, as we never specify the input to any of the children in Ā. Rather, the sole intent in specifying such a system S is to specify the timing behavior of the overall system, rather than any particular language that would be accepted by it.
Consistency
With this said, we note that in Example 12, A is in some sense "consistent" with its usage in P. Specifically, since the maximum duration of any string accepted by A is 10, we are guaranteed that any instance of A forked on the q_1 -a-> q_2 transition will have completed in time for the "join" along the q_2 -c-> q_1 transition, and hence the timer constraint (T < 50)? on this transition would be respected in all cases. In this sense, all (Ψ(A), Ω(A)) pairs are consistent with timer T. However, such consistency is not always the case. Consider, for instance, the parallel timing system S_2 shown in Figure 3. In this case, there are two child processes: A and B. The maximum duration of a timed word accepted by A is 10, and that of B is 20. Supposing that an 'a' occurs (and A is forked) at time 0, it is thus possible that A will not complete until time 10 − ε_1, at which time the 'b' and the fork of B can proceed. It is therefore possible that B will not complete until time 30 − ε_1 − ε_2 (for small ε_1, ε_2). This would then violate the (T < 25)? constraint, corresponding to a case in which a child process could take longer to complete than is allowable, given the timing constraints of the parent process. It is precisely this type of interference which we must disallow in order for a timing system to be considered consistent with itself.
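The arithmetic behind this counterexample can be written out as a short Python sketch. The function names are hypothetical, and the sketch assumes that each child is forked only once the previous child has completed and that any completion time strictly below a child's bound can occur.

```python
from typing import Sequence


def sequential_worst_case(child_bounds: Sequence[float]) -> float:
    """Supremum of the total completion time when the children run one after
    another and each child's delay is strictly below its bound."""
    return sum(child_bounds)


def may_violate(child_bounds: Sequence[float], parent_bound: float) -> bool:
    """True if some allowed execution breaks the parent's (T < parent_bound)?."""
    # Completion times come arbitrarily close to the supremum without reaching
    # it, so a violating execution exists exactly when the supremum lies
    # strictly above the parent's bound.
    return sequential_worst_case(child_bounds) > parent_bound


# The two children discussed above: bounds 10 and 20, parent constraint (T < 25)?.
print(sequential_worst_case([10, 20]))  # 30
print(may_violate([10, 20], 25))        # True
# A single child with bound 10 under the constraint (T < 50)?.
print(may_violate([10], 50))            # False
```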
To this end, we propose a method of defining consistency within a timing system. Informally, we take the approach of deriving a new set of conditions from the timing constraints of the child processes, so that checking consistency reduces to the process of verifying that these conditions respect the timing constraints of the master process.
First, we replace Ā, ψ, and ϕ from the parallel timing system with a new set of derived timers, one for each A ∈ Ā, defining the possible "worst case" behavior of the child processes. Each such timer T_A is initialized on the transition along which the corresponding A is forked, and is used along (constrains) any transitions along which A is joined. Each such use ensures that the timer is less than ∆_A, representing the fact that the elapsed time between the forking and joining of a child process is bounded in the worst case by ∆_A, the longest possible duration for the child process. As an example, "flattening" the timing system S_1 of Example 12 results in a single new timer T_A, initialized along the q_1 -a-> q_2 transition, and used along the q_2 -c-> q_1 transition with the constraint (T_A < 10)?. We then check that none of these new derived timers invalidate the timing constraints of the parent process.
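The flattening step described above can be sketched in a few lines of Python. This follows the informal description only: the function name `flatten`, the `"T_" + child` naming of derived timers, and the dictionary encoding of the resulting relations are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]           # (q_i, q_j, symbol)
ForkJoin = Set[Tuple[Edge, str]]      # (edge, child name) pairs, as in psi and phi


def flatten(psi: ForkJoin, phi: ForkJoin, max_delay: Dict[str, float]):
    """Derive one timer per child: reset where the child is forked, and
    bounded by the child's maximum delay where the child is joined."""
    gamma: Dict[Edge, Set[str]] = {}          # edge -> derived timers reset on it
    eta: Dict[Edge, Dict[str, float]] = {}    # edge -> {derived timer: upper bound}
    for edge, child in psi:
        gamma.setdefault(edge, set()).add("T_" + child)
    for edge, child in phi:
        eta.setdefault(edge, {})["T_" + child] = max_delay[child]
    return gamma, eta


# S_1 as described in the text: A is forked on a, joined on c, and its maximum delay is 10.
psi = {(("q1", "q2", "a"), "A")}
phi = {(("q2", "q1", "c"), "A")}
gamma, eta = flatten(psi, phi, {"A": 10})
print(gamma)  # {('q1', 'q2', 'a'): {'T_A'}}
print(eta)    # {('q2', 'q1', 'c'): {'T_A': 10}}
```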
Formally, we define two relations. The first of these is flattening, which takes a parallel timing system ⟨P, Ā, ψ, ϕ⟩ and yields a new pair of relations (γ, η). Intuitively, γ defines the edges along which each of the derived timers is initialized, and η defines the edges along which each of the derived timers is used:

Definition 13. Let S = ⟨P, Ā, ψ, ϕ⟩ be a parallel timing system. Then flatten(S) = (γ, η), where
The second relation takes ψ and ϕ as inputs and extracts a set of edge pairs, defined such that each such pair (e_1, e_2) specifies when a derived timer is initialized (e_1) and used (e_2).

Definition 14. Let S = ⟨P, Ā, ψ, ϕ⟩ be a parallel timing system, with A ∈ Ā. Then the set of all use pairs of A in S is defined as

pairs(A, S) = {((q_x, q_y), (q_m, q_n)) | (⟨q_x, q_y, σ_1, A⟩ ∈ ψ) ∧ (⟨q_m, q_n, σ_2, A⟩ ∈ ϕ)} for some σ_1, σ_2.

As an example, consider a system S with two children A and B, whose parent P resets T and forks A on the q_1 -a-> q_2 transition, forks B on the q_2 -b-> q_3 transition, and joins both A and B on the q_3 -c-> q_1 transition, which is guarded by (T < 24)?. Flattening yields

γ = {⟨q_1, q_2, a, {T_A}⟩, ⟨q_2, q_3, b, {T_B}⟩}
η = {⟨q_3, q_1, c, X⟩}, where X = (T_A < 25) ∧ (T_B < 11),

shown graphically in Figure 5, and

pairs(S) = pairs(A, S) ∪ pairs(B, S).

We can now proceed with a formal definition of consistency for a parallel timing system. Recall that intuitively, such a system is consistent if the worst-case timing scenarios over all child processes will not invalidate the timing constraints of the parent process; in other words, if the maximum delay between two states allowed by the child processes never exceeds the corresponding maximum delay allowed by the timers in the parent process.
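As an illustration of Definition 14, the use pairs can be computed directly from ψ and ϕ. The following Python sketch uses hypothetical function names and encodes each element of ψ and ϕ as an (edge, child name) pair; the example data is the two-child fragment whose γ and η appear in this section.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]                           # (q_i, q_j, symbol)
ForkJoin = Set[Tuple[Edge, str]]                      # (edge, child name) pairs
EdgePair = Tuple[Tuple[str, str], Tuple[str, str]]    # ((q_x, q_y), (q_m, q_n))


def pairs_of(child: str, psi: ForkJoin, phi: ForkJoin) -> Set[EdgePair]:
    """Every (fork edge, join edge) combination for one child; symbols are dropped."""
    return {((f[0], f[1]), (j[0], j[1]))
            for f, forked in psi if forked == child
            for j, joined in phi if joined == child}


def all_pairs(children: Iterable[str], psi: ForkJoin, phi: ForkJoin) -> Set[EdgePair]:
    """Union of the use pairs over all children."""
    result: Set[EdgePair] = set()
    for child in children:
        result |= pairs_of(child, psi, phi)
    return result


# A is forked on (q1, q2), B on (q2, q3); both are joined on (q3, q1).
psi = {(("q1", "q2", "a"), "A"), (("q2", "q3", "b"), "B")}
phi = {(("q3", "q1", "c"), "A"), (("q3", "q1", "c"), "B")}
print(pairs_of("A", psi, phi))  # {(('q1', 'q2'), ('q3', 'q1'))}
print(pairs_of("B", psi, phi))  # {(('q2', 'q3'), ('q3', 'q1'))}
```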
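Definition 16 (Consistency). Let S = ⟨P, Ā, ψ, ϕ⟩ be a PTS, and let P′ be the TBA obtained from P by replacing its timers with the derived timers given by flatten(S). Then S is consistent if for all edge pairs ((q_x, q_y), (q_m, q_n)) ∈ pairs(S) and all paths p = q_x q_y ... q_m q_n through P,

∆_P′(p) ≤ ∆_P(p). (16.1)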
We conclude this section with a few simple examples, which should help to clarify Definition 16; the following section gives a more realistic example.
Example 17. S 1 is consistent.
Proof. P ′ , the result of flattening S 1 , is shown below, with T A being the derived timer corresponding to A:
Furthermore, pairs(P′) = {((q_1, q_2), (q_2, q_1))}, and by observation, all paths through P′ beginning with the edge (q_1, q_2) and ending with the edge (q_2, q_1) take the form q_1 (q_2)* q_1. All such paths p satisfy inequality 16.1, and thus by definition, S_1 is consistent.
Example 18. S 3 is not consistent.
Proof. P′, the result of flattening S_3, is shown in Figure 5. Furthermore, pairs(P′) = {((q_1, q_2), (q_3, q_1)), ((q_2, q_3), (q_3, q_1))}.

There are thus two paths against which we need to test inequality 16.1: (q_1 q_2 q_3 q_1) and (q_2 q_3 q_1). The first of these fails the test: ∆_P′(q_1 q_2 q_3 q_1) = 25, and ∆_P(q_1 q_2 q_3 q_1) = 24, so the delay allowed by the child processes exceeds that allowed by the parent.
Case Study: Matrix Multiplication
We now turn our attention to a practical application of the concepts discussed so far. Namely, we demonstrate the use of the formal validation concepts on a simple parallel, MPI-style [13, 15] matrix multiplication kernel, extracted from the larger power-grid analysis application described in [11, 12]. Our kernel implements a variant of Fox's algorithm for matrix multiplication [7]. For simplicity, we assume square matrices, and that the number of columns, rows, and processors are all perfect squares. The algorithm distributes the task of multiplying two matrices amongst all processors in the system. We give a simple distributed algorithm for matrix multiplication, and a consistent parallel timing system for that algorithm. We conclude the section with empirical results: timing measurements taken on a small, four-node real-time cluster, each node consisting of dual quad-core 2.66 GHz Xeon X5660 processors with 48 GB RAM, running the Xenomai RTOS. The timing measurements of the PTS, along with the usual restrictions associated with real-time computation (e.g., no virtual memory or paging, process scheduling, ensuring minimal variance in execution timings, etc.), are bounded by virtue of Xenomai's real-time process scheduler. The result is a matrix multiplication kernel which provably runs in under 9 ms per cycle for 128 × 128 double-precision matrices. We emphasize that we are not claiming the speed of the operation to be a groundbreaking result; this is a relatively small matrix size, chosen because it is on the order of the size required by our targeted application kernel. Rather, we give these numbers, as well as the PTS, to illustrate the process by which we analyze the temporal interactions between processes, thus showing this delay to be a provable upper bound.

Algorithm

Algorithm 1 (excerpt):
4: if self == 0 then {Master process}
5: for i = 0 to q − 1 do
6: for j = 0 to q − 1 do
...
11: if i = 0 and j = 0 then {Master already has these chunks}
12: send(X̄, dest)
13: send(Ȳ, dest)
14: dest ← dest + 1
...

The pseudocode for the algorithm is given in Algorithm 1. Conceptually, to multiply two N × N matrices A and B using a p-processor cluster, each matrix is divided into segments, which are then distributed in round-robin fashion amongst the processors of the cluster. Each processor then performs a local matrix multiplication on its own local submatrices, and the results of these local operations are aggregated (reduced) to form the matrix product A × B.
Due to space constraints, we will not describe the partitioning in detail; Figure 6 shows the partitioning and distribution of work by Algorithm 1 for a four-processor cluster. In this figure, A, B, and C are all N × N matrices. A is partitioned into 2 sets of N/2 rows each, and B is partitioned into 2 sets of N/2 columns each. The master process, p_0, computes the local product A_1 × B_1, and writes the result to C_1. p_0 then sends submatrices A_1 and B_2 to p_1, which then computes their product, writing the result to C_2. Similarly, p_2 receives and computes C_3 = A_2 × B_1, and p_3 receives and computes C_4 = A_2 × B_2.
The algorithm proceeds as follows: the master process executes lines 5 through 17, which partition A and B into submatrices (lines 7-10), and send these parts out to the respective child processes (lines 12-13). Conversely, the child processes execute lines 19-20, which receive the submatrices assigned by the master process. Lines 22-23 are run by all processes, including the master process (which, in this case, participates in the task of matrix multiplication as well). Line 22 performs the local operation, and line 23 writes the local result to the appropriate location in C. The entire process then repeats indefinitely, as given by the while loop (lines 2 and 24).
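The partitioning described above can be checked with a small single-process simulation. The following Python sketch is not the MPI kernel itself: it replaces the sends, receives, and reduce with ordinary loops, uses hypothetical function names, and assumes that the matrix dimension is divisible by q (q = 2 for the four-processor case).

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain triple-loop product of an (m x k) matrix and a (k x n) matrix."""
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def partitioned_matmul(a: Matrix, b: Matrix, q: int = 2) -> Matrix:
    """Multiply N x N matrices by splitting a into q row blocks and b into
    q column blocks; block (i, j) is the work of process i * q + j."""
    n = len(a)
    if n % q != 0:
        raise ValueError("matrix dimension must be divisible by q")
    h = n // q
    c = [[0.0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            rows = a[i * h:(i + 1) * h]                    # row block of a
            cols = [row[j * h:(j + 1) * h] for row in b]   # column block of b
            local = matmul(rows, cols)                     # local h x h product
            for r in range(h):                             # write back into c
                c[i * h + r][j * h:(j + 1) * h] = local[r]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(partitioned_matmul(a, b))  # [[19, 22], [43, 50]]
```

With q = 2, the four (i, j) blocks correspond to C_1 = A_1 × B_1, C_2 = A_1 × B_2, C_3 = A_2 × B_1, and C_4 = A_2 × B_2.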
Figure 6: Partitioning and distribution of matrix multiplication by Algorithm 1 across a four-processor cluster.

Parallel Timing System

Figure 7 shows a parallel timing system for Algorithm 1 across a four-processor cluster, consisting of the TBA P_MM, which models the master process, and a child TFA A_MM, modeling instances of the child processes. Specific events have been elided from the diagram, since events in this case always represent transitions between statements.

Parent
States in the parent automaton P are prefixed with a 'P', followed by the line number as given in Algorithm 1. For example, P3 corresponds to the state of the parent process as it is executing line 3. Additionally, lines 12 and 13 each beget three separate states-parameterized on the values of the loop induction variables i and j-and are labeled accordingly. As is commonly the case in WCET analysis, unrolling the loop nest in this fashion is necessary in order to obtain a strict upper bound on the number of iterations and, consequently, the total execution time, of the loop nest.
P forms, in this case, a simple cycle. The cycle starts at state P3, and steps sequentially through the steps (states) of the algorithm. Namely, the parent process starts at line 3 (i.e., state P3), and proceeds sequentially through lines 4 (state P4), and eventually to line 12 (P12 i=0 j=1 ). The delay between the initialization (state P3) and the first send (P12 i=0 j=1 ) is bounded by a timer, T setup1 (the idea being that this is the delay incurred by the time to "set up" the first send). Execution then proceeds to line 13 (P13 i=0 j=1 ); the delay along this transition represents the time to send the first chunk to the respective child process, and is bounded by timer T send1 . At this point, execution proceeds to line 14 (P14 i=0 j=1 ). Along this transition, there are two items to note: first, the time to process the second send is bounded by the timer T send2 , and second, the child process has now been sent the data it needs, and consequently, A 1 is forked. Execution proceeds similarly through the next six states, representing the unwound iterations of the loop nest. Child process A 2 is similarly forked on the transition from P13 i=1 j=0 to P14 i=1 j=0 , and A 3 on the transition from P13 i=1 j=1 to P14 i=1 j=1 . Execution then proceeds through lines 22 (state P22) and 23 (P23). The duration of the local matrix multiplication operation (line 22) is bounded by the timer T MM , and that through the reduce operation (line 23) by the timer T reduce . Additionally, the transition from P23 back to P3 waits for (joins with) all child processes to complete before proceeding.
Child
In this case, the child processes are modeled by the TFA A. Nomenclature is analogous to that of P: states in A are prefixed with an A, followed by the corresponding line number from Algorithm 1.
The child process starts at line 18 (state A18). The process then proceeds to receive the first block of data (line 19, state A19). The time to process the receive is bounded by timer T_recv1. Execution proceeds to receive the second block of data (line 20, state A20). The time to process this second receive is bounded by T_recv2. Execution proceeds next to the local matrix multiplication (line 22, state A22); the time spent on this operation is bounded by timer T_locMM. Finally, execution proceeds to the data writeback (line 23, state A23); the time spent on this operation is bounded by timer T_reduce.

Theorem 19. S_MM is consistent.

Proof. Let flatten(S_MM) = (γ, η). Then pairs(S_MM) = {(e_1, e_4), (e_2, e_4), (e_3, e_4)}, where e_1, e_2, and e_3 are the edges along which A_1, A_2, and A_3 are forked, and e_4 is the edge from P23 to P3 along which they are joined.
By observation, there are three paths which we must consider:
The rest of the proof follows by enumeration:
Finally, we note that the worst-case delay along one iteration of the algorithm is 8.9 ms. This follows from the observation that the parent automaton P takes the form of a simple cycle with no unbounded segments (i.e., subpaths which are not constrained by any timer). Specifically, P consists of consecutive pairs of segments, each constrained by pairs of timers. Consequently, we can derive an upper bound for a single iteration of the algorithm by summing the bounds of all of the timers, yielding the specified upper bound. Combined with Theorem 19, which ensures that the timing of the child processes does not invalidate this bound, we are left with a cyclic, parallel, time-bounded matrix multiplication kernel.

Figure 7: Parallel timing system S_MM for Algorithm 1 across a four-processor cluster. Events have been elided for the sake of clarity. Upper bounds on timer constraints correspond to delay measurements taken over our implementation; times are given in milliseconds. Minimal variance from these bounds is ensured to the extent provided by the underlying RTOS.
Concluding Remarks
We conclude with a few closing remarks. We have presented a formal system for modeling the temporal properties of a restricted class of real-time parallel systems, with a simple example of an application kernel. As is usually the case with real-time systems, loops need to be unrolled, bounding the number of iterations, in order to obtain an upper bound on the total execution time of the loop. Algorithm 1 (intentionally) distills to a relatively simple PTS, due to the basic structure of the control flow graph of both the parent and child processes; more complex examples are of obvious interest for future work. Similarly, the model in Figure 7 in our case was derived manually-in this case, a relatively simple task. More complex examples can certainly prove to be more of a challenge, and automated tools for this task are desirable. One possible approach for such automation would be compiler-driven, whereby users could specify to the compiler (via #pragmas, for instance), events of interest, and the compiler could proceed to output the appropriate annotated control flow graph.
We assume timing behavior is consistent across all child processes, although if there were to be significant variance across child processes (e.g. heterogeneous or NUMA architectures) we could account for such behavior using different child TFAs.
Additionally, we have laid out several interesting open questions which arise out of the analysis of our relatively straightforward formulation: what is the complexity of computing the worst-case delay along a single path of a TFA (TBA), and through a TFA (TBA) in general? Up to this point, we have only considered conjunctions of maximum constraints; how does this change in the presence of a more generalized constraint syntax (cf. [1])? We have largely been working with the SPMD execution model paradigmatic of many MPI-type programs. It would be interesting to investigate temporal models for other parallel models (e.g., OpenMP) as well. Lastly, our application kernel distills to a relatively simple set of automata. More complex examples are certainly of interest, and are on the horizon for future work.
