In this article we develop a model for applications running on multiprocessor platforms. An application is modelled by task graphs and a multiprocessor system is modelled by a number of processing elements, each capable of executing tasks according to a given scheduling discipline. We present a discrete model of computation for such systems and characterize the size of the computation tree it suffices to consider when checking for schedulability.
Introduction
Modern hardware systems are moving toward execution platforms made up of multiple programmable and dedicated processing elements implemented on a single chip, known as multiprocessor system-on-chip (MPSoC). The different parts of an embedded application execute on these processing elements, but mapping the parts of an embedded program onto the platform elements is non-trivial. First of all, there may be various and often conflicting resource constraints. Real-time constraints, for example, must be met together with constraints on the use of memory and energy. There is also a huge degree of freedom in the mapping of an application to a platform, because there are many ways to partition an embedded program into parts, many ways these parts can be assigned to processing elements, and many ways each processing element can be set up.
In this work we aim at models and tools for analysis of problems that must be considered when an application is mapped to an execution platform. Such models are called system models [25] as they comprise a model for the application executing on the platform, and the analysis of such systems is also called cross-layer analysis as it deals with problems where decisions concerning one layer of abstraction, e.g. the scheduling principle used in a processing element, have influence on the properties at another level of abstraction, e.g. whether a task misses a deadline.
To illustrate the problems we shall address, consider the simple example in Fig. 1, where the application is specified by five cyclic tasks τ 1, . . . , τ 5 and the platform by three processing elements pe 1, pe 2, pe 3. There are causal dependencies between tasks, and τ 1, for example, must finish before τ 2 can start (written τ 1 ≺ τ 2 ). We want to find the shortest period where all tasks meet their deadlines and analyze two runs corresponding to the two possible execution times for τ 1 in Fig. 2. In both runs, τ 1 and τ 3 are executing on pe 1 and pe 3, respectively, in the first time step, where no task is executing on pe 2 due to the causal dependencies. The later time steps have similar explanations. Observe that the shortest possible period is π = 4, corresponding to the case where the best-case execution time bcet τ 1 = 1 is chosen for τ 1. Thus, an analysis based on the worst-case execution time wcet τ 1 = 2 would, in this case, not lead to the worst-case scenario. This is an example of a multiprocessing timing anomaly [7] exhibiting a counterintuitive timing behavior. A locally faster execution, either by making the processor faster or by making the algorithm more efficient, may lead to an increase of the execution time of the whole system. The presence of such behavior makes multiprocessor timing analysis more difficult [27].
It is easy to check that a period π = 3 can be achieved for this application simply by changing the priorities so that τ 4 gets higher priority than τ 2. But problems cannot get much bigger than the one in Fig. 1 before the consequences of design decisions can no longer be comprehended, and tool support for the design space exploration becomes necessary [10, 24].
Related work
As indicated by the example in Fig. 1, the schedulability analysis of systems relates to the classical job-shop scheduling problem [23]. A job-shop problem is specified as follows: suppose a number of non-cyclic tasks and a number of machines (processing elements) are given. Each task τ is assigned a machine for a specified amount of time (bcet τ = wcet τ ). A job is a sequence of tasks (e.g. τ 3 ≺ τ 4 ≺ τ 5 ), and the job-shop optimization problem is the problem of finding a minimum-time schedule for a collection of jobs (two in the above example). This problem is known to be NP-hard. The problem we address can be considered a generalization of the job-shop problem: we consider cyclic tasks, and tasks can have an offset as well as best-case and worst-case execution times. Furthermore, we consider sequential components of tasks given by directed acyclic graphs (not just a linear order), and we consider a variety of preemptive scheduling disciplines. We will, however, just consider a decision version of the problem. In [2] it is shown that finding an optimal schedule for a job-shop optimization problem corresponds to finding a shortest path in a certain kind of acyclic timed automaton.
Most modelling methodologies for multiprocessor embedded systems are based on the Y-chart of system design [25] . In the Y-chart there is a clear distinction between the application and the execution platform, and there is an explicit mapping of the elements of the application onto elements of the execution platform. This mapping results in a particular system design that can be evaluated using simulation or formal analysis.
Simulation-based methodologies are still the predominant technique for performance evaluation. For example, MPARM [17] is a SystemC-based [16] framework for MPSoC architectures aimed at exploring on-chip communication networks. It is a cycle-accurate model that requires detailed knowledge of the execution platform. The detailed level allows for executing compiled software, including the real-time operating systems, on the platform model. Refs. [11, 22, 9] propose a higher-level modelling approach, viewing the application as a set of tasks executing on a real-time operating system. As embedded software plays a dominating role in defining the functionality of an embedded system, choosing the right scheduling policy has a distinct influence on the system's rate and reactivity. All three approaches are based on SystemC and target single-processor systems.
The challenges of MPSoC platforms have, for instance, been addressed in [25, 10, 24, 13, 19, 21]. In [13], the use of virtual processing units (VPU) for representing processors is proposed. Each VPU supports a priority-based scheduler that models the execution of software tasks. Tasks are modelled as communicating finite state machines, where each transition is regarded as an atomic operation taking a fixed number of processor cycles. A similar approach is used in ARTS [20, 21], where synchronization among processors and resource allocation is incorporated into the model of the real-time operating system. The designer can run simulations analyzing quantitative properties such as timing, memory and power usage, as well as communications. A tool is proposed in [24] for performance evaluation and exploration based on SystemC. The focus is multimedia applications that are modelled as Kahn process networks, and the performance is evaluated using a trace-driven simulation approach. SystemCoDesigner [10] is a SystemC-based simulation framework that supports automatic performance evaluation and design space exploration. The design space exploration in [10] is based on a multi-objective evolutionary approach. A similar approach is used for design space exploration in CHARMED [14], which focuses on periodic multi-mode embedded systems.
The advantage of simulation-based tools is that the designers, early in the development process, receive insight into the feasibility of the chosen designs before the concrete implementations are considered. However, although the simulation tool can exhibit a collection of executions of a system, it can in no way be used to guarantee that every execution is meaningful. In order to handle shortcomings of simulation-based approaches, several formal approaches have been proposed. In [26] , an approach for analyzing communication delays in message scheduling, together with optimization strategies for bus access, is presented. Thiele et al. [30] provide a real-time calculus for scheduling of preemptive tasks with static priorities and, in [28] , a formal approach based on event flow interfacing is described. A simple tool for analysis of interfacing is provided, together with algorithms for configuring a global analysis model.
In [6] , a timed-automata approach for schedulability analysis of single processor systems is provided. It appears that the computation model behind preemptive scheduling strategies is stopwatch automata, for which it is known that the reachability problem is undecidable; however, in [6] it is shown that the schedulability problem for extended timed automata is decidable for preemptive scheduling strategies. This approach is used in the Times tool [4] , which considers schedulability of tasks on a single processor only. As for the Times tool, our approach is based on a timed automata model.
In [8] , Uppaal is used in connection with verification and analysis of the legOS scheduler. We provide a more general model where different schedulers can be verified on multiprocessor platforms. Altisen and Tripakis propose in [31] an implementation methodology for transforming a timed automaton into a program, and through that, checking properties of the execution of that program. This is done by modelling the program as well as the execution platform as untimed and timed automata. However, the notion of an execution platform is basically modelled in terms of the digital clock of the platform. Many issues such as multiprocessor anomalies, which our work concerns, cannot be investigated through this methodology. In [2, 1] , the timed-automata model is proposed as a natural tool for posing and solving scheduling problems in general. However, the concept of an execution platform is not considered, and therefore issues such as the multiprocessor anomalies are not investigated.
Overview: In this paper we provide, in Section 2, a model for applications (in terms of task graphs), execution platforms (in terms of a number of processing elements each capable of executing tasks according to a specified scheduling discipline), and systems (in terms of a mapping of an application onto the execution platform). Furthermore, we present the computational model for systems (in Section 2.4) and a bound on the size of the computation tree (in Section 2.5) it suffices to consider in order to decide whether all tasks meet their deadlines. We present, in Section 3, a timed-automata [3] implementation of the model, and, in Section 4, a tool using Uppaal [5, 15] as backend, which can be used for design space exploration. Experimental results are presented in Section 4.1. Finally, the last section contains a conclusion.
Model for a system-level framework
We shall now give a model for a system consisting of an application which is running on an execution platform. An execution platform is made up of multiple programmable and dedicated processing elements (pe) and an application can be divided up into a number of tasks (τ ), each of which represents a piece of sequential code. These tasks might have timing constraints as well as causal dependencies with other tasks.
Having processing elements that can execute several different tasks and manage execution time dynamically introduces a need for a dedicated real-time operating system (os) as a layer between the application and the execution platform. The real-time operating system should manage scheduling of the different tasks as well as the dependencies tasks might have with each other.
A system-level framework should be recognized as a framework for the overall system comprising application, real-time operating system and processing elements. There are tools available for simulation of MPSoCs. One of these is ARTS [21], and the model in this section as well as the timed-automata model of the next section have their background in concepts and features of ARTS. Fig. 3 provides an overview of the different components of a system-level framework. A system in ARTS is a mapping of an application (made up of a number of tasks) onto an execution platform. Dashed arrows from tasks to processing elements in Fig. 3 represent this mapping, e.g. τ 1 is mapped to pe 1. The arrows in the task graph show dependencies between tasks, e.g. τ 1 and τ 2 must finish execution before τ 3 can start. Each processing element has its own real-time operating system. The communication network is modelled as a special processing element on which communication tasks are mapped (see [19] for further details). We shall now present a formal model for the execution of an application on an execution platform.
Application - the task model
Let a finite set T of tasks be given. Each task τ ∈ T is characterized by a period π τ ∈ N, a best-case execution time bcet τ ∈ N, and a worst-case execution time wcet τ ∈ N, where wcet τ ≥ bcet τ > 0. A task τ needs a certain amount, between bcet τ and wcet τ , of time units of a processor's time to finish its job in a given period. (This notion is formalized later when we introduce the execution model for multiprocessor platforms.) Furthermore, a task τ is characterized by an initial offset o τ ∈ N, which means that the first period of τ starts o τ time units after the system has started. Thus, the nth period of task τ , n ≥ 1, is the time interval $[\,o_\tau + (n-1)\cdot\pi_\tau,\; o_\tau + n\cdot\pi_\tau\,[$.
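The task parameters above can be captured in a small record type. The following is a minimal sketch, not part of the original model: the class name `Task`, the helper `period_interval`, and the requirement that the period is strictly positive are assumptions made for illustration. The half-open interval follows from the statement that the first period starts at the offset and each period lasts π τ time units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one task; field names are illustrative."""
    name: str
    period: int      # pi_tau
    bcet: int        # best-case execution time
    wcet: int        # worst-case execution time
    offset: int = 0  # o_tau

    def __post_init__(self):
        # The text requires wcet >= bcet > 0; a positive period is assumed here.
        if not (0 < self.bcet <= self.wcet):
            raise ValueError("require wcet >= bcet > 0")
        if self.period <= 0 or self.offset < 0:
            raise ValueError("require period > 0 and offset >= 0")


def period_interval(task, n):
    """Return (start, end) of the half-open interval [start, end[ of the n-th period."""
    if n < 1:
        raise ValueError("periods are numbered from 1")
    start = task.offset + (n - 1) * task.period
    return start, start + task.period
```

For a task with offset 3 and period 4, the first period covers the time units 3, 4, 5 and 6, and the deadline of that period coincides with the start of the second period at time 7.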
An application is modelled by a task graph G = (T , ≺), where ≺ ⊆ T × T and G is a directed, acyclic graph. An edge (τ , τ′) ∈ ≺ (also written τ ≺ τ′) represents a causal dependency, i.e. τ must finish its job before τ′ can start.
A sequential component, with task set T s ⊆ T , is a connected sub-graph of G, for which
• all tasks have the same period π Ts , i.e. π i = π j = π Ts for τ i , τ j ∈ T s , and
• the offsets of tasks are so close to each other that their first (and hence the nth) periods overlap, i.e. $|o_i - o_j| < \pi_{T_s}$ for τ i , τ j ∈ T s .
We shall, from now on, assume that G can be partitioned into sequential components
The model of the platform
A platform consists of M ≥ 1 processing elements (also called processors) PE = (pe 1, pe 2, . . . , pe M ), where each processing element pe i is capable of executing tasks according to a given scheduling principle. We will here consider the following preemptive scheduling strategies: rate-monotonic (RM), fixed-priority (FP) and earliest-deadline-first (EDF) scheduling. Extending the model with more scheduling principles poses no difficulty in principle, but it will, of course, add more details.
The system model
A system consists of a mapping m : T → {1, . . . , M} of tasks to processing elements, and a configuration Sch : PE → {RM, FP, EDF} of the scheduling principle used by each processing element. We shall often consider the set of tasks T pe i mapped to a particular processing element pe i , i.e. let $T_{pe_i} = \{\tau \in T \mid m(\tau) = i\}$.
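A system in this sense is just two finite maps, so the task sets per processing element can be obtained by inverting the mapping m. The sketch below is illustrative only: the task names, the concrete mapping and the function name `tasks_per_pe` are hypothetical and do not reproduce the example of Fig. 1.

```python
from collections import defaultdict


def tasks_per_pe(mapping):
    """Invert m : T -> {1..M} into a dict pe index -> set of tasks (the sets T_pe_i)."""
    by_pe = defaultdict(set)
    for task, pe in mapping.items():
        by_pe[pe].add(task)
    return dict(by_pe)


# Hypothetical system: five tasks on three processing elements.
m = {"t1": 1, "t2": 2, "t3": 3, "t4": 2, "t5": 2}
sch = {1: "RM", 2: "FP", 3: "EDF"}  # scheduling principle per processing element
```

Here `tasks_per_pe(m)[2]` is the set of tasks that compete for the second processing element under the fixed-priority discipline.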
Scheduling of tasks
We assume that each task has a unique number given by an injective function pr : T → {1, . . . ,card(T )}. This numbering is used to define total orderings of tasks in connection with the various scheduling principles.
A task τ has higher fixed priority than τ′, written τ > FP τ′, iff pr(τ ) < pr(τ′), i.e. a smaller number given by pr means higher priority.
For rate-monotonic scheduling, the priority relation among tasks is primarily given by their periods (shorter periods mean higher priority) and secondarily by their numbering according to pr:
A task τ has higher rate-monotonic priority than τ′, written $\tau >_{RM} \tau'$, iff $\pi_\tau < \pi_{\tau'} \lor (\pi_\tau = \pi_{\tau'} \land pr(\tau) < pr(\tau'))$.
The priority relations > FP and > RM are total orders on the set of tasks due to the unique numbering given by pr, and, therefore, there will never be a non-deterministic choice when scheduling according to the rate-monotonic and fixed-priority disciplines in the system.
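Both static relations can be expressed as sort keys, where a smaller key means higher priority. This is a minimal sketch: the function names and the dictionaries `pr` and `period` are assumptions for illustration. Because pr is injective, both keys are unique per task, which is why the choice of the highest-priority task is deterministic.

```python
def fp_key(task, pr):
    """Fixed priority: smaller pr number means higher priority."""
    return pr[task]


def rm_key(task, pr, period):
    """Rate monotonic: shorter period first, ties broken by pr."""
    return (period[task], pr[task])


def highest_priority(tasks, key):
    """Return the task with the smallest key, i.e. the highest priority."""
    return min(tasks, key=key)


# Hypothetical tasks.
period = {"a": 10, "b": 5, "c": 5}
pr = {"a": 1, "b": 3, "c": 2}
```

With these values task a wins under fixed priority, whereas task c wins under rate-monotonic scheduling because it has the shortest period and a smaller number than b.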
The relations > FP and > RM are simple since the fixed-priority and rate-monotonic principles are static scheduling principles, i.e. the ordering relations among tasks are independent of the current point in time. For earliest deadline first this is different as, for a given time point t, the task with the highest priority is that with the shortest distance to its nearest deadline, and hence the priority relation between tasks may change when some new period starts during the execution.
Let t ∈ N and τ ∈ T . By dist τ (t) we denote the distance from t to the nearest deadline (i.e. the start of a new period) of τ :
for the smallest n ∈ N + so that p > t
Notice that we have a simple setting where the deadline for a task is the same as the start point of its next period. The model is easily extended to cope with deadlines that are earlier than the start of the next period.
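A task τ has higher earliest-deadline-first priority than τ′ at time t ∈ N, written $\tau >^t_{EDF} \tau'$, iff $dist_\tau(t) < dist_{\tau'}(t) \lor (dist_\tau(t) = dist_{\tau'}(t) \land pr(\tau) < pr(\tau'))$.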
The priority relation > t EDF is also a total order, and observe that when no task has a deadline between two time points t 1 and t 2 , then > t EDF = > t′ EDF for every t, t′ where t 1 < t, t′ < t 2 . Thus, the earliest-deadline-first priorities can change only at time points where some new period for a task starts.
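The dynamic relation can be sketched in the same style as the static ones. Two points below are assumptions rather than statements from the text: the nearest deadline is taken to be o τ + n · π τ for the smallest n ≥ 1 that lies strictly after t, and ties in distance are broken by the numbering pr, as for rate-monotonic scheduling. The function names and example values are hypothetical.

```python
def dist(offset, period, t):
    """Distance from t to the nearest deadline strictly after t (assumed definition)."""
    if t < offset:
        # Before the first period starts, the first deadline is offset + period.
        return offset + period - t
    return period - (t - offset) % period


def edf_key(task, t, offset, period, pr):
    """Earliest deadline first: smaller distance first, ties broken by pr (assumed)."""
    return (dist(offset[task], period[task], t), pr[task])


# Hypothetical tasks.
offset = {"a": 0, "b": 0}
period = {"a": 4, "b": 6}
pr = {"a": 2, "b": 1}


def edf_winner(t):
    return min(offset, key=lambda task: edf_key(task, t, offset, period, pr))
```

The winner changes only at period starts: a is preferred at time 0, b at time 4 after a new period of a has started, and at time 8 the distances are equal so the numbering decides.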
The model of computation
To model the computations of a system, the notion of a state, which is a snapshot of the state of affairs of the individual processing elements, is introduced. Consider a processing element pe i . For that processing element the state component must record which task in T pe i is currently executing, where we denote by ⊥ that no task is currently executing on pe i . Furthermore, for every task τ ∈ T pe i , the state component also records the execution time e i (τ ) ∈ {bcet τ , bcet τ + 1, . . . , wcet τ − 1, wcet τ } that is needed by τ to finish its job in the current period. We call e i an execution vector for pe i . Each time a new period for τ starts, a non-deterministic choice is taken concerning the execution time for τ in that period. A trace (or history) is a finite sequence of states:
$tr = \sigma_1 \sigma_2 \cdots \sigma_k$
where k ≥ 0 is the length of tr, denoted by length(tr). A trace with length k describes a system behavior in the interval [0,k[. Consider a task τ ∈ T . The completed periods of τ in a trace tr is the set
and the current period number of τ in tr is
The nth period, n ≥ 1, of τ in tr = σ 1 σ 2 · · · σ k is the (possibly empty) sub-sequence of tr:
Notice that tr(τ ,n) consists of π τ states just when n ∈ Completed tr (τ ) and fewer (possibly no) states otherwise.
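The execution time exec σ (τ ) of a task τ ∈ T pe i in a state σ = ((s 1 , e 1 ), . . . , (s M , e M )), where s i is the task executing on pe i (or ⊥) and e i is the execution vector for pe i , is
$exec_\sigma(\tau) = 1$ if $s_i = \tau$, and $exec_\sigma(\tau) = 0$ otherwise.
This notion extends to a finite sequence of states as follows:
$exec_{\sigma_1 \cdots \sigma_k}(\tau) = \sum_{j=1}^{k} exec_{\sigma_j}(\tau)$
We denote by Finished tr (τ ,n) ∈ Bool whether τ ∈ T pe i has finished its job in the nth period, n ≥ 1, in the trace tr: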
where e i is the execution vector for processing element pe i in the last system state of tr, or the zero-execution vector 0 i , where 0 i (τ ) = 0 for τ ∈ T pe i , if the trace is empty. We shall now define the set of possible successor component states for a processing element pe i given a trace tr with length k. Let e i be the execution vector for pe i in the last state in tr, or the zero-execution vector if the trace is empty. For each τ ∈ T pe i , there are the following choices for the value e′ i (τ ) of the execution vector for the next state component of pe i :
• e′ i (τ ) = 0 if k < o τ , i.e. the first period for τ has not started yet,
• e′ i (τ ) ∈ {bcet τ , . . . , wcet τ } if k = o τ + n · π τ for some n ∈ N, i.e. if a new period for τ starts at time k, then one of the possible execution times for τ is chosen, and
• e′ i (τ ) = e i (τ ) otherwise, i.e. the execution time chosen at the start of the current period is kept.
Let EV tr (i) be the (finite) set of all possible execution vectors for the next state component of pe i given tr. For a given e′ i ∈ EV tr (i), a task τ ∈ T pe i is enabled given tr and e′ i , if
• τ is not yet finished in its current period n, and
• every task τ′ on which τ depends, i.e. τ′ ≺ τ , is finished in the nth period, i.e. Finished tr (τ′ ,n) = true.
Let Enable tr (e′ i , i) ⊆ T pe i be the set of enabled tasks on pe i given tr and e′ i . The enabled task (if any) with the highest priority is selected to execute on the processing element:
where k = length(tr) and the priority relations > FP and > RM are extended to the time domain in the trivial way: $>^t_{FP} \,=\, >_{FP}$ and $>^t_{RM} \,=\, >_{RM}$ for every t ∈ N.
The set of possible next component states for pe i is defined by
and the set of possible next system states is defined by
The computation tree for a system is a finitely branching, infinite tree, where the root is the empty trace and the internal nodes are (labelled by) system states. Furthermore,
• there is an edge from the root to a distinct node for each system state in the set of possible next system states for the empty trace, and
• for every internal node node in the tree, let tr node be the trace of system states leading from the root to node. For every system state σ in the set of possible next system states for tr node there is an edge from node to a distinct node labelled by σ .
A run of the system is an infinite sequence of states
$\rho = \sigma_1 \sigma_2 \sigma_3 \cdots$
occurring on some path starting from the root of the computation tree. Notice that when the worst and best case execution times are equal for every task in the system, then there is exactly one run of the system, as the scheduling is deterministic.
We call a system schedulable if for every run, each task finishes its job in all its periods. We shall now show that this problem is decidable, and in the next section we shall use Uppaal to solve it.
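The discrete model of computation can be illustrated by simulating one path of the computation tree. The sketch below is a simplification made under stated assumptions, not the construction used in the paper: every task uses the same fixed execution time `need` in all periods, so only a single run is explored instead of all non-deterministic choices; a dependency is considered satisfied when the predecessor has finished the period with the same number; and EDF ties are broken by pr. All identifiers are hypothetical.

```python
def first_missed_deadline(tasks, m, sch, pr, deps, need, horizon):
    """Simulate one run in unit time steps.

    tasks: name -> (offset, period); m: name -> processing element;
    sch: processing element -> "RM" | "FP" | "EDF"; pr: name -> unique number;
    deps: set of (before, after) pairs; need: name -> fixed execution time.
    Returns (task, time) for the first deadline missed at a time <= horizon, else None.
    """
    done = {t: 0 for t in tasks}       # time units executed in the current period
    period_no = {t: 0 for t in tasks}  # 0 means the first period has not started

    def finished(t, n):
        # t has finished period n if it is already past it or has executed enough.
        return period_no[t] > n or (period_no[t] == n and done[t] == need[t])

    def key(t, k):
        offset, period = tasks[t]
        if sch[m[t]] == "FP":
            return (pr[t],)
        if sch[m[t]] == "RM":
            return (period, pr[t])
        # EDF: distance to the next period start; t has started, so k >= offset.
        return (period - (k - offset) % period, pr[t])

    for k in range(horizon + 1):
        # Period starts at time k: check the deadline of the previous period.
        for t in sorted(tasks, key=pr.get):
            offset, period = tasks[t]
            if k >= offset and (k - offset) % period == 0:
                if period_no[t] > 0 and done[t] < need[t]:
                    return (t, k)
                period_no[t] += 1
                done[t] = 0
        if k == horizon:
            break
        # Select the enabled task with the highest priority on each element.
        running = {}
        for t in tasks:
            n = period_no[t]
            enabled = (n > 0 and done[t] < need[t]
                       and all(finished(a, n) for (a, b) in deps if b == t))
            if enabled:
                pe = m[t]
                if pe not in running or key(t, k) < key(running[pe], k):
                    running[pe] = t
        # All processing elements execute their selected task for one time unit.
        for t in running.values():
            done[t] += 1
    return None
```

As a usage example, two tasks with periods 4 and 6 and execution times 2 and 3 on a single processing element fully utilize it. In this sketch the run meets all deadlines up to time 12 under EDF, while under RM the task with period 6 misses its first deadline at time 6.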
Decidability
We now consider the problem of determining schedulability of a system and we will give an upper bound on the size of the part of the computation tree it suffices to consider when checking for schedulability.
We first establish a periodic behavior of the priority relations among tasks. The two relations > t FP and > t RM for the static scheduling principles are trivial as they are, in fact, static. We just need to consider the priorities with regard to the earliest-deadline-first principle.
To this end, let a situation at time t, sit t , t ∈ N, be defined by the k-tuple:
$sit_t = (dist_{\tau_1}(t), \ldots, dist_{\tau_k}(t))$
where k = card(T ) and the numbering is given by pr. For a given time t, the earliest-deadline-first priorities can easily be extracted from sit t (it would actually suffice just to consider situations for tasks that are mapped to processing elements with an earliest-deadline-first discipline).
Consider an arbitrary task τ . It is easy to see that dist τ satisfies the following properties:
Therefore, for every t,c ∈ N:
$dist_\tau(o_\tau + t) = dist_\tau(o_\tau + t + c \cdot \pi_\tau)$ (3)
Hence, when the time of the offset for τ has passed, dist τ becomes periodic with period π τ and, hence, any multiple of π τ is a period as well. This generalizes to situations in the following way. Let O M = max {o τ 1 , . . . , o τ k } be the maximal offset in the system, and let H = lcm {π τ 1 , . . . , π τ k } be the hyper-period for the tasks, i.e. the least common multiple of all periods of tasks in the system. Since the period of any task in the system is a divisor of the hyper-period, we have, using (3), the following periodic property:
$sit_{O_M + t} = sit_{O_M + t + c \cdot H}$ (5)
for every t,c ∈ N.
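Hence, the infinite sequence of situations
$sit_0\, sit_1\, sit_2 \cdots$
has, using (5), the following form:
$sit_0 \cdots sit_{O_M - 1}\, (sit_{O_M} \cdots sit_{O_M + H - 1})^{\omega}$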
Therefore, for any time t ≥ O M + H , we meet situations (and scheduling priorities) that we have seen earlier. This does not mean, however, that it suffices to consider the computation tree to a depth O M + H . The problem is that the execution times may not have "stabilized" yet at time O M + H , as the example in Fig. 4 shows. Even though the utilization U = 1/3 + 1/3 + 2/3 = 4/3 > 1 and the system obviously is not schedulable, the first deadline is missed three hyper-periods after the maximal offset. Consider a run ρ = σ 1 σ 2 σ 3 · · · where, for every task, the same execution time is chosen throughout the run, i.e. for τ ∈ T pe i and j,k ≥ o τ we have that e i (τ ) = e′ i (τ ), where σ j = ((s 1 , e 1 ), . . . , (s i , e i ), . . . , (s M , e M )) and σ k = ((s′ 1 , e′ 1 ), . . . , (s′ i , e′ i ), . . . , (s′ M , e′ M )).
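The periodicity of situations from the maximal offset onwards can be checked numerically. The sketch below is illustrative only: the task parameters are hypothetical, and `dist` uses the assumed reading that the nearest deadline is the first period boundary o τ + n · π τ , n ≥ 1, strictly after t. The utilization helper merely reproduces the sum quoted in the text with hypothetical (execution time, period) pairs and does not claim to be the parameters of Fig. 4.

```python
from fractions import Fraction
from math import lcm  # available from Python 3.9


def dist(offset, period, t):
    """Distance from t to the nearest deadline strictly after t (assumed definition)."""
    if t < offset:
        return offset + period - t
    return period - (t - offset) % period


def situation(tasks, t):
    """sit_t as a tuple of distances, tasks ordered by their numbering pr."""
    return tuple(dist(o, p, t) for (o, p) in tasks)


def utilization(loads):
    """Sum of execution time divided by period over (execution time, period) pairs."""
    return sum(Fraction(c, p) for (c, p) in loads)


# Hypothetical (offset, period) pairs.
tasks = [(0, 4), (3, 6), (1, 3)]
O_M = max(o for (o, _) in tasks)      # maximal offset
H = lcm(*(p for (_, p) in tasks))     # hyper-period

periodic = all(
    situation(tasks, O_M + t) == situation(tasks, O_M + t + c * H)
    for t in range(H)
    for c in range(1, 4)
)
```

The check covers only the scheduling priorities. As the surrounding text points out, the executed time per task may still differ between hyper-periods, which is why a depth of O M + H is not sufficient in general.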
For a given t ∈ N, let ρ t be the prefix of ρ up to time t and exec ρ (τ ,t) be the execution time of τ in the current period up to time t in ρ t , i.e.
Consider a time t ≥ o τ . At time t, some tasks may not have started yet and, therefore, τ may have been granted more execution time in its current period at time t than one hyper-period later at time t + H , i.e.
$exec_\rho(\tau, t) \geq exec_\rho(\tau, t + H)$
where we exploit that some task is executing on a processing element whenever there is an enabled task on that element, i.e. execution time is not wasted. When a time t p ≥ O M is reached where for every task τ ∈ T :
$exec_\rho(\tau, t_p) = exec_\rho(\tau, t_p + H)$
then ρ is periodic from time t p . Thus, when searching the computation tree to check for schedulability, it suffices to search to the depth:
$O_M + H \cdot (1 + \sum_{\tau \in T} wcet_\tau)$
A missed deadline that occurs deeper in the computation tree will also occur at a depth closer to the root, but possibly on another path. Let us examine the number of nodes at the following two depths, $O_M + H$ and $O_M + H \cdot (1 + \sum_{\tau \in T} wcet_\tau)$, in the computation tree. These numbers are the minimum and maximum number of paths in the tree that need examination in order to check for schedulability. For a given task τ , we let
• periodicChoices τ denote the number of non-deterministic choices for execution time in each period for τ , i.e. $periodicChoices_\tau = wcet_\tau - bcet_\tau + 1$,
• initialChoices τ denote the total number of non-deterministic choices for the execution time of τ until O M , i.e.
• hyperChoices τ denote the number of non-deterministic choices for the execution time of τ in a hyper-period, i.e. $hyperChoices_\tau = periodicChoices_\tau^{\,H/\pi_\tau}$.
Hence the total number of non-deterministic choices for the system in the time interval $[0, O_M + H[$ is
$\prod_{\tau \in T} initialChoices_\tau \cdot hyperChoices_\tau$
which equals the number of nodes in the computation tree at depth O M + H , i.e. the minimum depth in the computation tree to check before a system can be deemed schedulable. Unschedulability may be determined earlier.
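The total number of non-deterministic choices for the system in the time interval $[0, O_M + H \cdot (1 + \sum_{\tau \in T} wcet_\tau)[$ is
$\prod_{\tau \in T} initialChoices_\tau \cdot hyperChoices_\tau^{\,1 + \sum_{\tau' \in T} wcet_{\tau'}}$
which equals the number of nodes in the computation tree at depth $O_M + H \cdot (1 + \sum_{\tau \in T} wcet_\tau)$, i.e. the depth in the computation tree it suffices to search when checking for schedulability.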
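The depth bound and the two node counts are simple arithmetic on the task parameters and can be sketched as follows. Several details are assumptions: initialChoices is taken to be one independent choice per period of the task that starts before O M , hyperChoices is taken to be one independent choice per period within a hyper-period, and the function names and the two example tasks are hypothetical.

```python
from math import lcm, prod  # lcm is available from Python 3.9

# Each task is a tuple (offset, period, bcet, wcet).


def periodic_choices(task):
    _, _, bcet, wcet = task
    return wcet - bcet + 1


def bounds(tasks):
    """Return (O_M, H, depth bound O_M + H * (1 + sum of wcet))."""
    O_M = max(o for (o, _, _, _) in tasks)
    H = lcm(*(p for (_, p, _, _) in tasks))
    return O_M, H, O_M + H * (1 + sum(w for (_, _, _, w) in tasks))


def initial_choices(task, O_M):
    # Assumption: one choice for every period of the task starting in [0, O_M[.
    offset, period, _, _ = task
    starts = -(-(O_M - offset) // period)  # ceiling division
    return periodic_choices(task) ** starts


def hyper_choices(task, H):
    # Assumption: one choice for each of the H / period periods in a hyper-period.
    _, period, _, _ = task
    return periodic_choices(task) ** (H // period)


def nodes_at_min_depth(tasks):
    O_M, H, _ = bounds(tasks)
    return prod(initial_choices(t, O_M) * hyper_choices(t, H) for t in tasks)


def nodes_at_max_depth(tasks):
    O_M, H, _ = bounds(tasks)
    reps = 1 + sum(w for (_, _, _, w) in tasks)
    return prod(initial_choices(t, O_M) * hyper_choices(t, H) ** reps for t in tasks)


# Hypothetical system with two tasks.
example = [(0, 2, 1, 2), (2, 4, 1, 3)]
```

Even for this tiny hypothetical system the count grows from 24 nodes at depth O M + H to several million at the depth bound, which illustrates why the explicit computation tree is not a practical basis for an implementation.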
A small example showing a high complexity
It is easy to give a small system with a huge number of states in the computation tree up to the depth O M + H . Consider, for example, the system with three tasks given in 
A timed-automata model for MPSoC
Having a decidability result for schedulability, the next question is to get an implementation of the decision algorithm. One way would be to construct a program directly on the basis of the computation tree and the results from Section 2.5. We will, however, take another approach as we aim at a more general framework supporting both simulation and verification of systems, and, therefore, we will build a model that comprises the components of MPSoCs and exhibits the parallel activities occurring in such systems. We will aim at a model comprising the concepts supported by the ARTS framework [20] and our approach is to develop a model of systems using a model-checking tool. Since the model of computation for systems is discrete, we could use a system like SPIN [12] . But the ARTS framework has notions that are naturally modelled by clocks and timed automata [3] , and, therefore, we will use the Uppaal [5, 15] system for modelling, verification and simulation. We will not give introductions to timed automata and the Uppaal system in this paper. For such introductions we refer to [3, 5, 15] .
As the current version of Uppaal does not support stopwatches, and we are dealing with preemptive scheduling (where the running time of a task can be temporarily stopped), we make the execution time in the Uppaal model discrete. The clocks handling the periodicity and the dynamically updated scheduling criteria still operate continuously. In a SPIN model the periodicity and the dynamically updated scheduling criteria would be handled in discrete steps. Thus, getting rid of the clocks has a cost. Whether a SPIN model would give more efficient verification is a question that a later experiment must answer. At this point we have chosen to use Uppaal as this will give a more natural model. The Uppaal model is not a direct implementation of the model of computation. However, the notions and general structure introduced are recognizable in the Uppaal model, and we explain the Uppaal model by relating it to the model of computation.
We will give a top-down presentation of the timed-automata model, where a system is described as a parallel composition of an application and an execution platform:
$System = Application \parallel ExecutionPlatform$
$Application = \parallel_{\tau \in T} TA(\tau)$
$ExecutionPlatform = (\parallel_{j=1}^{M} TA(pe_j)) \parallel DynamicPriorities$
where ∥ denotes parallel composition of timed automata. Thus, an application consists of a collection of timed automata for tasks combined in parallel, and an execution platform consists of a parallel composition of timed automata for processing elements combined in parallel with a single timed automaton handling the dynamic priorities. The timed automaton DynamicPriorities works for all processing elements that have earliest deadline first as scheduling discipline. From a modelling point of view it would be more natural to have one such automaton for each processing element; from a verification viewpoint, the global timed automaton is preferred in order to reduce the number of clocks needed for this purpose to just one. The definitions of TA(τ ), TA(pe j ) and DynamicPriorities are given in Sections 3.1, 3.2 and 3.2.4, respectively.
The timed automata comprising a system communicate by synchronous communication and shared variables. We give an overview of the channel communication in Fig. 6 . Suppose that the set of tasks {τ j 1 , . . . ,τ j k } = T pe j is executed on the processing element pe j . The timed automata for these k tasks and the processing element pe j share four channels ready j , run j , preempt j and finish j . The model will be constructed so that all channel events will occur at integer time points only, and the correctness of the construction relies on that property. 
Template for a task
We present the timed automaton TA(τ ) for a task τ ∈ T pe j (tau) with offset o (offset), period π (pi), and best-case and worst-case execution times bcet (bcet) and wcet (wcet), respectively, in two steps. In this Uppaal model offset and pi are global arrays with an entry for each task τ . In the same way there are four channel arrays ready, run, preempt and finish each with an entry for every processing element pe.
Simplified template for a task
A first simplified template can be seen in Fig. 7 together with a table explaining the identifiers. We will focus on timing and channel synchronization in this part.
A clock cp is used to handle the offset and the periodicity of the task. The timed automaton stays in the Start location until the offset time has elapsed. In the remaining locations, the task has entered the periodic behavior, or the "error-location" missedDeadline.
In the Done location, the task has finished its job in the current period and is awaiting the start (when cp == pi) of the next, where the transition from Done to Released is taken. This is signalled to the processing element pe j using the channel ready j . In this transition the clock cp and other variables are initialized. In particular, a choice is taken setting e non-deterministically to a value in {bcet,bcet + 1, . . . ,wcet}. This corresponds to the choices for the execution vector in the computation tree when a new period for τ starts. Furthermore, the integer variable cr holds the execution time for τ in the current period and is initialized to 0.
In the "standard version" of Uppaal clocks cannot be stopped. Since we have preemptive scheduling, we cannot use a clock to hold the execution time for τ in the current period. We therefore have to introduce the integer variable cr and update it in discrete steps as described in the following.
The transition from Released to Running takes place when the processing element pe j sends a signal on the run j channel to the task. This transition takes place when the task is enabled and has highest priority among the enabled tasks on pe j . On this transition, the clock x, which is used to record one time unit, is set to 0. In one round from state Running via Discrete and back to Running the effect is that one time unit has elapsed and that the running time cr is incremented by 1 (as the previous run j signal is issued at an integer time point where cp < pi). Notice that the committed location is introduced as Uppaal does not allow disjunctions in guards.
Transitions from Released and Discrete to missedDeadline are taken when a deadline is missed. Furthermore, a transition from Running to Done is taken when the task has finished its job in the current period (cr == e), and this is signalled to the processing element pe j by a finish j event. Since cr is not a clock, we ensure that the transition is taken without delay by making the finish channels urgent. In the Running location, the task may be preempted when a task on pe j with higher priority becomes enabled. In this case the processing element issues a preempt j signal and the task makes a transition to the Released location.
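The location structure of the simplified task template, as far as it is described in the text, can be summarized as a transition table. The sketch below is not the Uppaal template of Fig. 7: clocks, guards and channel synchronization are abstracted into named events, the event names are hypothetical, the execution time e is fixed instead of being chosen non-deterministically at each release, and both the edge from Start to Released and the point at which cr is incremented are assumptions.

```python
# (location, event) -> next location; event names are illustrative only.
TRANSITIONS = {
    ("Start", "offset_elapsed"): "Released",   # assumed target of the Start edge
    ("Done", "period_start"): "Released",      # cp == pi, signalled on ready
    ("Released", "run"): "Running",            # run signal from the processing element
    ("Released", "deadline"): "missedDeadline",
    ("Running", "unit_elapsed"): "Discrete",   # one time unit measured by clock x
    ("Discrete", "continue"): "Running",
    ("Discrete", "deadline"): "missedDeadline",
    ("Running", "finish"): "Done",             # requires cr == e
    ("Running", "preempt"): "Released",
}


class TaskAutomaton:
    """Discrete sketch of one task: tracks the location and the counter cr."""

    def __init__(self, e):
        self.location = "Start"
        self.e = e    # execution time needed in the current period (fixed here)
        self.cr = 0   # execution time received so far in the current period

    def fire(self, event):
        target = TRANSITIONS.get((self.location, event))
        if target is None:
            raise ValueError(f"no transition for {event!r} in {self.location}")
        if event == "finish" and self.cr != self.e:
            raise ValueError("finish requires cr == e")
        if event in ("offset_elapsed", "period_start"):
            self.cr = 0        # a new period starts
        if event == "unit_elapsed":
            self.cr += 1       # one more time unit of execution (assumed placement)
        self.location = target
```

A task that needs two time units and is preempted once would pass through the events offset_elapsed, run, unit_elapsed, continue, preempt, run, unit_elapsed, continue and finish, ending in Done with cr equal to 2.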
Full template for a task
The full timed-automaton template for a task can be seen in Fig. 8, and Fig. 9 provides a table with explanations for the introduced identifiers. The added information concerns dependencies, value transfer in channel communications and locking mechanisms. We will briefly discuss the main parts.
Dependencies. In order to manage dependencies, two global Boolean matrices depend and origdep are used, where origdep represents the static task graph ≺, and depend the dependencies at the current time point. Initially origdep and depend are equal, and if τ i ≺ τ j and τ i has finished its job in the current period at the current time, then the entry for (τ i ,τ j ) in depend is false and we say that that particular dependency is resolved. The procedure setOrigDep updates depend each time a task enters the Released location. It is easy to implement this procedure, as the needed information is available in the variables origdep, Finished and offset.
Fig. 9. Explanation of some of the identifiers used in Fig. 8.
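The bookkeeping with origdep and depend can be illustrated with a minimal sketch. The function names below are hypothetical, and the restore step is only one plausible reading of setOrigDep: it ignores Finished and offset, whose exact role is not spelled out here, and simply re-establishes the outgoing dependencies of a task when that task starts a new period.

```python
def make_depend(origdep):
    """Initially depend equals origdep (copied so origdep stays static)."""
    return [row[:] for row in origdep]


def resolve_on_finish(depend, i):
    """Task i has finished: every dependency (i, j) becomes resolved (False)."""
    for j in range(len(depend[i])):
        depend[i][j] = False


def restore_on_release(depend, origdep, i):
    """Assumption: when task i is released again, its outgoing edges are restored."""
    depend[i] = origdep[i][:]


def is_unblocked(depend, j):
    """Task j may run only if no unresolved dependency (i, j) remains."""
    return not any(row[j] for row in depend)


# Two tasks with tau_0 preceding tau_1 (indices are 0-based in this sketch).
origdep = [[False, True],
           [False, False]]
depend = make_depend(origdep)
print(is_unblocked(depend, 1))   # False: tau_1 still waits for tau_0
resolve_on_finish(depend, 0)
print(is_unblocked(depend, 1))   # True: the dependency is resolved
```

Keeping origdep immutable and copying rows into depend mirrors the split between the static task graph and the dependencies at the current time point.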
Value transfer. Communication between the platform and the application is done over the synchronization channels ready, run, preempt and finish. Some of this communication requires value transfer in order to identify the task in question (as all tasks on a given processing element share the same channels). The global array tauid is used so that tauid[j-1] handles the communication on pe j .
Locking mechanisms. In order to avoid some undesired behavior, two locking mechanisms are used. One locking mechanism (implemented by the array disc and the Boolean function lockT) makes sure that all tasks are able to react (i.e. they are in the location Running) when a preempt signal can occur. The other locking mechanism (implemented by the array taskFinishing and the Boolean function lockPE) makes sure that all finish signals are reacted to by the platform before any ready can be handled. This is done in order not to preempt a task that has actually just finished.
Timed-automata model for processing elements
The model for a processing element is described by a parallel composition of a controller, a synchronizer and a scheduler:
$\mathit{TA}(pe_j) = \mathit{Controller}_j \parallel \mathit{Synchronizer}_j \parallel \mathit{Scheduler}_j$
where the controller coordinates the activities inside the processing element (i.e. with the tasks, the synchronizer and the scheduler) and it coordinates the interaction between the processing elements, which is needed since dependent tasks may be mapped to different processing elements. The synchronizer maintains a consistent view of the dynamic dependency relation between tasks, and the scheduler implements the scheduling discipline.
The communication internally on each processing element and between the different processing elements is shown in Fig. 10 in terms of communication between pe j and pe l .
Template for the controller
A timed-automaton template for a controller is given in Fig. 11 and the identifiers introduced are explained in Fig. 12 . The main idea is that the controller, in the initial location Idle, can react to reschedule signals from other controllers and to ready and finish signals from its own tasks. On such events it invokes its synchronizer and scheduler on the synchronize and schedule channels as needed, and before reentering the Idle location it issues the necessary preempt, run and reschedule signals. Notice that time does not progress in one round from the Idle location and back, as all other locations of the controller are committed or urgent. The single urgent location is used to capture all finish signals available at that point in time. Each controller maintains arrays describing the tasks that are currently released, enabled and executing, and on the basis of this information it is not difficult to implement the procedure opdDep, which updates the dynamic dependencies due to a finish signal. Notice that the flag rescheduleNeeded is set by the synchronizer and the variable selected is set by the scheduler.
Template for the synchronizer
The template for the synchronizer is given in Fig. 13, and Fig. 14 gives an explanation of the identifiers introduced. When receiving a synchronize signal from the controller, the synchronizer updates the dynamic dependencies caused by the currently released and finished tasks, using the procedures syncReleased and syncFinish, respectively. These procedures, which maintain the arrays WaitDep, Enabled and Released as well as the flag depCh, are easily implemented. In the transition from the Synchronizing location to the Idle location, the flag rescheduleNeeded is set when a dependency has changed (i.e. depCh is true).
Template for the scheduler
The template for the scheduler is shown in Fig. 15, and Fig. 16 provides an explanation for the identifiers introduced. The procedure tasksReadyOnProc initiates the scheduling decision by finding a task that is enabled, setting selected accordingly, and setting aTaskIsReady to true. If no tasks on the processing element are enabled, aTaskIsReady is set to false.
The procedure findHighestPriority finds the enabled task with the highest priority on the processing element based on the scheduling discipline given in Sch and sets selected to the number of that task. The cases for the static scheduling disciplines (fixed priority and rate monotonic in our case) are straightforward implementations based on the definitions in Section 2.3.1. The difficult part, concerning earliest-deadline-first scheduling, is dealt with in the next section.
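For the two static disciplines, the selection can be sketched as follows. This is an illustration rather than the procedure from the template: the function signature, the discipline tags "FP" and "RM", the convention that a smaller number in `pr` means a higher priority, and the tie-break on the lowest task index are all assumptions; the definitions in Section 2.3.1 take precedence.

```python
def find_highest_priority(enabled, sch, pr, period):
    """Return the index of the selected task, or None if no task is enabled.

    enabled -- list of Booleans, one per task on the processing element
    sch     -- "FP" (fixed priority) or "RM" (rate monotonic); hypothetical tags
    pr      -- fixed priorities; smaller value = higher priority (assumption)
    period  -- task periods; under RM a shorter period means higher priority
    """
    candidates = [t for t, is_enabled in enumerate(enabled) if is_enabled]
    if not candidates:
        return None
    if sch == "FP":
        key = lambda t: pr[t]
    elif sch == "RM":
        key = lambda t: period[t]
    else:
        raise ValueError(f"unsupported static discipline: {sch}")
    # min() keeps the first minimal element, so ties go to the lowest index.
    return min(candidates, key=key)


enabled = [True, False, True]
print(find_highest_priority(enabled, "FP", pr=[2, 1, 3], period=[10, 5, 4]))
print(find_highest_priority(enabled, "RM", pr=[2, 1, 3], period=[10, 5, 4]))
```

With these inputs the fixed-priority call selects task 0 (task 1 has the best priority but is not enabled), while the rate-monotonic call selects task 2, which has the shortest period among the enabled tasks.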
Template for dynamic priorities
The template for managing dynamically updated priorities is shown in Fig. 17, and Fig. 18 provides an explanation for the identifiers introduced. This timed automaton works for all processing elements having earliest deadline first as scheduling discipline.
The construction of this timed automaton is based on the property (6), which describes the cyclic behavior of situations, where each situation corresponds to an earliest-deadline-first priority list for the tasks. It is, of course, only necessary to consider tasks that are scheduled according to the earliest-deadline-first discipline. The self-loop for the location OffsetPriorities covers the parts until the time of the maximal offset, and the self-loop for the location Priorities covers the cyclic part of (6). At each transition a reschedule signal is broadcast to all processing elements (i.e. all controllers).
The hard part is the construction of values for the variables OffSteps, Steps, OffPrios and Prios. But these can be generated by a program traversing the sequence of situations until the end of the first hyper-period. The idea is simple. Start This process stops when the situation for the maximal offset is reached, and then a similar process for Steps and Prios can start.
Verification of MPSoC Using UPPAAL
The representation of a system model in the Uppaal system can be used for simulation as well as for verification. We focus here on the verification aspect for the schedulability problem only, where we verify whether every task meets its deadline in every period. This is not the case if there is some state on some path where the Boolean variable missedDeadline (see Fig. 8) is true. This is expressed by the Uppaal property
E<>missedDeadline,
and every task meets every deadline iff this property is not satisfied.
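A property of the form E<> p holds when some state satisfying p is reachable from the initial state. The following minimal sketch checks this on an explicit, finite state graph and returns a witness path, much like a diagnostic trace. The graph, the state names and the function `exists_eventually` are hypothetical illustrations; they do not reflect how Uppaal explores the symbolic state space of a network of timed automata.

```python
from collections import deque


def exists_eventually(initial, successors, prop):
    """Breadth-first search for a state satisfying `prop`.

    Returns a shortest path from `initial` to such a state, or None if no
    reachable state satisfies `prop` (i.e. the E<> property does not hold).
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if prop(state):
            path = []
            while state is not None:      # walk back to the initial state
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:         # visit every state at most once
                parent[nxt] = state
                queue.append(nxt)
    return None


# Toy graph loosely named after the task locations; purely illustrative.
graph = {
    "Done": ["Released"],
    "Released": ["Running", "missedDeadline"],
    "Running": ["Released", "Done"],
    "missedDeadline": [],
}
witness = exists_eventually("Done", lambda s: graph[s],
                            lambda s: s == "missedDeadline")
print(witness)
```

In this toy graph the search returns a witness, so the analogue of the query is satisfied and the toy system would be reported as not schedulable; a result of None corresponds to every deadline being met.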
The Uppaal system can generate a diagnostic trace for such queries and this is a very useful facility for understanding the scenario where a deadline is missed. It should be noted that this trace is expressed in terms of the timed-automata model and not in the context of the overall MPSoC model in Section 2.
We have tested the Uppaal model on a collection of examples found in the literature and the verifications gave correct results.
A tool has been developed for the generation of a Uppaal model on the basis of a simple description of a system model. The Uppaal model is generated from a Java program. This program contains constructors for the components of a system, i.e. for tasks, applications, processors and platforms. Fig. 20 contains the Java program corresponding to the example in Fig. 19.
The system consists of two processors with two tasks on each. Rate-monotonic scheduling is used on both processors. The system has an inter-processor dependency from τ 2 to τ 3 and an offset on τ 4 . For every task, the best-case execution time equals the worst-case execution time. The system is not schedulable and the following trace is extracted by the tool from the diagnostic Uppaal trace:
Task: 1 1100110
Task: 2 0011001
Task: 3 000000x
Task: 4 ----111
The trace covers the first seven time units and the missed deadline is marked with x. Furthermore, the symbol - marks that the task has not yet started, 0 marks that the task is not executing and 1 marks that the task is executing. Hence, task τ 3 misses its deadline after 6 time units.
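A trace in this textual form is easy to post-process. The sketch below is not part of the tool; the helper names `parse_trace`, `first_miss` and `executed_units` are hypothetical, and the trace string is copied from the example above.

```python
TRACE = """\
Task: 1 1100110
Task: 2 0011001
Task: 3 000000x
Task: 4 ----111
"""


def parse_trace(text):
    """Map each task number to its string of per-time-unit symbols."""
    rows = {}
    for line in text.strip().splitlines():
        _label, number, marks = line.split()
        rows[int(number)] = marks
    return rows


def first_miss(rows):
    """Return (time, task) of the earliest 'x' in the trace, or None."""
    misses = [(marks.index("x"), task)
              for task, marks in rows.items() if "x" in marks]
    return min(misses) if misses else None


def executed_units(rows):
    """Count the time units in which each task was executing ('1')."""
    return {task: marks.count("1") for task, marks in rows.items()}


rows = parse_trace(TRACE)
print(first_miss(rows))       # (6, 3): task 3 misses after 6 time units
print(executed_units(rows))   # {1: 4, 2: 3, 3: 0, 4: 3}
```

The position of the x is zero-based, so the value 6 matches the statement that τ 3 misses its deadline after 6 time units, and the count of 0 for task 3 shows that it never got to execute in the traced interval.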
It is easy to explore various designs using this tool and, for example, changing the real-time operating system on pe 2 to use earliest deadline first will give a schedulable system. The only change needed in the Java code concerns the line initializing p2, as described in the comment. Another observation, which is easily verified, is that if the offset for τ 4 is set to 0, the system is schedulable both when rate-monotonic scheduling is used and when earliest deadline first is used. Hence, the case where the offset is zero for all tasks, which is the classical worst-case scenario for single-processor systems, does not necessarily give the worst-case scenario in the multiprocessor case.
Consider the example in Fig. 21, which shows a multiprocessor anomaly where, contrary to usual single-processor scheduling theory, the system is not schedulable using dynamically updated scheduling criteria, but it is schedulable using static scheduling criteria.
This system is not schedulable using the earliest-deadline-first scheduling discipline. However, when using fixed-priority scheduling with pr(τ 1 ) = 2, pr(τ 2 ) = 1 and pr(τ 3 ) = 3, the system is schedulable. This is easily verified using the tool.
We will now examine the system given in Fig. 5. In Section 2.5 it was shown that the minimum number of nodes in the computation tree is $7.6 \cdot 10^{10}$ at depth $O_M + H$ (i.e. t = 22115). The maximal depth that is needed when checking for schedulability is $O_M + 8 \cdot H$ (i.e. t = 176731), with $3.9 \cdot 10^{13}$ nodes. The schedulability was verified using the Uppaal model. The verification used 3.1 GB of memory and took less than 11 minutes on a machine with a 1.8 GHz AMD CPU and 32 GB of RAM.
If the system is changed slightly by adding an extra choice for the execution time of τ 3 , i.e. $wcet_{\tau_3} = 14$, the number of nodes at depth $O_M + H$ will be $8.2 \cdot 10^{10}$ and the number of nodes at depth $O_M + 8 \cdot H$ will be $4.2 \cdot 10^{13}$. When attempting verification of this revised system on the same CPU, the verification aborts after 19 minutes with an "Out of memory" error message after having used 3.4 GB of memory.
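The two depths quoted for the system of Fig. 5 determine both quantities involved. Reading O_M as the maximal offset and H as the hyper-period (an interpretation of the symbols, which are defined in Section 2 rather than here), subtracting the first equation from the second gives:

```latex
\begin{align*}
O_M + H &= 22115, \\
O_M + 8 \cdot H &= 176731, \\
7 \cdot H &= 176731 - 22115 = 154616 \;\Longrightarrow\; H = 22088, \\
O_M &= 22115 - 22088 = 27 .
\end{align*}
```

Under this reading, almost all of the depth that must be explored stems from the eight repetitions of the hyper-period, while the offset part contributes only 27 time units.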
Further experimental results
In order to assess verification of real-life systems we have experimented [18] with a specification of a smart-phone device [29] consisting of 103 tasks on 4 processing elements. By only using worst-case execution times and, therefore, only having one run in the computation tree, we could verify the system using a specialized version of Uppaal where no history is saved and therefore less memory is used. This indicates that the model can also be used in the analysis of larger real-life systems.
A more intuitive implementation of the MPSoC model would use stopwatch automata. However, the absence of stopwatches in the standard Uppaal version moved our focus to discretization of the running time. A version of Uppaal under development that introduces stopwatches (using over-approximations for verification) has, however, been made available to us, and we have made some initial experiments with it. The stopwatch model eliminates the need for the integer variable cr and the clock x and introduces a stopwatch instead. Furthermore, the locking mechanism used for the discretization is eliminated. The experiments with verification of the stopwatch model have given encouraging results, as the example that gave an out-of-memory error with the standard version can be verified with the stopwatch version. All current experiments have provided exact results. The tendency in all experiments so far is that memory consumption as well as verification times are reduced by approximately 40% in comparison to the original model given here.
Summary
We have developed a discrete model of computation capturing the synchronization and timing behavior of an embedded application executing on multiple interconnected processors. Tasks mapped to a processor are executed according to a particular scheduling policy. Analysis of such systems is a major challenge due to the freedom of interrelated choices concerning the application level, the configuration of the execution platform and the mapping of the application onto this platform, which often leads to timing anomalies. We have characterized the maximal size of the computation tree that it suffices to consider when checking for schedulability.
The computational model provides a basis for formal analysis of systems and has been expressed as timed automata and implemented in the Uppaal system. This allows for verification using the Uppaal model checker. We have presented experimental results on rather small systems with high complexity, primarily due to differences between best-case and worst-case execution times. By considering worst-case execution times only, where the system becomes deterministic, we have shown that it is possible to verify a smart-phone application consisting of 103 tasks executing on a platform with four processors.
We are currently extending our model to introduce costs other than timing, in order to formally verify properties such as memory and power usage. Furthermore, initial experiments with a stopwatch model have given interesting results, as the analyses were more efficient than those using the "standard Uppaal tool" and they were precise on the examples conducted, despite the fact that verification is based on over-approximations.
