Abstract
Introduction
Multiprocessor architectures are becoming more and more attractive. Moving away from single-core designs and their heat and thermal problems, chip makers are now shifting to multicore technologies. As these technologies become less and less expensive, very powerful platforms are becoming available. Such platforms are desirable for embedded system applications with high computational workloads such as robot control, image processing, or streaming audio. An additional characteristic of such applications is that they must meet real-time constraints; consequently they require predictable performance guarantees from the underlying operating system. This has led to renewed interest in multiprocessor real-time scheduling: scheduling algorithms together with schedulability analysis. Many theoretical results are available, but they are mainly based on simple hardware and software architecture models [6]. These results are difficult to exploit if their assumptions are not satisfied, which is the case for modern hardware and software architectures (processor heterogeneity, cache management, inter-processor communications, complex software behaviours, system overheads, etc.). Furthermore, in application fields where energy efficiency is critical (battery-based embedded systems), power management techniques (Dynamic Voltage and Frequency Scaling or Dynamic Power Management) are unavoidable and must be taken into account. In such a context, it is very difficult to evaluate and compare scheduling algorithms. We therefore think that a more global approach is needed in order to explore the adequacy between scheduling policies and (hardware and software) architectures. For us it is not a matter of implementing scheduling algorithms on real multiprocessor platforms, but rather of designing an open and flexible framework able to simulate accurately the behaviour of those (hardware and software) elements that affect the performance of such systems, and to compute various real-time metrics in order to analyse the system behaviour and performance. Our simulation tool is called STORM, which stands for "Simulation TOol for Real-time Multiprocessor scheduling" [10], and this paper presents it. Section 2 presents related work and the motivations for our simulation approach. In section 3 we give an external overview of the tool, while section 4 describes the functional architecture of the simulator. Finally section 5 presents an example of use of STORM.
This work has been supported by the French Agence Nationale de la Recherche through the PHERMA project (Contract ANR ANR-06-ARFU06-003). See http://pherma.irccyn.ec-nantes.fr.
Related work and motivation of our work

Short overview of simulation tools
In comparison with the impressive number of publications and results on real-time scheduling over more than 30 years, few tools are available to apply them. On the one hand, some commercial tools exist for the complete design and analysis of real-time systems. Obviously they are not freely available. As an example, RapidRMA (from Tri-Pacific Software Inc.) allows the use of the Rate-Monotonic Analysis (RMA) methodology in association with a UML modelling environment and tools for fixed-priority scheduled CORBA systems. Another example is TimeWiz (from TimeSys Corp.): it is an integrated design environment for building predictable embedded real-time systems using an RMA-based timing analysis. On the other hand, open source or free products are available from the academic community. Some of them are the result of PhD studies and lack maintenance and flexibility. As an example we can cite [3], which proposes a simulation language and a tool for the simulation of scheduling algorithms on monoprocessor systems. The simulation language is used for the description of the software architecture (tasks and their characteristics) and the tool allows statistical analysis of the scheduling algorithms on task sets. Another example is XRMA [5]. This tool is intended to automate the analysis and to support the performance verification of real-time systems. The system is composed of tasks executing on multiple processors which communicate using the Controller Area Network (CAN) fieldbus. The tool allows the computation of both upper and lower bounds on end-to-end response time in such distributed systems. Some other tools are still under development and are supported. Some examples are given hereafter; all of them are free open source software.
YASA
YASA ("Yet Another Scheduling Analyser") [13] is a framework developed at the University of Rostock (Germany). Its goals are the simulation and analysis of real-time applications on different platforms (the simulator and RT-Linux). The framework supports various kinds of real-time scheduling algorithms and synchronisation protocols which can be simulated for evaluation purposes. Furthermore, the implementations of these algorithms can be tested in a real-time operating system.
MAST
Gonzalez Harbour et al. propose MAST ("Modeling and Analysis Suite for real-Time applications") [7, 4], a model for representing the temporal and logical elements of real-time applications. A system representation using this model is analyzable through a set of tools including worst-case schedulability analysis and discrete-event simulation (SIM MAST). Although the current model mainly includes fixed-priority systems, it is conceived as an open model and is extensible to accommodate other kinds of systems.
Cheddar
Cheddar [8, 2] is an Ada framework which implements most of the classical theoretical methods of real-time scheduling. Systems to be analyzed can be described with AADL [1] or with a specific language. Cheddar provides two kinds of features: a simulation engine and some feasibility test computations. The Cheddar simulation engine provides tools to design specific schedulers and task models.
TORSCHE
TORSCHE [11, 12] is a scheduling toolbox for Matlab developed at the Czech Technical University in Prague. Presently TORSCHE offers many algorithms for manufacturing processes (job-shop scheduling) and graph algorithms. Real-time scheduling is also addressed but currently only the very basics are supported (fixed-priority scheduler for periodic tasks).
Motivation of our work
The need to develop STORM came from the work of the PHERMA research project. This project addresses multiprocessor architectures whose performance must be evaluated in terms of energy efficiency for specific hardware and software architectures.
Such architectures do not comply with the academic models on which theoretical scheduling studies are conducted. Thus theoretical results are not applicable to these architectures, and existing simulators do not offer the entities and functionalities we need. So we decided to favour simulation and to design a corresponding tool. Some of the main specifications of STORM are highlighted hereafter:
• it is designed specifically for multiprocessor architectures (composed of homogeneous or heterogeneous processors); monoprocessor architectures being the simplest case;
• there is a strong need for it to be able to take into account: i) the features of hardware architecture: multicore design, multiprocessor architecture with shared memory, distributed architecture with communication network, memory architecture (L1 and L2 caches, banked memory); ii) the features concerning energy consumption for the processors and memories, in particular the capabilities for DPM (Dynamic Power Management) and DVFS (Dynamic Voltage and Frequency Scaling);
• it must be a flexible and open tool: i) "flexible" mainly means that new components can be added through well-defined APIs (for instance, new scheduling policies; for that reason the scheduler function is outside the simulation kernel and under the user's responsibility); ii) "open" means that the input/output formats are known; for that reason inputs and outputs will be formatted using xml;
• lastly, it must provide a simulation language to drive experiments (statistical studies or domain explorations) via a simulation controller. Upstream from the simulator, this controller computes the inputs, runs the simulations, and downstream computes the required metrics from simulation results.
The STORM software (release 3.0) is freely available at http://storm.rts-software.org. It is written in the Java programming language, which makes it portable.
The STORM simulator
For the time being, engineering efforts on STORM have concerned its simulator component since it is the core element of the platform. For a given "problem", i.e. a software application that has to run on a (multiprocessor) hardware architecture, this simulator is able to "play its execution" over a specified time interval, as faithfully as possible with respect to timing, while taking into account the requirements of the tasks, the characteristics and operating conditions of the hardware components, and the scheduling rules. As shown by Figure 1, the STORM simulator needs the specification of the studied problem as input. It is based on some reference components and their characteristics, available in attached libraries.
The simulation itself consists in building a chronological trace of all the "run-time events" that occurred during the simulated execution, from which it is subsequently possible to compute many outputs, either user readable in the form of diagrams or reports, or machine readable and intended for a subsequent analysis tool. More information is available in [10].
The STORM simulator relies on the definition of a set of Java classes. They provide additional methods that enable the use of dynamic attributes mapped on hashtables. By dynamic attributes, we mean attributes that are not explicitly declared as members in the Java code of a class, but that are created, initialized and then updated through specific function calls inside the methods of this class. Besides classical object-oriented programming structures and schemes, reflection has been used for the design of our tool. Reflection [9] is a feature of the Java programming language which allows an executing Java program to examine itself and to manipulate its internal properties. It makes STORM quite open and flexible since it allows new classes or new attributes to be added without calling its structure into question.
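The two mechanisms described above can be illustrated with a minimal sketch. It is not taken from the STORM sources: the class names `Entity`, `DemoProcessor` and `EntityFactory`, as well as the attribute names, are hypothetical and only show how hash-table-backed attributes and class loading by name might fit together.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical base class: none of these names come from the STORM code base.
class Entity {
    // Dynamic attributes are stored in a hash table instead of declared fields.
    private final Map<String, Object> attributes = new HashMap<>();

    void setAttribute(String name, Object value) { attributes.put(name, value); }
    Object getAttribute(String name) { return attributes.get(name); }
    boolean hasAttribute(String name) { return attributes.containsKey(name); }
}

// Hypothetical library component, designated by its class name in an input file.
class DemoProcessor extends Entity {
    DemoProcessor() { setAttribute("state", "idle"); }
}

class EntityFactory {
    // Reflection: load a class from its name and call its no-argument constructor.
    static Entity create(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (Entity) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Unknown component: " + className, e);
        }
    }
}

class DynamicAttributesDemo {
    public static void main(String[] args) {
        Entity cpu = EntityFactory.create(DemoProcessor.class.getName());
        cpu.setAttribute("frequency", 400); // attribute created on first use
        System.out.println(cpu.getAttribute("state") + " @ " + cpu.getAttribute("frequency"));
    }
}
```

With this kind of design, adding a new component type only requires a new class on the classpath; the factory itself does not change.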
Inputs
The system to be simulated is specified in an xml file. At present, this file is written manually by the user, but it could be generated directly from any architecture specification framework, provided that a model transformation tool is supplied. As shown by the example of Figure 2, some specific tags delineate the parts of the specification and some specific attributes characterise their elements. In the given example, the hardware architecture is composed of two identical processors (see the "CPUS" and "CPU" tags). Here a reference is made to the "CT11MPCore" component, which is the identifier of a processor component in the library. It is the name of a Java class which implements such a processor. The scheduling strategy is a global non-preemptive EDF one (as for processors, it is designated by the Java class name of this scheduler, see the "SCHED" tag). Three tasks are present in the software architecture (see the "TASKS" and "TASK" tags). All of them reference the same task type in the library (see the "className" attribute). Their own period, first activation date (if not equal to zero), worst case execution time (or WCET) and deadline (if not equal to the period) are given. Moreover, a data item of 10 bytes is declared, which is sent from task T1 to task T2 (see the "DATAS" and "DATA" tags). The rate attribute (rate = "2") means that 2 executions of task T1 are necessary for task T2 to be allowed to begin. As will be explained later, other types of processors, tasks and scheduling strategies are of course available, and these libraries can be extended as often as desired.
Graphical User Interface
The STORM simulator offers a user-friendly GUI which is composed of various windows and enables the user:
1. to control the simulator via a command window that looks like a console, where the user types the command to be executed. Explaining all the available commands is beyond the scope of this paper. As examples: -exec <file name.xml> asks for the simulation of the problem specified in that input xml file; -taskgraph displays the tasks and data of the considered software architecture; -plotall displays the Gantt diagrams for all the tasks and processors. More details can be found in the user manual [10];
2. to visualize some simulation results (as it will be discussed in the next section).
All those windows can be easily handled through some key combinations or function keys. Figure 3 shows a screenshot of this GUI.
Outputs
During the simulation process, a very large set of events that have punctuated the life of the studied system are recorded with their occurrence date and some other context information. Such events are for example: the release of a job, its start of execution on a processor, its preemption, its completion, a processor frequency change, etc. From such execution traces, either directly or with some further computation, it is thus possible to compute the real-time metrics required for analysis.
Graphical outputs
At the moment, various full color diagrams are available for showing some basic results of a simulation. Figure 3 is an illustration. They are displayed as the result of specific console commands. The most important ones are listed hereafter:
1. the task graph that shows possible data dependencies between tasks;
2. the Gantt diagram of a task that shows with specific marks the important dates of the life of the task (first activation, next job releases if it is periodic, deadline, completions, deadline misses, etc.) and its running intervals;
3. the Gantt diagram of a processor that shows its idle and busy intervals (together with a reference to the running task);
4. the CPU load diagram of a processor that draws over time the ratio computed from the very beginning of the simulation (busy slots for this processor / number of slots);
5. the CPU power consumption diagram of a processor that shows over time its electrical power (in watts) computed according to the physical characteristics of its chip and its functioning states;
6. the AET graph of a task that gives the computation time value(s) for its job(s), i.e. the actual computation time requirement of the job(s) as set by the simulator kernel at each task release.
For all these time diagrams, time shift and zoom facilities are proposed. This list is not exhaustive and it can be easily extended depending on the needs.
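As an illustration of the CPU load diagram of item 4, the following minimal sketch computes, for each slot, the ratio of busy slots to elapsed slots since the beginning of the simulation. The class and method names are hypothetical, and the boolean array stands in for whatever busy/idle information the simulator actually records.

```java
class CpuLoad {
    // busy[t] is true when the processor executes a task during slot [t, t+1).
    // Returns load[t] = (number of busy slots in [0, t+1)) / (t + 1).
    static double[] cumulativeLoad(boolean[] busy) {
        double[] load = new double[busy.length];
        int busySlots = 0;
        for (int t = 0; t < busy.length; t++) {
            if (busy[t]) {
                busySlots++;
            }
            load[t] = (double) busySlots / (t + 1);
        }
        return load;
    }
}
```

For a processor that is busy, idle, busy, busy over four slots, the successive load values are 1, 1/2, 2/3 and 3/4.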
Other outputs
Simulation results can be recorded in files too, either to establish a user readable report on the performed simulation (in pdf format), or to gather the data required as inputs by the analysis tool that should follow the simulation (in txt, xml, or another format). The approach is very simple and flexible. It only requires providing an xml configuration file that specifies the expected content using a simple syntax. It is possible here to refer to all the public entities and attributes used by the simulator.
This ability of STORM to receive and send data in an xml format makes it easy to connect with other CASE tools. It is an open framework.
The architecture of the STORM simulator
From a functional point of view, the architecture of the STORM simulator is composed of a set of entities built around a simulation kernel (see Figure 4). The software (respectively hardware) entities are derived directly from the software (respectively hardware) architecture models specified as input. The system entities represent some runtime and operating system components.
The simulation kernel
From a functional point of view, the kernel provides an intentionally limited number of basic services for managing simulation time and some interactions between entities.
A simulation is carried out in a discrete way, i.e. the overall simulation interval is cut into a sequence of unitary slots [0,1), [1,2), ... , [t,t+1), etc., and the simulation moves forward at each instant 0, 1, ... t, t+1, etc. Thus, at t, the next simulation state (for instant t+1) is computed from the current one (of instant t) and the pending simulation events at time t. One of the kernel services is a time service, which notifies the simulation entities of the passing of time. Thus, at each slot, all the simulation entities are informed that a unit of time has elapsed (of course, this can be ignored by an entity that is not concerned with time). Another kernel service is a watchdog service. It enables any entity to ask the kernel to arm a watchdog mechanism so that a specified function of a specified entity will be called when the specified delay expires.
Lastly, the kernel offers a kernel message service. On some state changes or simulation event occurrences, an entity has to inform the kernel by asking it to register a predefined kernel message. The kernel is then in charge of processing such a message. Depending on the context, this may or may not lead it to send further information to other entities.
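The watchdog service can be pictured with the following minimal sketch. It is an assumption about one possible realisation and not the STORM implementation: the class `WatchdogKernel` and its methods are hypothetical, and a `Runnable` stands in for "a specified function of a specified entity".

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical kernel fragment: only the time variable and the watchdogs.
class WatchdogKernel {
    private static final class Watchdog {
        final long expiry;
        final Runnable callback;
        Watchdog(long expiry, Runnable callback) {
            this.expiry = expiry;
            this.callback = callback;
        }
    }

    private final List<Watchdog> watchdogs = new ArrayList<>();
    private long now = 0;

    long now() { return now; }

    // Arms a watchdog that fires `delay` slots after the current instant.
    void arm(long delay, Runnable callback) {
        watchdogs.add(new Watchdog(now + delay, callback));
    }

    // One slot: fire the watchdogs that expire at the current instant, then advance time.
    void step() {
        List<Watchdog> due = new ArrayList<>();
        for (Iterator<Watchdog> it = watchdogs.iterator(); it.hasNext();) {
            Watchdog w = it.next();
            if (w.expiry <= now) {
                due.add(w);
                it.remove();
            }
        }
        // Callbacks run after the scan, so they may safely arm new watchdogs.
        for (Watchdog w : due) {
            w.callback.run();
        }
        now++;
    }
}
```

A watchdog armed at instant 0 with a delay of 2 is called during the slot that starts at instant 2.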
The simulation entities
The software entities
Software entities stand for the tasks and data that compose the software architecture for which the simulation is conducted. There are as many task and data link entities as specified in the xml input file. It is important to note that these entities capture the behaviour of the real components they represent only from a control viewpoint and not a functional one, i.e. no application programs run for the tasks, and no actual data values are exchanged. It means that a task is considered only from the execution time point of view, and a data item only from the point of view of the task synchronizations it may give rise to. STORM provides various task types (depending on their activation and deadline control conditions). New task behaviours can be easily specified and added to the tool. Whatever its type, the generic state diagram of a task entity inside the simulator is shown in Figure 5. At the very beginning, a task is in the unexisting state. As soon as its first activation occurs, it becomes ready and falls under the control of the task list manager (see section 4.2.3). Depending on the scheduling decisions, it may run (running state) and possibly be preempted. On its definitive completion (case of an aperiodic task), the task goes back into the unexisting state. On a job completion (for a sporadic or periodic task), it becomes waiting until all its execution conditions are met, i.e. only its next release in the case of an independent task, but also the availability of all the data it requires in the case of a consumer task. The simulator is in charge of setting the computation time of the jobs each time they are released. By default, the actual execution time (AET) of a newly arriving job is simply set equal to the WCET of its task (the value that is specified in the xml input file). But, thanks to the specific attribute calcAET (related to the definition of a task in the xml input file), it is possible to specify another selection strategy (BCET, random value in [BCET .. WCET], value read from an input data log file, etc.).
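The task life cycle and the choice of the AET can be summarised by the sketch below. It only encodes the transitions named in the text (the actual state diagram may contain more), and the names `TaskState`, `TaskModel` and `AetMode` are hypothetical; in particular they are not the values accepted by the calcAET attribute.

```java
import java.util.Random;

// All names below are hypothetical; they do not come from the STORM code base.
enum TaskState { UNEXISTING, READY, RUNNING, WAITING }

class TaskModel {
    enum AetMode { WCET, BCET, RANDOM }

    // Transitions named in the text; the real state diagram may contain more.
    static boolean canMove(TaskState from, TaskState to) {
        switch (from) {
            case UNEXISTING: return to == TaskState.READY;       // first activation
            case READY:      return to == TaskState.RUNNING;     // elected by the scheduler
            case RUNNING:    return to == TaskState.READY        // preemption
                                 || to == TaskState.WAITING      // job completion
                                 || to == TaskState.UNEXISTING;  // definitive completion
            case WAITING:    return to == TaskState.READY;       // execution conditions met
            default:         return false;
        }
    }

    // Chooses the actual execution time of a newly released job.
    static int nextAet(AetMode mode, int bcet, int wcet, Random rng) {
        if (bcet > wcet) {
            throw new IllegalArgumentException("bcet must not exceed wcet");
        }
        switch (mode) {
            case BCET:   return bcet;
            case RANDOM: return bcet + rng.nextInt(wcet - bcet + 1); // uniform in [bcet, wcet]
            default:     return wcet;                                // default strategy
        }
    }
}
```

Passing the random generator as a parameter keeps a statistical campaign reproducible: the same seed yields the same sequence of execution times.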
The specification of data in the xml input file of STORM is a way to introduce functional dependence constraints in the form of precedence relations between some producer and consumer tasks. Indeed a task which is the destination of a data item has to stay in the waiting state until (at least) this data becomes available, i.e. in the simplest case, until the completion of the task which is the source of this data. A data rate (if specified in the xml input file) may also need to be taken into account to determine when the data is really available for consumption. Furthermore the convergence of several such data links towards the same consumer task has to be understood as a conjunction of precedence relations (this situation occurs in the example presented in Section 5).
Thus, the function of a data link entity is simply to record the indications of data production the kernel gives to it (as a consequence of producer task completions). In accordance with its possible specified rate attribute, it has to determine when it is available for consumption. Then it informs the task list manager of its availability. On the other side, on receiving its consumption indication, it has just to reset its state.
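A data link entity with a rate attribute can be sketched as a simple counter. The class `DataLink` and its methods are hypothetical, and resetting the counter to zero on consumption is one possible reading of "reset its state".

```java
// Hypothetical data link entity: counts producer completions against a rate.
class DataLink {
    private final int rate;   // producer completions required before consumption
    private int produced = 0;

    DataLink(int rate) {
        if (rate < 1) {
            throw new IllegalArgumentException("rate must be at least 1");
        }
        this.rate = rate;
    }

    void onProduced() { produced++; }                    // producer task completed
    boolean isAvailable() { return produced >= rate; }   // ready for consumption
    void onConsumed() { produced = 0; }                  // assumed meaning of "reset its state"

    // Several links converging on one consumer form a conjunction of precedences.
    static boolean allAvailable(DataLink... inputs) {
        for (DataLink link : inputs) {
            if (!link.isAvailable()) {
                return false;
            }
        }
        return true;
    }
}
```

With rate = 2, the link becomes available only after the second completion of its producer, which matches the T1 to T2 example given for the input file.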
The hardware entities
Up to now, hardware architectures are composed of processors only. In the future, they will most probably be extended to take into account other hardware components which have some impact on the behaviour and performance of the overall system, such as memory banks. Each processor of the considered hardware architecture has its equivalent processor entity. No control is modelled in such an entity; instead it encapsulates some properties describing its current activity (non-operational, idle, busy). In the case of a processor with DVFS (Dynamic Voltage and Frequency Scaling) and DPM (Dynamic Power Management) capabilities, the corresponding entity owns additional properties about its current operating conditions (power consumption mode, voltage, frequency) together with the functions for updating them. However, so as to be as adaptable as possible, it is not for the processor entity to deal with mode or frequency/voltage change decisions. This requires an additional entity in charge of implementing some CPU monitoring laws: for instance, a scheduler entity which implements some energy-aware scheduling algorithm, or a DPM entity able to adjust the processor working status according to its current workload. On the other hand, when such changes of operating conditions occur and induce some time penalties, it is the responsibility of the processor entity to charge them and to delay the task execution accordingly.
The system entities
At the moment, system entities are the task list manager, the scheduler and the power manager.
The role of the task list manager entity is twofold. It controls the very first release of each task: after a successful comparison between a task activation date and the current time, it produces a specific activation kernel message (that is then relayed by the kernel towards the concerned task entity and the scheduler entity too). Moreover it manages the functional constraints between tasks that appear when data dependencies have been specified: at each simulation slot, depending on the availability or consumption of data, it produces specific task block or unblock kernel messages.
The scheduler entity is in charge of sharing the processor(s) between the ready tasks. Its election rules depend on the scheduling strategy it implements. It owns its own queue(s) of ready tasks and manages them using the task state change indications and the tick indication that the kernel sends to it. The election rules it implements may lead it to run and/or to preempt a task. STORM provides most of the "standard" real-time global event-driven or time-driven schedulers. Other well-known scheduling algorithms such as Least-Laxity-First (LLF), Enhanced-Least-Laxity-First (ELLF) and PD2 (Pfair) will soon be available. In any case there is no limitation on introducing new scheduling policies.
The power manager carries out the behaviour of the elements of the processor architecture (DPM, DVFS) that may impact the execution. It is responsible for adjusting the simulation according to the overheads incurred, from both a time and an energy point of view.
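To make the notion of election rules more concrete, here is a minimal sketch of a global EDF election: among the ready jobs, those with the earliest absolute deadlines are selected, up to the number of processors. The names `ReadyJob` and `GlobalEdf` are hypothetical and do not reflect the STORM scheduler API; the sketch also ignores preemption costs and the non-preemptive variant.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical view of a ready job as seen by a scheduler entity.
class ReadyJob {
    final String task;
    final int absoluteDeadline;
    ReadyJob(String task, int absoluteDeadline) {
        this.task = task;
        this.absoluteDeadline = absoluteDeadline;
    }
}

class GlobalEdf {
    // Elects at most `processors` ready jobs with the earliest absolute deadlines.
    // List.sort is stable, so jobs with equal deadlines keep their queue order.
    static List<ReadyJob> elect(List<ReadyJob> ready, int processors) {
        List<ReadyJob> sorted = new ArrayList<>(ready);
        sorted.sort(Comparator.comparingInt((ReadyJob j) -> j.absoluteDeadline));
        return new ArrayList<>(sorted.subList(0, Math.min(processors, sorted.size())));
    }
}
```

Jobs that are ready but not elected stay in the ready queue; a running job that is no longer among the elected ones would be preempted.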
The behaviour of the kernel
Due to our discrete approach for the simulation process, the behaviour of the simulation kernel is cyclic, with one cycle per slot. After a necessary initial step in which all the simulation entities are created and the global time variable is initialized, the loop of cycles is entered. Each cycle is split into successive steps that come down to:
1. manage the watchdogs, i.e. to detect those possible watchdogs (cf. the watchdog service in section 4.1) that expire and to call the specified function of the specified entity;
2. process all the currently pending kernel messages (as said before, this mostly consists in informing other simulation entities of the current situation);
3. call upon the scheduler. We underline that it doesn't mean that the scheduling selection rules have to be applied at each slot: that is part of scheduling policy type and implementation choice;
4. manage time passing, i.e. to inform all the simulation entities of a tick (cf. the time service in section 4.1);
5. ask the task list manager to operate;
6. increment the time variable.
We want to point out that at each step, any of the entities that are concerned may ask for the recording of kernel message(s). Depending on the moment of their creation, those messages are processed either in the current slot, or in the next one. Furthermore, all along its cycles, the kernel records all the event occurrences and associated data that are relevant for building the simulation outputs.
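The cycle described above can be sketched as follows. This is an illustrative skeleton under simplifying assumptions, not the STORM kernel: messages are plain strings, the watchdog, scheduler and task list manager steps are reduced to trace entries, and all names (`SimEntity`, `MiniKernel`) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical entity interface: an entity is only notified of ticks here.
interface SimEntity {
    void onTick(long time);
}

class MiniKernel {
    final List<String> trace = new ArrayList<>();            // recorded events
    private final Deque<String> messages = new ArrayDeque<>();
    private final List<SimEntity> entities = new ArrayList<>();
    private long time = 0;

    void register(SimEntity entity) { entities.add(entity); }
    void post(String message) { messages.add(message); }

    void runCycle() {
        trace.add(time + ":watchdogs");             // 1. expire watchdogs (placeholder)
        int pending = messages.size();              // 2. process the messages pending now
        for (int i = 0; i < pending; i++) {
            trace.add(time + ":msg " + messages.poll());
        }
        trace.add(time + ":schedule");              // 3. call the scheduler (placeholder)
        for (SimEntity entity : entities) {         // 4. inform every entity of the tick
            entity.onTick(time);
        }
        trace.add(time + ":tasklist");              // 5. task list manager (placeholder)
        time++;                                     // 6. increment the time variable
    }

    void run(long duration) {
        while (time < duration) {
            runCycle();
        }
    }
}
```

In this sketch a message posted during step 4 is only processed at step 2 of the following cycle, which illustrates why the moment of creation of a message determines the slot in which it is handled.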
Example
In this section we show the use of STORM for the in-depth analysis of a problem. The problem is fairly complex because it consists in finding the solution that gives the best result (best throughput and best QoS) while consuming the minimum of energy. The example is inspired by a more complex problem: an H264 decoder (image coding / decoding) that has been studied in the PHERMA project. Due to space limitations, a simplified form is presented here.
The problem
The experiment we have conducted concerns the software architecture depicted in Figure 6 (this task graph was obtained using a suitable command). It is composed of 8 tasks (referenced "Task A" to "Task H") and 10 data links (shown as top-down transitions between tasks). It is a data-flow graph with precedence constraints. The periodic task "Task A" periodically generates (every 20 ms) some data on which the tasks "Task B" to "Task E" subsequently work in parallel. Their outputs (by pair) are required as inputs for the execution of the tasks "Task F" and "Task G". Finally, from the joined outputs of these two tasks, "Task H" computes some transformation so as to produce the final result. This last task is a periodic task because a deadline can thus be associated with it, which serves for the measure of QoS (count of missed deadlines). Some parameters of the tasks are given in Table 1. The tasks "Task A" and "Task H" are periodic (20 ms) and, for the experiments, they will be successively of type "PTask WAM" (With Activation Memory) or of type "PTask NAM" (No Activation Memory). All tasks have an identical static priority and the scheduling algorithm is of type FP P Scheduleur (Fixed Priority full Preemptive). All the processors are identical and of type CT11MPCore. Our intent is to evaluate the minimum number of processors required in a multiprocessor architecture so as to maximise the output flow (i.e. the number of "Task H" outputs relative to the number of "Task A" activations), while minimising its jitter (i.e. the variation of the dates at which the "Task H" outputs are produced) and minimising the electrical power consumption. Such a software architecture and our performance evaluation objective are complex enough to prefer a simulation study to a (necessarily) approximate theoretical one.
The experiment and results
So as to determine the minimum number of processors, simulations have been successively conducted with an increasing value from 1 up to 5 processors (the performance metrics do not change beyond 4 processors).
So as to conduct a statistical evaluation, 100 simulations (with a total duration equal to 1 second for each) have been run, during which the computation times of all the jobs of all the tasks have been set randomly between the BCET and WCET values of the tasks (see Section 4.2.1). During these simulations all the periodic tasks were of PTask WAM type. Then another set of 100 simulations has been conducted, all the periodic tasks being of PTask NAM type. In order to have a measure for the QoS (number of missed deadlines of "Task H"), the deadline of this task has been fixed to 25 ms, i.e. 5 ms beyond its period. The measure for the throughput is the number of executions of "Task H" during the simulation duration: with a period of 20 ms over 1 second, the maximum value is 50. The results lead to the following observations:
1. whatever the type of task, the best results (throughput and QoS) are obtained with 4 processors, but the power is then maximal;
2. using the PTask WAM task type, one gets more missed deadlines, therefore the QoS is worse, but the throughput is better from 2 processors onwards;
3. with the configuration using the PTask NAM task type and 2 processors, one consumes 25% less than with 4 processors; the throughput is only 2% lower than the maximal throughput; the QoS is good. It can be a good compromise.
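The two measures used in this experiment can be computed from a simulation trace as in the sketch below. The class `OutputMetrics` and the array-based representation of release and completion dates are hypothetical; the sketch assumes that one completion date is known for every released job of "Task H".

```java
// Hypothetical helper computing the throughput bound and the QoS measure.
class OutputMetrics {
    // Maximum number of outputs of a periodic task over the simulation duration.
    static int maxOutputs(int durationMs, int periodMs) {
        return durationMs / periodMs;
    }

    // A job misses its deadline when it completes later than release + relative deadline.
    static int missedDeadlines(int[] releaseMs, int[] completionMs, int relativeDeadlineMs) {
        int missed = 0;
        for (int k = 0; k < releaseMs.length; k++) {
            if (completionMs[k] > releaseMs[k] + relativeDeadlineMs) {
                missed++;
            }
        }
        return missed;
    }
}
```

For instance, `maxOutputs(1000, 20)` gives the bound of 50 quoted above, and a job released at 0 ms that completes at 27 ms counts as one missed deadline with a 25 ms deadline.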
If necessary, it is possible to look closely at the behaviour of the tasks using a Gantt diagram. As an example, Figure 7, obtained with the PTask NAM task type and 2 processors, shows the deadline miss of "Task H" during the first period. This missed deadline occurs because it is not possible to have a high degree of parallelism during this period.
Conclusion
In this paper, we presented the simulator component of STORM, a framework under development for real-time multiprocessor scheduling evaluation. The ultimate intent of our work is to evaluate and compare the schedulability performance and energy efficiency of scheduling algorithms, and to investigate their adequacy with regard to the hardware and software architecture features of the systems. Presently, the STORM simulator is able to simulate accurately the execution of a set of tasks over a multiprocessor system. Tasks may exhibit various behaviors and be independent or not. Various scheduling strategies are supported too. As a result, the simulator is not only able to determine the schedulability of the studied system but also to characterize its behavior with some measurements for further analysis. It is a flexible, portable, and open tool. For now, the STORM software is freeware under a Creative Commons License; subsequently it will be made available as open source. At the present time, our work on STORM is concerned with:
• The energy and power aspects: i) we are expanding the hardware component library with processors supporting DVFS and DPM; ii) we are studying how to introduce banked memory patterns; iii) we are integrating the corresponding energy/power management strategies as new system entities;
• The memory architecture modelling: since memory access time can significantly affect the global performance of a multiprocessor system, we are investigating how to model the behaviour of (different levels of) cache and external memory, and in particular the overhead induced by their management;
• The evaluation process automation: we are defining a simulation language to drive the experiments. Thus the user would be able to describe a "domain" for the simulation inputs and to specify the metrics to be measured as a result. Then the tool would automatically build a statistically representative set of compliant test cases, run their simulations, and output the statistical results. Indeed the experimentation presented in section 5 was done using a very simplified version of this language.
