Abstract-This paper presents a complete case study, named ROSACE (Research Open-Source Avionics and Control Engineering), that goes from a baseline flight controller, developed in MATLAB/SIMULINK, to a multi-periodic controller executing on a multi/many-core target. The interactions between control and computer engineers during the development steps are highlighted, in particular by investigating several multi-periodic configurations. From these experiments, we deduce ways to improve the discussion between engineers in order to ease the integration on the target. The whole case study is made available to the community under an open-source license.
I. INTRODUCTION
The purpose of the paper is twofold: first, to provide an open-source avionic control engineering case study¹ that can be used as a benchmark, and second, to illustrate a way of translating such a high-level SIMULINK [1] specification down to multi-threaded code that executes on a multi/many-core target and complies with the high-level requirements. This case study is analyzed with respect to real-time implementation and to ways of reducing the integration effort as much as possible while preserving the correct behaviour.
A. Design of a parallel flight controller
We rely on a standard avionic development process but use recent languages and tools to design a parallel flight controller on a target that is challenging to embed. It is of paramount importance to prepare the embedding of multi/many-core COTS [2], [3] because they will be the only available processors on the market and because they dramatically lack predictability. The design of a flight controller works as follows:
Step 1: Production of a multi-periodic controller. A multi-periodic flight controller is developed in SIMULINK around a given operating point [4]. The methodology to obtain such a controller is described in Section II-A. Controllers are usually verified and validated against several properties (i.e. stability, performance, robustness). Since our objective is to validate the real-time aspects, we mainly focus on time-domain performance specifications on both the transient response and the steady-state response. Four types of properties are analyzed on the system response to a step input:
P1: settling time, that is the time required to settle within 5% (resp. 1% or 2%) of the steady-state value;
P2: overshoot, that is the maximum value attained minus the steady-state value;
P3: rise time, that is the time it takes to rise from 10% to 90% of the steady-state value;
P4: steady-state error, that is the difference between the input and the output for a prescribed test input as t → ∞.
The time-domain performance properties are illustrated in Figure 1 for a step input. At this stage, these properties are analyzed through SIMULINK simulations.
¹ The complete case study can be found on the svn repository https://svn.onera.fr/schedmcore/branches/schedmcore-RTAS2014/Case_Study_RTAS.
Step 2: Coding. The discrete SIMULINK specification is then translated within the PRELUDE/SCHEDMCORE framework. To do so, each block, executing at a given rate, is translated into sequential C code and the multi-periodic assembly is translated into a PRELUDE program [5]. Currently, those translations are manual, but future work could consider automatic translation using the tools detailed in Section V.
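The four properties P1 to P4 defined in Step 1 can be computed directly from a sampled step response. The following is a minimal sketch, not the tooling used in the case study: the function name `step_metrics`, its arguments and the choice of taking the last sample as the steady-state value are assumptions made for illustration, and no interpolation is performed between samples.

```python
def step_metrics(t, y, reference=1.0, band=0.05):
    """Estimate P1-P4 from a sampled response y(t) to a positive step.

    Assumptions (illustrative only): the last sample approximates the
    steady-state value, and crossing times are taken at sample instants.
    """
    y_ss = y[-1]

    # P1: settling time = first sample after the last excursion outside the band
    outside = [i for i, yi in enumerate(y) if abs(yi - y_ss) > band * abs(y_ss)]
    settling_time = t[outside[-1] + 1] if outside else t[0]

    # P2: overshoot = maximum value attained minus the steady-state value
    overshoot = max(y) - y_ss

    # P3: rise time = time to go from 10% to 90% of the steady-state value
    t10 = next(ti for ti, yi in zip(t, y) if yi >= 0.1 * y_ss)
    t90 = next(ti for ti, yi in zip(t, y) if yi >= 0.9 * y_ss)
    rise_time = t90 - t10

    # P4: steady-state error = commanded value minus steady-state output
    steady_state_error = reference - y_ss

    return settling_time, overshoot, rise_time, steady_state_error


# Usage on a synthetic, monotonic response y[k] = 1 - 0.5**k sampled every 1 s
t = [float(k) for k in range(21)]
y = [1.0 - 0.5 ** k for k in range(21)]
settling, overshoot, rise, error = step_metrics(t, y)
```

Applying the same function to the SIMULINK traces and to the traces produced by the implementation gives one way of comparing the two levels on identical criteria.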
The designer can then simulate the code with the SCHEDMCORE toolbox [6]. The code has been instrumented in order to generate SIMULINK-compliant traces, so that the designer can compare the traces obtained by simulating the implementation with those of the high-level design. Several assembly versions can be constructed by varying the periods and the precedence constraints in order to ease the integration. This stage is described in Section III.
Step 3: Validation on the target. Finally, the designer can integrate the implementation on the real target. To do so, the multi/many-core must be used in a predictable way, for instance by relying on an appropriate execution model [7]. Such a model is a set of rules to be followed by the designer in order to avoid, or at least reduce, unpredictable behaviours. In this work, we reuse some ideas from the literature: off-line non-preemptive partitioned schedule, static storage of code and variables in the caches, and explicit communication using the network on chip (NoC). The experiments have been made on the TILERA TILEMPOWERGX-36 platform [8].
To validate the performance with regard to the environment dynamics, there are mainly three approaches: (1) hardware-in-the-loop validation; (2) connecting the controller executing on the multi/many-core with the SIMULINK dynamic model; (3) implementing the dynamics model on the multi/many-core as well, with a sufficiently high frequency to represent continuous dynamics. We have chosen the last solution because the timings to interconnect the controller and the aircraft dynamics on the many-core are small and bounded, while interfacing the target with SIMULINK would not be easy to prove correct. Again, the traces obtained during the real execution are compared with the high-level requirements. This stage is described in Section IV.
B. Lessons learned
The described experiments helped us improve our understanding of the difficulties (1) in the discussion between control engineers and computer scientists and (2) in highlighting the link between the high-level design and the low-level real-time choices.
a) Where do antagonist requirements come from?: From the control engineer's point of view, the closer the controller is to the real dynamics, the more confidence he will have in the result. Being close to the real dynamics means executing controller sub-functions in sequence as fast as possible (generating precedences) and as often as possible (generating high frequencies). This results in very strong real-time constraints for the integration. From the integrator's point of view, the less severe the real-time requirements are, the safer the integration will be. Indeed, in practice, reducing frequencies decreases the CPU usage, freeing time for other applications. Reducing precedences among tasks increases the schedulability. Therefore, a compromise between both sides must be found.
b) Ease the discussion: The taxonomies and concerns differ in the two worlds. Control engineers consider (1) no resource limitation in general. They are however aware that delays will be introduced by the communication network (between sensors/actuators and CPUs) and that restrictions on the frequencies will be imposed by the CPUs; (2) properties such as stability, robustness and performances; (3) validation and verification on the closed loop. Computer engineers consider (1) mainly the provisioning of the finite resources; (2) local properties such as WCET computation, schedulability and response time; (3) real-time analyses on the controller solely. In particular, they do not handle the high level properties, such as the properties P1-P4 of our flight controller.
There are two ways to ease the discussion. The first consists in providing tools and methods to the control engineers to precisely determine the real-time behaviours of the controller (WCET or schedulability). Such an approach does not exist yet, but there are good practices. For instance, standard controllers avoid jitter because it is assumed to degrade the control performance. The second approach consists in maintaining a common view during the development by taking into account the high-level properties at each development step. We follow the second way by offering a first piece of common information in the form of traces. This allows the engineers to quickly analyse the behaviour of several low-level designs and to check whether the performance properties are still valid.
c) Where compromise can be found: A civil flight controller is quite robust and can support a relaxed implementation, as illustrated in the paper. In the future, flight controllers will be more reactive due to the use of composite structures, the reduction of fuel consumption and the intensification of the traffic. On the other hand, achieving a predictable implementation on next-generation processors will be rather difficult. These developments will increase both the importance and the complexity of finding a compromise. Therefore, the design of flight controllers will require more automatic methods and tools.
II. CASE STUDY: LONGITUDINAL FLIGHT CONTROLLER
We consider the longitudinal motion of a medium-range civil aircraft in en-route phase, specifically the cruise and change of cruise level subphases [9]. During the cruise subphase, the autopilot maintains a constant altitude h (actually a specific flight level FLxxx²) while the autothrottle (A/T) maintains the airspeed V_a. During a change of cruise level subphase (i.e. a step climb), the autopilot commands a constant vertical speed V_z (rate of climb) until the new flight level is captured. These changes of flight level are mainly made for fuel economy reasons; the flight management system (FMS) executes step climbs of 2000 ft, or even 4000 ft, when appropriate (e.g. FL300 → FL320 → FL340 → FL360, Figure 2).
Figure 2. Step climbs in en-route phase
A. Recap on flight control system design
The electronic flight control system remains a challenging part of an aircraft design. As the aircraft dynamics vary significantly within its flight envelope³, a single static controller is generally insufficient to ensure stability and performance on the whole operating domain. To this end, the controller must somehow "evolve" with the flight condition [9]. For decades now, engineers have resorted to gain-scheduling techniques to design electronic flight control systems [10], [11]. Essentially, the gain-scheduling approach consists of choosing a finite set of operating points (i.e., flight conditions) distributed throughout the flight envelope and designing a corresponding set of linear controllers to locally achieve stability and performance. Afterwards, to fully cover the operating domain, these linear controllers are interpolated with scheduling variables representative of the flight condition. The overall stability and performance are finally assessed by different mathematical methods and extensive time and frequency validations (e.g., Monte-Carlo method). This is nevertheless out of the scope of the paper.
As the common practice in automatic control is to design continuous-time (analog) controllers, the flight control laws must be discretized in order to be implemented on the onboard computers. This implies the choice of adequate sampling periods. First, the sampling period must be shorter than the system delay margin, that is the maximum pure delay that the system can withstand before destabilizing. Moreover, to preserve performance, one should ideally choose a sampling period short enough to reproduce as closely as possible the behaviour of the continuous-time controller.
B. Description of the case study
The case study is a multi-periodic extension of the mono-periodic longitudinal control of Gervais et al. [12]. A simple yet representative longitudinal flight controller has been designed in the MATLAB/SIMULINK environment for the flight condition (h = 10000 m, V_a = 230 m/s), which corresponds to an average cruise condition. The controller was verified in the continuous-time domain by studying the behaviour of the aircraft in the neighbourhood of this flight condition. However, the controller is not scheduled, meaning that it is likely to perform poorly far from this flight condition.
The SIMULINK scheme in Figure 3 is actually the discretization of our original SIMULINK scheme. It is divided into two parts: on the one hand, the Environment Simulation part represents the real system that is to be controlled, that is the aircraft as well as the engines and elevators, and, on the other hand, the Controller part gathers the control loops (altitude_hold, Vz_control, Va_control) as well as the filters. The goal of the longitudinal flight controller is to accurately track altitude, vertical speed and airspeed commands (resp. h_c, V_zc and V_ac). The airspeed control is handled by the Va_control loop, which maintains or tracks the desired airspeed V_ac. The altitude control is split into two stages: an altitude command h_c is first translated into a vertical speed command V_zc by the altitude_hold loop, and the Vz_control loop then tracks V_zc. During a step climb, the controller logic is as follows: a constant vertical speed command (V_zc = 2.5 m/s) is first imposed so that the aircraft gains altitude; then, within 50 m of the target flight level, the controller switches back to the altitude hold function to capture the commanded altitude and to travel the last meters. This ensures a climb at a low constant flight path angle, so that the passengers do not experience any discomfort. Without this logic, a very steep climb could result from a direct altitude demand.
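The step-climb logic described above amounts to a simple mode selection. The sketch below is illustrative only: the function name, its signature and the symmetric handling of a descent are assumptions, while the 2.5 m/s command and the 50 m capture band come from the text.

```python
def vertical_speed_command(h, h_c, vz_altitude_hold,
                           vz_climb=2.5, capture_band=50.0):
    """Select the vertical speed command V_zc (m/s).

    h                : current altitude (m)
    h_c              : commanded altitude (m)
    vz_altitude_hold : command computed by the altitude_hold loop (m/s)
    Far from the target, a constant vertical speed is imposed; within the
    capture band, the altitude hold output is used. Handling a descent with
    the opposite sign is an assumption, not stated in the text.
    """
    if abs(h_c - h) > capture_band:
        return vz_climb if h_c > h else -vz_climb
    return vz_altitude_hold


# Usage: far from the target, the constant climb rate is selected
far = vertical_speed_command(10000.0, 11000.0, vz_altitude_hold=0.3)
# Within 50 m of the target, the altitude hold command takes over
near = vertical_speed_command(10960.0, 11000.0, vz_altitude_hold=0.3)
```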
The considered outputs are listed in Tab. I and are measured by dedicated sensors. The sensors are modelled as low-pass filters with a bandwidth reflecting the nature of the measured signals. Control engineers do not resort to numerical integration to discretize their controllers. Indeed, from a programming perspective, it is inconvenient to implement a controller with a numerical integration routine such as a Runge-Kutta method. Moreover, the discretization must preserve the frequency-domain characteristics as much as possible, so that the performance and stability requirements are still met. Therefore, dedicated techniques [13] other than numerical integration are used to convert a continuous-time controller K(s) into its discrete-time version K(z); these techniques all lead to difference equations. Among these techniques, the bilinear transformation (also known as Tustin's method) and the zero-order hold method are the most popular ones. Moreover, filters with specific properties (e.g., bandwidth) can be designed directly in the digital domain.
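As an illustration of how such a technique leads to a difference equation, the sketch below applies the bilinear (Tustin) substitution s = (2/T_s)(z − 1)/(z + 1) to a first-order low-pass filter 1/(τs + 1). The filter order, the time constant value and the function names are assumptions chosen for the example; they are not taken from the case study.

```python
def tustin_lowpass(tau, ts):
    """Coefficients (p, b) of y[k] = p*y[k-1] + b*(u[k] + u[k-1]),
    obtained from 1/(tau*s + 1) with the bilinear transformation."""
    a = 2.0 * tau / ts
    return (a - 1.0) / (a + 1.0), 1.0 / (a + 1.0)


def filter_signal(u, tau, ts):
    """Run the difference equation over the input samples u."""
    p, b = tustin_lowpass(tau, ts)
    y, y_prev, u_prev = [], 0.0, 0.0
    for uk in u:
        yk = p * y_prev + b * (uk + u_prev)
        y.append(yk)
        y_prev, u_prev = yk, uk
    return y


# Usage: hypothetical 0.1 s time constant, 100 Hz sampling, unit step input
p, b = tustin_lowpass(tau=0.1, ts=0.01)
response = filter_signal([1.0] * 2000, tau=0.1, ts=0.01)
```

The unit DC gain of the continuous filter is preserved, since p + 2b = 1.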
Rate choices
The closed-loop system with the continuous-time controller can roughly tolerate a pure time delay of 1 s before destabilizing. The sampling period must therefore be chosen shorter than 1 s (1 Hz rate). Nevertheless, as we are interested in preserving not only stability but performance as well, the sampling period should be much shorter, for instance 100 ms (10 Hz rate). Considering realistic rates used in industry, the three controller blocks are first discretized with a 20 ms sampling period (50 Hz rate), whereas the filters work at a rate of 100 Hz to feed the data. Finally, as the environment (aircraft+elevator+engine) is supposed to model a continuous-time phenomenon, a higher rate of 200 Hz is used.
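The rate choices above can be summarised numerically. The following sketch simply converts the three rates into sampling periods and checks them against the 1 s delay margin; the dictionary keys and function names are illustrative.

```python
DELAY_MARGIN_S = 1.0  # pure delay tolerated by the closed loop (from the text)

# Rates of the three groups of blocks, in Hz (from the text)
RATES_HZ = {"environment": 200, "filters": 100, "controllers": 50}


def period_s(rate_hz):
    """Sampling period in seconds for a given rate in Hz."""
    return 1.0 / rate_hz


def respects_delay_margin(period, margin=DELAY_MARGIN_S):
    """A sampling period is admissible only if it is below the delay margin."""
    return period < margin


PERIODS_S = {name: period_s(rate) for name, rate in RATES_HZ.items()}
# environment: 5 ms, filters: 10 ms, controllers: 20 ms
```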
C. Validation objective
The design process first focuses on the internal V_a and V_z loops (resp. Va_control and Vz_control blocks). We analyse the properties P1 to P4 for separate step demands in V_a and V_z. Moreover, the two outputs should be decoupled, that is, a demand in V_a should only slightly affect V_z, and vice versa. The steady-state error (P4) for the decoupled approach is considered as correct. This property is however analysed on a step climb. Figure 5 illustrates a step climb of 1000 m requested at t = 50 s. During the first 50 seconds, the aircraft maintains an altitude of 10 km and an airspeed of 230 m/s. As the new commanded altitude (11 km) is more than 50 m above the current altitude, the autopilot first commands a constant vertical speed of V_zc = 2.5 m/s (top right). The aircraft begins its ascent (top left) at constant vertical speed (bottom right). At 10950 m (t = 437 s), the controller logic switches back to altitude hold mode and smoothly brings the aircraft to 11000 m with a very slight overshoot. During the whole maneuver, the airspeed V_a stays around 230 m/s as desired.
III. IMPLEMENTATION
This section describes the coding in C+PRELUDE. We illustrate how the control and computer engineers can interact in order to simplify the integration. This can be reached by investigating several multi-periodic configurations where variations are made on the frequencies and the precedence constraints.
A. Coding of the basic blocks
Each basic block is manually translated as C code in order to obtain a simple coding and a complete traceability. However, any automatic translation could work as long as the code can be parametrized by the sampling period T s . Note that we use the same discretization methods as those selected in the SIMULINK model.
The components of the Environment Simulation are discretized with the Forward Euler method, as it has a much simpler form than any other integration method. Moreover, the results are sound with this approach. The sampling period T_s is explicitly represented by a fixed integration step ∆ = 0.005 s (i.e. 5 ms, for 200 Hz).
The discretization of the three controllers is simple, as the only dynamic element is an integrator 1/s, which is usually discretized with the forward difference T_s/(z − 1). Therefore, the sampling period T_s is explicit.
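The forward-difference integrator T_s/(z − 1) corresponds to the difference equation y[k+1] = y[k] + T_s·u[k], in which the sampling period appears as a plain parameter. A minimal sketch follows; the class name and interface are assumptions and do not reproduce the C code of the case study.

```python
class ForwardIntegrator:
    """Discrete integrator Ts/(z - 1): y[k+1] = y[k] + Ts * u[k]."""

    def __init__(self, ts, initial=0.0):
        self.ts = ts          # sampling period, explicit parameter
        self.state = initial  # y[k]

    def step(self, u):
        """Return y[k], then update the state to y[k+1]."""
        out = self.state
        self.state += self.ts * u
        return out


# Usage: integrate a constant input of 1.0 during 1 s at 50 Hz (Ts = 20 ms)
integ = ForwardIntegrator(ts=0.02)
outputs = [integ.step(1.0) for _ in range(50)]
```

Changing the rate of a controller then only requires passing a different value of `ts`.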
The discretization method that we used for the filters is the zero-order hold approximation. As before, the discrete models are implemented as difference equations, the coefficients of which depend on the sampling period T s . The relationship between coefficient and T s is complex, which means that the discretized filter must be computed again for a new choice of sampling period.
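To illustrate why the coefficients must be recomputed, consider the zero-order hold discretization of a first-order low-pass filter 1/(τs + 1), which gives y[k+1] = α·y[k] + (1 − α)·u[k] with α = exp(−T_s/τ). The first-order form and the value of τ are assumptions made for the example; the filters of the case study may have a different structure.

```python
import math


def zoh_lowpass_coeffs(tau, ts):
    """Zero-order hold coefficients (alpha, beta) of 1/(tau*s + 1)."""
    alpha = math.exp(-ts / tau)
    return alpha, 1.0 - alpha


def lowpass_step(y, u, coeffs):
    """One step of y[k+1] = alpha*y[k] + beta*u[k]."""
    alpha, beta = coeffs
    return alpha * y + beta * u


# Usage: hypothetical tau = 0.1 s, filter run at 100 Hz and then at 50 Hz
c100 = zoh_lowpass_coeffs(0.1, 0.01)
c50 = zoh_lowpass_coeffs(0.1, 0.02)

y = 0.0
for _ in range(500):
    y = lowpass_step(y, 1.0, c100)
```

Doubling the sampling period squares α instead of scaling it, so the coefficients cannot be obtained by a simple proportional adjustment.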
B. Coding of the assembly
The multi-periodic code generated by the SIMULINK toolbox contains too many implicit choices. A better suited approach should provide (1) explicit description (e.g. of the communication) and (2) independence with regards to the real execution, in the sense that the functional results must always be the same for a given input whatever the real-time execution applied by the executive layer.
PRELUDE [5]⁴ is a formal language designed for this purpose. It belongs to the category of synchronous data-flow languages [14] and focuses on the real-time aspects of multi-periodic systems. From a PRELUDE program, the compiler generates a set of dependent periodic tasks that preserves the semantics of the original program. Preserving the semantics guarantees that consuming task instances always use data produced by the correct producing task instance. This property is ensured thanks to two mechanisms: first, precedence encoding enforces that a consuming task cannot execute before the end of the producer, and second, a buffer-based communication protocol similar to [15] is implemented.
PRELUDE reuses many concepts from the synchronous dataflow language LUSTRE [16] . The variables and expressions of a program denote infinite sequences of values called flows. Each flow is accompanied with a clock, which defines the instant during which each value of the flow must be computed. PRELUDE follows a relaxed synchronous hypothesis (introduced by [17] ), which states that computations must end before their next activation. A program consists of a set of equations, structured into nodes. The equations of a node define its output flows from its input flows. It is possible to define a node that includes several subnode calls executed at different rates.
PRELUDE aims at integrating functions that have been programmed in another language. These imported functions must first be declared in the program. All the basic blocks of the case study are in particular declared as imported nodes. The syntax is the following:
imported node Va_filter(Va: real) returns (Va_f: real) wcet X
It consists of the signature of the node (type and number of inputs and outputs) with a WCET. At this stage we may not know this value, so we keep it undetermined. Imported node calls follow the usual data-flow semantics: an imported node cannot start its execution before all its inputs are available, and it produces all its outputs simultaneously at the end of its execution. Those imported nodes become the tasks populating the task set.
⁴ The PRELUDE compiler is available for download at http://www.lifl.fr/~forget/prelude.html
PRELUDE adds real-time primitives to the synchronous data-flow model. Those operators can decelerate, accelerate or offset flows. Real-time operators are formally defined using strictly periodic clocks. A strictly periodic clock is a sequence of instants that can be defined as a pair (period, offset). The basic clock, defined as (1, 0), is the fastest clock and all strictly periodic clocks are derived from it. We choose a basic clock with a period of 100 µs. This choice is left to the integrator: it must take into account the real-time attributes of the tasks and the performance of the multi/many-core target. In our case, the WCETs are very low (a couple of µs), communication on the NoC takes less than 35 µs (if the mapping respects some rules detailed in Section IV), and the periods must be multiples of the basic clock. Therefore the tightest basic clock is 50 µs. Any other basic clock must be greater than 50 µs (imposed by the platform) and must be a divisor of 5000 µs (imposed by the application). It can be useful to reduce the basic clock if the schedule fails, since WCETs are expressed as multiples of the basic clock.
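The constraints on the basic clock and the effect of the rate operators can be sketched as follows. This is a simplified model, not the PRELUDE compiler: clocks are represented by their period only (offsets are ignored), and all function names are assumptions.

```python
def basic_clock_candidates(lower_bound_us=50, app_period_us=5000):
    """Basic clocks (in µs) at least as large as the platform bound that
    divide the shortest application period (5000 µs for the 200 Hz tasks)."""
    return [d for d in range(lower_bound_us, app_period_us + 1)
            if app_period_us % d == 0]


def decelerate(period_ticks, k):
    """Model of flow/^k: the period is multiplied by k."""
    return period_ticks * k


def accelerate(period_ticks, k):
    """Model of flow*^k: the period is divided by k (must divide exactly)."""
    if period_ticks % k != 0:
        raise ValueError("accelerated period must be an integer number of ticks")
    return period_ticks // k


def frequency_hz(period_ticks, basic_clock_us=100):
    """Frequency of a clock whose period is given in basic-clock ticks."""
    return 1e6 / (period_ticks * basic_clock_us)


candidates = basic_clock_candidates()

# With a 100 µs basic clock, a 10 Hz input has a period of 1000 ticks.
# Accelerating it by 5 gives 200 ticks, i.e. 50 Hz.
accelerated = accelerate(1000, 5)
# A 200 Hz flow (50 ticks) decelerated by 2 gives 100 ticks, i.e. 100 Hz.
decelerated = decelerate(50, 2)
```

Under these assumptions, 100 µs is one of the admissible basic clocks and 50 µs is the smallest one.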
The reference inputs h_c and V_ac become input flows of the node. Those inputs are assumed to arrive at 10 Hz (cf. Figure 3) and are therefore associated with the clock (1000, 0). The assembly expressed below is equivalent to the SIMULINK design. The commands produced by the controller, δ_ec and δ_thc, become outputs of the node. It is mandatory to express the rate of the inputs, while the rate of the outputs is inferred by the compiler. All other variables are declared as intermediate flows. Then, after the keyword let, all the equations are written. The first one, h_f = h_filter(h/^2), states that the node h_filter produces the variable h_f and consumes the flow h/^2, which is the deceleration by 2 of the flow h. Since h is produced by the node aircraft_dynamics, we can deduce that h_filter runs at half the rate of aircraft_dynamics. This respects the proportionality of the frequencies in Figure 3, where aircraft_dynamics is at 200 Hz and h_filter at 100 Hz. We can compute the clock of h_filter from the sixth equation: V_zc = altitude_hold(h_f/^2, h_c*^5). The node altitude_hold consumes h_c*^5; thus it runs 5 times faster than the input h_c. Since the node altitude_hold also consumes h_f/^2, we deduce that altitude_hold executes at half the rate of h_filter. We deduce from the equations the following relationship (compliant with Figure 3
C. Variations in the assembly to relax real-time constraints
This first specification is very constrained in terms of real-time and the integration effort will be high. Indeed, we have reduced the frequencies by implementing a multi-periodic controller instead of designing a mono-periodic one that runs at 200 Hz. This decreases the CPU usage, but the precedence constraints imposed by the communication force several tasks to execute in a very short interval. For instance, the functional chain h → h_f → V_zc → δ_ec → δ_e produces the execution shown in Figure 6. The sequence of tasks aircraft_dynamics, h_filter, altitude_hold, Vz_control, elevator must execute in less than 5 ms every 20 ms.
Relaxing precedences. To reduce the integration effort, the precedence pattern of Figure 7 is better suited. Indeed, the system becomes more parallelisable and/or the WCET of tasks with lower frequency can be increased. The only difference in the PRELUDE coding of the assembly (named assemblage_v2 in that case) lies in the two equations:
T = engine((1.640222296162316 fby delta_th_c) *^4);
delta_e = elevator((0.018645918123716
Note that producing the same behaviour in SIMULINK is rather complex. Note also that breaking the dependencies is done with the fby operator, borrowed from LUCID SYNCHRONE [18], which delays a flow by one period.
Relaxing periods. As illustrated previously, modifying the assembly is very easy in PRELUDE. Since it is also simple to generate the traces, the control engineer can easily verify whether the controller performance requirements are fulfilled. This makes room for testing many solutions.
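The effect of the fby operator on a flow can be mimicked with a generator: the constant on the left provides the value for the first instant, after which the flow on the right is replayed one period late. This is a sketch of the usual semantics of fby in synchronous data-flow languages, written in Python for illustration; the function is hypothetical and is not part of PRELUDE.

```python
from itertools import chain


def fby(init, flow):
    """Model of 'init fby flow': init first, then flow delayed by one instant."""
    return chain([init], flow)


# Usage: the consumer at instant n sees the value produced at instant n - 1,
# so it no longer has to wait for the producer within the same period.
delta_th_c = [1.70, 1.75, 1.80]           # hypothetical command values
seen_by_engine = list(fby(1.640222296162316, delta_th_c))[:3]
```

This one-period delay is what removes the precedence constraint between the producer and the consumer.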
In this paper, we try three more assemblies that reduce periods and precedences. The next assembly (named assemblage_v3) reduces some frequencies: h_filter, Va_filter and Vz_filter are at 50 Hz and altitude_hold is at 10 Hz. For the three filters, we have to change the rate of the input flow as follows: In this version, dependencies are direct as in Figure 6, while assemblage_v4 reuses those of Figure 7. In assemblage_v41, we changed the dependencies while keeping the frequencies. In assemblage_v5, we put Vz_control and Va_control at 25 Hz.
D. Validation of the coding
All the previous assemblies have been simulated on a PC using the SCHEDMCORE framework. To do so, we rely on the simulator lscm_run-nort provided with SCHEDMCORE⁵. To assess the correctness of the various assemblies, we compared their time responses to the same simulated step climb. Figure 8 superimposes the results of SIMULINK (Figure 5) and those of all the assemblies for the same autopilot instructions. Overall, there is no significant degradation and only the settling time is slightly impacted by the modifications. Indeed, we vary the rates and precedences, which impact the response times but not the functional values (no significant impact on properties P2 and P3). Finally, we observe that the controller is robust to the relaxations. Since all results are satisfactory, the different assemblies can be integrated on the target and assessed at this level.
IV. EXPERIMENTS
This section illustrates the feasibility of implementing the parallel flight controller on a multi/many-core. This shows that the abstract hypotheses made at the SIMULINK and the PRELUDE levels are reasonable. In particular, the basic clock, the relaxed synchronous hypothesis and the communication protocol can be implemented. We therefore describe in this section the porting of the code on the TILERA TILEMPOWERGX-36. We impose an execution model to enforce predictability on the platform in order to respect as much as possible the safety avionic standards.
A. TILERA description
The platform is equipped with a tiled microprocessor composed of 36 tiles. Each tile can communicate with other tiles, the peripherals and the external memory through a network on chip (NoC). The external memory is composed of 32 GBits DDR3 accessible through memory controllers. The grid is a 6x6 matrix of tiles as shown in Figure 9 (extracted from the TILERA documentation [19]).
Figure 9. Scheme of the tile grids
Each tile is composed of the following elements [19]: (1) a single core clocked at 1.2 GHz that owns two levels of cache (L1I-32KB, L1D-32KB, unified L2-256KB), (2) a switch that manages the communication over the network on chip, (3) a local clock accessible through the TILERA API get_cycle_count. The NoC is composed of five full-duplex sub-networks, each devoted to a particular type of exchange. The Shared Dynamic Network (SDN) is the one used for exchanging data between tiles. All communications with the external memory go through the reQuest Dynamic Network (QDN) for write requests and through the Response Dynamic Network (RDN) for read requests. Three execution environments are provided with the platform: (1) standard SMP LINUX environment, (2) Zero Overhead LINUX (ZOL) or (3) bare-metal. The TILERA platform is not built for real-time systems but for high-performance or networking applications. However, even if the most suitable environment is bare-metal, ZOL offers rather promising real-time predictability. The main features of ZOL [20, Chapter 7] are:
1) processors are isolated from interrupts, like the shielding approach promoted by [21]. The operating system no longer interferes in the execution unless the application itself makes a system call. There is no interrupt handler;
2) there is a unique thread per processor. This ensures an applicative isolation of the CPU and local cache resources;
3) a complete TILERA configuration can mix tiles in ZOL, in LINUX and in bare-metal.
In our experiments, a unique core is under LINUX to boot the chip and all other tiles are in ZOL, in particular those hosting the application.
Memory management. All environments support shared memory with builtin hardware coherency which may be disabled.
For our experiments, we keep the shared memory active and we use one of the policies offered by TILERA for storing shared variables. The cache homing policy alleviates the workload on the DDR by allocating each shared variable to a home tile. When tile t reads the variable, it first looks in its own caches; on a miss, instead of fetching the variable directly from the RAM, it asks the variable's home tile. If the home tile has the data in its caches, it sends it directly to the requesting tile t. Otherwise, it is the duty of the home tile to fetch the variable from the RAM and then to send it to t. Writing also works by interacting with the home tile and invalidating the local caches containing the old value. Such a pattern of exchanges is illustrated in Figure 10, extracted from the TILERA documentation [20, §6].
Stressing benchmarks. We ran several benchmarks to assess the predictability of the TILERA platform. We first analysed the impact on the execution times when several tiles concurrently access the shared resources, such as the DDR, the local caches and the NoC. We particularly focused on the time to access the local clock and to read/write data with the shared memory policies. From the experimental observations, the mean time to access the local clock is 60 ns and the maximum time is 400 ns. The maximal value is rarely observed, around once every 10000 accesses, but we need to consider this value as the worst case. There is a real impact on the read and write access times when the number of concurrent tiles exceeds some bounds. Below these bounds, the times are low and repeatable. Above the bounds, a memory access can be delayed by more than a second. To determine the bounds, we apply nearly the same stressing benchmarks as for the Intel SCC [22]. An example of a benchmark is the following: several tiles (from 2 to 36) modify a shared variable hosted by a home tile and we measure the write access times for each writer, including the home tile.
Finally, we can deduce some rules on the mapping to avoid performance and predictability drops: (1) no more than 10 tiles must simultaneously write to the same [shared] cached memory location, (2) no more than 5 tiles must simultaneously write to the DDR, (3) no more than 30 tiles must simultaneously read the same [shared] cached memory location.
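These three rules lend themselves to an automatic check of a candidate mapping. The sketch below is one possible formulation, in which the caller states, for each shared location, which tiles may access it simultaneously; the data layout and function name are assumptions, while the three bounds come from the rules above.

```python
MAX_CACHE_WRITERS = 10  # rule (1)
MAX_DDR_WRITERS = 5     # rule (2)
MAX_CACHE_READERS = 30  # rule (3)


def check_mapping(writers_per_location, readers_per_location, ddr_writers):
    """Return a list of rule violations (empty if the mapping is acceptable).

    writers_per_location / readers_per_location: dict mapping a shared
    variable name to the set of tiles that may access it simultaneously.
    ddr_writers: set of tiles that may write to the DDR simultaneously.
    """
    problems = []
    for location, tiles in writers_per_location.items():
        if len(tiles) > MAX_CACHE_WRITERS:
            problems.append(f"too many writers on {location}: {len(tiles)}")
    if len(ddr_writers) > MAX_DDR_WRITERS:
        problems.append(f"too many DDR writers: {len(ddr_writers)}")
    for location, tiles in readers_per_location.items():
        if len(tiles) > MAX_CACHE_READERS:
            problems.append(f"too many readers on {location}: {len(tiles)}")
    return problems


# Usage: one producer per variable and a handful of readers is acceptable
ok = check_mapping({"h": {1}}, {"h": {2, 3}}, ddr_writers=set())
# Eleven simultaneous writers on the same location violate rule (1)
bad = check_mapping({"h": set(range(1, 12))}, {}, ddr_writers=set())
```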
B. Real-time implementation
To start with, the integrator must first assess the WCET of each task. Then, he/she must define an adequate execution model. Finally, a dispatcher compliant with the execution model must be developed for the TILERA TILEMPOWERGX-36.
1) WCET assessment: No static WCET analysis tool, such as ABSINT [23] or OTAWA [24], is available for the TILERA platform. Therefore, we used a measurement-based approach, which is not safe in general, but we could hardly do better at this stage. The evaluation was done on each task, running in sequence and in isolation. We measured the execution time by surrounding the task call with two local clock reads. Because of the variability of the local clock access, and to improve the reliability, we added some margin to the observed execution times.
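The measurement procedure can be sketched as follows. This is an illustration only: `time.perf_counter_ns` stands in for the local clock read of the platform, the margin value is an arbitrary assumption, and rounding up to the 100 µs basic clock chosen in Section III-B is shown as one way of expressing the result in the unit expected by PRELUDE.

```python
import math
import time


def measure_us(task, runs=100):
    """Execution times (µs) of a task surrounded by two clock reads."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()   # stand-in for the local clock read
        task()
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return samples


def wcet_estimate_us(samples_us, margin_us, basic_clock_us=100):
    """Largest observed time plus a margin, rounded up to a whole number of
    basic-clock ticks. A measurement-based value is not a safe bound."""
    observed = max(samples_us)
    ticks = math.ceil((observed + margin_us) / basic_clock_us)
    return ticks * basic_clock_us


# Usage with hypothetical measurements of a few microseconds
estimate = wcet_estimate_us([2.1, 3.4, 2.8], margin_us=1.0)
samples = measure_us(lambda: sum(range(100)), runs=10)
```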
It was decided in Section III-B that the basic clock should run at 100 µs. Therefore, WCETs must be expressed as multiples of 100 µs. Moreover, inputs and outputs in PRELUDE are treated as sensors and actuators. This entails that they must be associated with a WCET. In our case, it could correspond to the delays generated by the bus between sensors, computer and actuators. We imposed those values. WCETs are given in Tab. III.
2) Execution model: Since we measured task execution times as if the tasks were sequential code running in isolation, we must use an execution model that fulfills those hypotheses. First, the schedule must be non-preemptive to respect the sequential execution. Migration is best avoided in order to reduce unexpected interrupts. Partitioning also improves predictability, since it permits a MIMD ("multiple instruction, multiple data") approach where the created binaries are specific to particular cores.
We tested two mappings, excluding core 0, which is dedicated to initialization. Those mappings have been chosen manually. A valid mapping could also be computed using a constraint programming approach or a dedicated heuristic.
• map1: exactly one task on a tile;
• map2: grouping tasks with the same period on the same tile.
For the assembly with the strongest precedences, we obtain the schedule shown in Figure 11.
Figure 11. Off-line schedule - map2
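The two mappings can be expressed as simple functions from a task set to tile indices. The sketch below is illustrative only: the mappings of the case study were chosen manually, and the task names and periods used in the usage example are hypothetical placeholders, not the actual ROSACE task set.

```python
def map1(tasks, first_tile=1):
    """Exactly one task per tile; tile 0 is left free for initialization.

    `tasks` maps a task name to its period (any consistent time unit).
    """
    return {name: first_tile + i for i, name in enumerate(sorted(tasks))}


def map2(tasks, first_tile=1):
    """Group the tasks that share a period on the same tile."""
    periods = sorted(set(tasks.values()))
    tile_of_period = {p: first_tile + i for i, p in enumerate(periods)}
    return {name: tile_of_period[period] for name, period in tasks.items()}


if __name__ == "__main__":
    # Hypothetical task set: names and periods are placeholders.
    example = {"filter_a": 10_000, "filter_b": 10_000, "control_a": 20_000}
    print(map1(example))
    print(map2(example))
```

With map1 the number of tiles grows with the number of tasks, whereas map2 uses one tile per distinct period, which is why the resulting off-line schedules differ.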
Ensuring isolation is much more challenging. We promote the storage of code and data in the local caches as in [7], [25]. This prevents unexpected interactions between application tasks. The tasks of the case study are small enough to fit in the caches. If this were not the case, the designer would have to decompose the code, if possible, into smaller pieces. Otherwise, a task that overflows the caches can run concurrently with locally stored tasks but not with other tasks that overflow the caches. Communication is done via shared memory, which contradicts the isolation hypothesis. However, since the mapping respects the bounds highlighted in Section IV-A, we can assume that the effect on the execution times is negligible. Note that data could instead be exchanged by message passing over the User Dynamic Network (UDN), which performs well for small data. The home tile associated with a datum produced by task t is the tile where task t executes.
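The rule on tasks that overflow the caches can be checked on an off-line schedule. The following sketch assumes a hypothetical job description `(name, tile, start, end, overflows_cache)` with half-open execution intervals; it is not taken from the case study, in which all tasks fit in the caches.

```python
from itertools import combinations


def conflicting_overflow_pairs(jobs):
    """Return pairs of cache-overflowing jobs that overlap in time.

    `jobs` is a list of (name, tile, start, end, overflows_cache) tuples,
    with each job occupying the half-open interval [start, end).
    Jobs that fit in the local caches are never reported, since they may
    run concurrently with any other job.
    """
    overflowing = [job for job in jobs if job[4]]
    conflicts = []
    for a, b in combinations(overflowing, 2):
        on_different_tiles = a[1] != b[1]
        overlap_in_time = a[2] < b[3] and b[2] < a[3]
        if on_different_tiles and overlap_in_time:
            conflicts.append((a[0], b[0]))
    return conflicts
```

A schedule satisfies the rule when the returned list is empty.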
The PRELUDE semantics also requires precedence constraints between tasks to be enforced. To fulfill this constraint, we chose a tick-based approach, that is, scheduling decisions are taken only at discrete instants of a chosen granularity. We reuse the tick gap introduced in [22] in order to cope with the imperfect synchronization of local clocks and with communication delays. The idea consists in leaving a gap between a job's termination and the beginning of the next tick. To do so, we add a gap to each WCET; in our case, the gap is 550 ns, which accounts for 35 ns (communication delay) + 500 ns (clock precision).
The WCETs given in Tab. III already contain the gaps.
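The bookkeeping described above (adding the gap to an execution time, then expressing the result as a multiple of the 100 µs basic clock) amounts to a ceiling division. The sketch below is a worked example under the assumption that the gap is added before rounding up to the tick; the function name `budget_in_ticks` is hypothetical.

```python
TICK_NS = 100_000  # 100 µs basic clock, as stated in the text
GAP_NS = 550       # tick gap, as stated in the text


def budget_in_ticks(measured_ns, gap_ns=GAP_NS, tick_ns=TICK_NS):
    """Number of ticks reserved for a job.

    The gap is added to the measured execution time (with its margin), and
    the sum is rounded up to a whole number of ticks. Adding the gap before
    rounding is an assumption of this sketch.
    """
    if measured_ns < 0:
        raise ValueError("execution time must be non-negative")
    total_ns = measured_ns + gap_ns
    return -(-total_ns // tick_ns)  # ceiling division on integers
```

For instance, a measured time of 99 000 ns fits in one tick once the gap is added (99 550 ns), whereas 99 500 ns requires two ticks (100 050 ns).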
3) Dispatcher implementation: The local clocks of the TILERA TILEMPOWERGX-36 are synchronous (i.e. there is no clock drift between the local clocks) but they are not perfectly synchronized because the cores do not boot at the same time. The offsets between the cores are not handled by the hardware and it is up to the user to manage synchronization if needed. We encountered the same problem on the Intel Single-chip Cloud Computer (SCC). The SCC bare-metal library we developed provides means to synchronize the core local clocks with a precision of 4 µs [22]. For the TILERA, we used another synchronization algorithm, based on a barrier, that leads to a precision of 0.5 µs. This bound can hardly be reduced because of the worst-case time required to read the local clock, although the precision observed in general is 50 ns. The synchronization algorithm works as follows:
• N shared variables are homed on tile 0.
• 1 shared variable is homed on each tile.
• When tile i starts, it sets the N variables to 0. Then, the tile actively waits: as long as it has not received a value on its own variable, it continuously sets the i-th variable of tile 0 to 1.
• Tile 0 works differently. It sets all its homed variables to 0 and actively waits until all tiles are awake. When this occurs, it sets the variables hosted by the other tiles to 1.
• When a tile detects that its local variable is set to 1, it waits for 1 s using its local clock. After this second, it reads the current time. This value becomes its local offset.
• The shared global time is then computed as global time = local time − local offset.
Since no migration is allowed and since the task sequence is known in advance by every processor, we do not need timer interrupts for implementing the sequence. Each processor knows the static task sequence it has to run. When a processor needs to wait for the next cycle, it performs a busy wait (spin).
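The dispatching principle described above (a static task sequence per tile, a global time derived from the local clock and the local offset, and busy waiting instead of timer interrupts) can be sketched as follows. This is an illustrative model, not the dispatcher of the case study: the class name `TileDispatcher`, its parameters and the use of `time.monotonic_ns` as a stand-in for the tile's local clock are assumptions.

```python
import time


class TileDispatcher:
    """Runs a static, non-preemptive task sequence on one tile."""

    def __init__(self, sequence, tick_ns, local_offset_ns,
                 local_clock=time.monotonic_ns):
        # `sequence` is a list of (start_tick, task) pairs fixed off-line.
        self.sequence = sorted(sequence, key=lambda entry: entry[0])
        self.tick_ns = tick_ns
        self.local_offset_ns = local_offset_ns
        self.local_clock = local_clock

    def global_time_ns(self):
        # global time = local time - local offset
        return self.local_clock() - self.local_offset_ns

    def run_cycle(self, cycle_start_ns):
        """Execute one cycle of the sequence, starting at a global instant."""
        for start_tick, task in self.sequence:
            release_ns = cycle_start_ns + start_tick * self.tick_ns
            while self.global_time_ns() < release_ns:
                pass  # busy wait (spin) until the release tick
            task()
```

Since every tile derives the same global time from its own offset, tiles running such a loop start their jobs at the agreed ticks without exchanging messages at run time, up to the synchronization precision.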
C. Results
We obtain almost the same results as those observed at the PRELUDE level. They are also superimposed in Figure 8.
V. RELATED WORK
Control command and real-time: The authors of [26], [27] proposed a flexible real-time control system where the scheduler uses feedback from execution-time measurements to adjust the periods in order to optimize performance. Such a solution is not possible on unpredictable targets such as multi/many-core processors. The authors of [28] accept that WCETs are not computable, due to new processor technologies. They analyse the end-to-end latencies of the control from sensors to actuators and show that some variability in execution times is acceptable.
Multi-periodic specification: Using a formal language for the description of dependent multi-rate task sets has been advocated by Baruah [29]. As already mentioned, SIMULINK [1] allows the description of multi-periodic systems in terms of blocks communicating through data-flows or events. However, expressing complex communication patterns is difficult. Modelica also offers the possibility to describe multi-periodic assemblies [30]. The semantics relies on the use of strictly periodic clocks as in PRELUDE. The authors of [31] described a possible way to write multi-periodic SCADE systems but the toolbox has not been extended to consider such an extension. The authors of [32] also consider a SCADE extension with finite state automata in order to express multi-periodic systems.
Compilation of multi-periodic specifications: Each basic component can be automatically translated into C code: with the SIMULINK toolbox; with the GENEAUTO toolbox 6; with a certified compiler as proposed in [33], [34]; or with an automatic SIMULINK to SCADE translator (as proposed in [35]), combined with the SCADE suite compiler. The multi-periodic configuration can be translated with SIMULINK, leading to a rate monotonic schedule. We can also mention the work of [36], which provides a sound semantics of SIMULINK operators. The authors of [37] focus particularly on modular aspects, which are complementary for an automatic translation.
The authors of [38] have implemented a multi-threaded compilation scheme for affine data-flow graphs, which also allow multi-periodic assemblies to be specified. The authors of [39] have developed a translation of SCADE programs to an OASIS implementation.
Concerning the predictable use of multi/many-core processors for hard real-time systems, there is a large body of literature; we can mention the surveys [40] and [41].
VI. CONCLUSION
We have experimented with the design of a parallel avionic longitudinal controller on a multi/many-core target. Through a series of experiments, we have illustrated what kind of discussion between control engineers and integrators can be leveraged in order to find a compromise between the constraints of both sides.
The case study will be extended with a comprehensive flight control system that will operate on the whole flight envelope. Additional flight control laws will be integrated to cover the different flight modes. Validation will also be expanded to allow the observation of other criteria and possibly to allow Monte-Carlo simulations.
Future work will consider more automatic translation from the SIMULINK model, for instance by reusing and extending an existing tool [34]. The PRELUDE language could also be improved by providing new features such as: (1) retrieving the strictly periodic clocks computed by the compiler in the imported node (indeed, the frequency of a discretized block has an impact on the code); (2) introducing the notion of "don't care" [42] in order to generate many assemblies. The authors of [43] use a MILP approach to determine where to introduce fby to break precedence constraints.
VII. ACKNOWLEDGEMENT

