We present the first end-to-end modeling and compilation flow to parallelize hard real-time control applications while fully guaranteeing the respect of real-time requirements on off-the-shelf hardware. It scales to thousands of dataflow nodes and has been validated on two production avionics applications. Unlike classical optimizing compilation, it takes as input non-functional requirements (real time, resource limits). To enforce these requirements, the compiler follows a static resource allocation strategy, from coarse-grain tasks communicating over an interconnection network all the way to individual variables and memory accesses. It controls timing interferences resulting from mapping decisions in a precise, safe, and scalable way. 24:2 K. Didier et al.
INTRODUCTION
Full automation is possible and needed in real-time scheduling. The implementation of complex embedded software relies on two fundamental and complementary engineering disciplines: real-time scheduling and compilation. Real-time scheduling covers the upper abstraction levels of the implementation process, which determine how a functional specification is transformed into [...]
Fully automating the mapping is difficult. The key difficulty of real-time scheduling is that timing analysis and resource allocation depend on each other. This difficulty is particularly acute in the case of safety-critical control systems, where certification regulations do not tolerate the weakening of timing characterizations in multi-processor environments; this is the context of our work.
Since an exhaustive search for the optimal solution is ruled out for complexity reasons, heuristic approaches are used to break this dependency cycle. Two such approaches are typical in real-time systems design.
The most common is to first build the system, and then check the respect of real-time requirements through a global analysis. Building the system uses unsafe timing information such as measurements, WCETs in isolation plus arbitrary margins, and the like. The second approach is to ensure by construction the respect of requirements. In this case, system construction uses task timing characterizations that are safe for all possible resource allocations (worst-case bounds). The drawback of the first approach is the lack of traceability between resource allocation decisions and timing analysis results. If the system does not respect its real-time requirements, mapping changes are needed, but these changes may also change the timing analysis, and so on, without guarantee of convergence to a solution. The second approach is much more appealing from an automation perspective and considering the fine-grained control it offers to platform engineers; its drawback is pessimism, as all resource allocations are made for the worst case.
So far, the practicality of the second approach has never been established. Automated real-time parallelization flows still rely on simplified hypotheses ignoring much of the timing behavior of concurrent tasks, communication, and synchronization code. And even with such unsafe [...]
- an application normalization phase, presented in Section 3.1, which facilitates the mapping of multi-periodic applications through hyper-period expansion, and which enforces a clear separation of inter-node data flow from internal side-effecting computations, making it possible to defer all memory allocation decisions to a coordinated memory management and scheduling phase;
- an original code generation algorithm, presented in Sections 4 and 5.3, including the generation of the synchronization code; it has been designed to provide mapping-independent worst-case execution time bounds, inaccessible to previous semaphore-based code generation approaches;
- new real-time scheduling algorithms capable of orchestrating memory allocation and scheduling, presented in Section 5; the main originality is to provide a safe accounting of memory access interferences through an incremental allocation and scheduling process; these algorithms are scalable, and they rely on list scheduling heuristics inspired from classical compilation.
Besides these technical contributions, we believe this article makes a strong case for a design automation approach combining the strengths of real-time scheduling and optimizing compilation. We validate the benefits of such a combined approach on two production avionics applications. This experimental validation stands out as a milestone in terms of the scale of these applications, and in terms of effectively solving the integrated resource allocation, scheduling, and timing analysis problem on a shared-memory multi-processor.
Outline. The remainder of the article is organized as follows: Section 2 reviews the most closely related work. Section 3 defines the mapping and code generation problem, presenting the functional and non-functional modeling formalism, the target execution platform, and the desired form of the implementation. Section 4 presents the timing model at the core of our method. The mapping and code generation algorithms are presented in Section 5. Section 6 presents experimental results, before the conclusion in Section 7.
[...] system satisfies its non-functional requirements (a process known as schedulability analysis) are performed on the completed implementation. The construction of the implementation uses incomplete/unsafe/unformalized versions of the timing analysis algorithms to guide its mapping decisions and code transformations. For instance, the construction of tasks and memory allocation are often guided by potentially unsafe WCET and/or memory footprint estimations, derived from previous experience and partial code analysis. Furthermore, significant parts of the implementation process remain to this day manual or unformalized in many industrial settings.
Recent advances have largely automated the construction of task code and the generation of real-time implementations on specific sequential or multi-core targets. Industrial solutions include Simulink Real-Time from MathWorks, and Scade KCG6 Parallel from ANSYS/Esterel Technologies [41]. Academic results in this direction include Refs [22], [27], [44], and [61]. However, none of these tools provides strong schedulability guarantees when integrating multiple synthesized tasks: separate timing and schedulability analysis must be performed after synthesis. Several methods have been proposed [40, 47, 49], but they come with the shortcomings of the first approach sketched in the introduction. In particular, in the event of a global non-schedulability diagnosis, it is difficult to pinpoint its source so as to guide subsequent re-engineering efforts.
A few approaches have gone further by letting timing analysis results guide mapping and code generation under simplifying hypotheses common in real-time scheduling, e.g., assuming that task WCET values include overheads related to parallel/concurrent execution. Among these approaches, we cite the industrial tool Asterios Developer from KronoSafe, based on the ΨC language [37], as well as the academic tools and toolboxes SynDEx [55], BIP [7], SchedMCore [52], Prelude [42], Lopht [14, 18], SigmaC [3, 10], the work on the time-triggered mapping of Lustre [15], Xoncrete [11], or the work of Baruah et al. on the synthesis of multi-core cyclic executives [23]. We refer the reader to Ref. [14] for a longer description.
While these methods guarantee correctness and have the potential of providing more feedback in case of non-schedulability diagnostics, the simplifying hypotheses are seldom (if ever) satisfied in practice. To our knowledge, they are satisfied only when using prototype hardware designed for predictability, e.g., by ensuring the absence of memory access interferences [12, 43]. But the hypotheses are never satisfied on modern off-the-shelf multi- and many-cores, where the overheads due to concurrent execution include contributions that may be difficult to estimate, depending on the hardware or software architecture of the system: memory access and bus access interferences, cache-related delays, synchronization costs, scheduler execution time, etc.
A step further is taken in Refs [50] and [53], where the tool itself adds the needed overheads to the WCET values. In Ref. [53], overheads are large (several hundred cycles per task), to account for the time-triggered execution mechanism where so-called monitors are used to dynamically update triggering dates. In Ref. [50], overheads are not even discussed, and no comparison with the sequential case without communication costs is given. More importantly, in both cases, the objective of the method is optimization, not implementation under constraints, as it is in our case.
Our parallelization method also adds the needed overheads to WCET values. However, unlike in Ref. [53], it is aimed at applications with fine-grain parallelism like our case studies, where excessive per-task overheads would result in significantly reduced parallelization potential. To keep overheads under control, we make strong hypotheses on the target execution platform, on the form of generated code, and on the integration of the various tools of the back-end. These hypotheses allow our tool-flow to perform a full-fledged timing and schedulability analysis incrementally during allocation and scheduling. Allocation and scheduling are performed jointly, using scalable compilation-like heuristics. The resulting schedule and code are correct by construction. If construction of the schedule is impossible, the partial mapping and schedulability analysis allow the engineer to pinpoint the immediate causes of the scheduling failure.
By covering all aspects of resource allocation and code generation, our work is clearly related to previous work on compilation. In previous work [13], we already noted and exploited the formal and algorithmic proximity between off-line real-time scheduling and various results on software pipelining for super-scalar and very large instruction word (VLIW) processors, where the scheduling burden is mostly supported by the compilers [1, 48]. What fundamentally differentiates our current work from previous compilation work is the choice of performing a safe, worst-case timing analysis incrementally during compilation.
This article does not provide advances on the complexity of real-time scheduling algorithms. Recent papers provided conflicting evidence, pro [20, 45] and contra [26], on the ability of methods based on constraint solving to find solutions to large-scale real-time scheduling problems. Like most others, the scheduling problem we address can be encoded as an integer linear program (ILP), a satisfiability modulo theories (SMT) problem, or a constraint program. However, we consider parallelization, real-time scheduling, memory allocation, and safe and precise timing analysis for very large-scale applications going much beyond the most complex constraint programming and complexity studies. Furthermore, we consider the use of low-complexity mapping heuristics a positive point, as it guarantees scalability.
PROBLEM STATEMENT AND OVERVIEW OF THE APPROACH
Our compilation problem is similar to that solved by a compiler and linker flow for a sequential imperative language: to produce correct executable code that statically orchestrates the machine resources, in a fully automatic and scalable way. But there are also important differences. Following long-standing practice in the avionics industry [28], the input program, also known as the functional specification, is provided in a dataflow synchronous language with a cyclic execution scheme; we use the Heptagon language [24, 29], introduced in Section 3.1. Also, the target low-level semantics is multithreaded with explicit resource allocation and mapping for communication and synchronization. Furthermore, as the example of Figure 1 shows, this program can be annotated with non-functional requirements the implementation must respect. In our example, annotations specify real-time requirements (period and deadline). The programming language and the non-functional annotations will be presented in Section 3.1.
We target shared memory multiprocessors with uniform memory access. To facilitate timing analysis, hardware and low-level libraries must satisfy a number of properties detailed in Section 3.2.1. For such hardware, we generate statically scheduled, statically allocated, bare metal code whose structure facilitates timing analysis, along the guidelines of Ref. [47] (further detailed in Section 3.2.2). The threads generated from our example for a dual-core target are presented in Figure 2. To allow compilation and execution, they must be accompanied by the boot code launching the threads, by the sequential code of the functions implementing the dataflow blocks f, g, and h, by the communication and synchronization library, and by a linker script enforcing memory allocation (provided in Figure 5).
To ensure that generated code is not only functionally correct, but also satisfies by construction the real-time requirements, we rely on the compilation flow of Figure 3. The front-end normalizes and simplifies the input program, bringing it to a form that satisfies the requirements of static single assignment (SSA) form [21]. In the back-end, the sequential code of the basic dataflow blocks (f, g, h in our example) is separately compiled and analyzed to determine their WCET, worst-case number of accesses to shared communication resources (memory banks), and memory footprint. This information is used in the parallel back-end, which performs real-time resource allocation and code generation, building the parallel threads of Figure 2 and the linker script of Figure 5.
Of this compilation flow, several components have been extensively studied in previous work: the compilation of dataflow synchronous programs to sequential code [9], C compilation [39], and WCET analysis [58]. This article focuses on the remaining topics: the front-end normalization phase in Section 3.1, the parallel back-end in Section 5, and the integration of all back-end tools around the timing model of Section 4, which guarantees by construction the respect of real-time requirements.
Platform-independent Modeling
In safety-critical avionics, the use of synchronous languages is meant to facilitate both the specification and the implementation of systems [28]. One key advantage of these languages is the deterministic execution model where concurrency is permitted inside bounded execution cycles, whereas the cycles are strictly sequenced in time. The de facto industry standard is the proprietary language Scade [28, 41] from ANSYS/Esterel Technologies, itself an evolution of Lustre [16]. Code fragments in this article use the syntax of the open-source Heptagon dialect of Lustre [9, 29], which implements many of the features of the Scade 6 language and is natively used by our tools [24].
The Heptagon Language.
Heptagon allows the functional description of systems in a hierarchic dataflow style. The programming unit of Heptagon is the dataflow node, which has inputs and outputs and a (possibly empty) internal state. The execution of a node is cyclic. At each cycle, the node reads its inputs and internal state and computes the value of outputs and the state of the next cycle. The code of a node can be either implemented directly as dataflow equations built of dataflow primitives and instances of other nodes, or provided externally in C following a well-defined programming interface. In Figure 1 (left), node simple has one input i and one output o. It is defined as the dataflow composition of three nodes (f, g, and h) and the dataflow primitive fby introduced below. Nodes f, g, and h are externally defined, their interface being declared in the include Func. Recursion is not permitted in the construction of the dataflow hierarchy: a node cannot be instantiated in its own definition, either directly or through the instantiation of other nodes. This article introduces several extensions to Heptagon to expose parallelism and convey real-time requirements.
Exposing Parallelism. Classical work in real-time scheduling makes a clear distinction between two specification and programming levels:
- Components meant to become sequential code, which are usually known as tasks or runnables.
- The system-level specification, which defines how tasks/runnables interact, exposing the potential parallelism that can be exploited during real-time mapping.
A synchronous language like Heptagon (and Lustre, Scade) does not make this distinction. Its naturally concurrent programming style allows both levels to be combined as a single dataflow hierarchy going all the way from system level to low-level machine operations. However, expressing parallelism of too fine a grain generally proves unprofitable for performance and unsuitable to harness in a real-time embedded context. This is primarily due to the synchronization and runtime scheduling overheads, and also to limitations of the timing analysis and parallelization algorithms. For this reason, we need a mechanism to specify which part of the parallelism of a Heptagon program is exposed to the parallelization algorithms.
We shall make the convention that parallelization algorithms only exploit the parallelism of the topmost node of the application and of the nodes that are inlined into it. Inlining is specified with a specific keyword. In Figure 1 (left), the node simple is inlined into node main (the topmost node of the Heptagon program), meaning that the dataflow exposed to parallelization algorithms is the one pictured at right. We call this part the integration program. All other nodes (in our case f, g, and h) are compiled to sequential code.
As the integration program corresponds to a system-level specification, it is required that it has no inputs and outputs. Input and output operations are performed by some of the node instances, which sample hardware devices. In Figure 1 , read_int and write_int are dedicated I/O functions, declared in library Io. Their implementation only requires that variables i and o are placed at specific addresses in memory.
Other ways of exposing parallelism exist. In Refs [19] and [41], the convention is that compilation is sequential except for groups of nodes (identified through program annotations) that are executed in a fork/join fashion. This may be better adapted to exposing (regular) data parallelism, but seemed less adequate to represent the unstructured, fine-grained parallelism of our two case studies. In addition, note that these works do not model real-time requirements.
Real-time Requirements.
Heptagon is a functional programming language for synchronous, real-time reactive systems. Non-functional requirements constrain how and when the computations of the Heptagon program are executed. We specify them through annotations (in red in Figure 1) that do not change the functional semantics of the Heptagon program. We have already introduced the inlining annotation. We shall use three more annotations to represent real-time requirements: period, release date, and deadline. This choice of real-time requirements is not new [14]. The only novelty here is that we represent the requirements with annotations of the Heptagon program. The Heptagon language, extended with these four annotations, allows us to represent the platform-independent specification of our system.
The period of the system is represented with an annotation of its topmost node. We require that all the computations and communications of cycle n of this node (including all computations and communications of nodes it instantiates) are executed after date n × p and before date (n + 1) × p. Release date and deadline requirements can be set on each of the statements of the integration program. If the period of the specification is p and if the release date and the deadline of a statement are, respectively, r and d, with 0 ≤ r < d ≤ p, then the execution of the statement inside cycle n must happen between dates n × p + r and n × p + d. When applied to an inlined node instance, these requirements are transferred onto all the statements of the inlined node.
The timing requirements do not specify the time unit (ms, μs, CPU cycle, etc.). It is the task of the engineer to ensure that all values use a consistent unit. In Figure 1 , the application period is 3,000 time units, and node f has a deadline constraint of 1,500 time units.
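As an illustration, the window rule above can be written down in a few lines of C. This is a minimal sketch and not part of the tool flow: the type window_t and the function exec_window are hypothetical names introduced here, and dates are plain unsigned integers in whatever time unit the engineer has chosen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type (not from the article): absolute execution window
   of one statement inside a given cycle. */
typedef struct {
    uint64_t earliest; /* n * p + r */
    uint64_t latest;   /* n * p + d */
} window_t;

/* Window of a statement with release date r and deadline d in cycle n of
   a specification with period p. The text requires 0 <= r < d <= p;
   0 <= r holds by construction for an unsigned type. */
window_t exec_window(uint64_t n, uint64_t p, uint64_t r, uint64_t d)
{
    assert(r < d && d <= p);
    window_t w = { n * p + r, n * p + d };
    return w;
}
```

With the values of Figure 1 (period 3,000, deadline 1,500 on f) and the implicit release date 0, exec_window(2, 3000, 0, 1500) places cycle 2 of f between dates 6,000 and 7,500.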
Exposing Node States.
The state of a node consists of all values passed from one cycle to the next. The state holder primitive is fby, used in line 9 of our example. The primitive produces on its unique output the value it acquired on its unique input in the previous cycle. During the first execution cycle, fby outputs the initialization constant (0 in our example). The state of node simple consists of the state of its fby statement, plus the state of all nodes it instantiates (in our case f, g, h).
When the state of a node is empty, we say that it is a dataflow function. It is always possible to transform a stateful node instantiation into a dataflow function instantiation. The transformation is done by exposing the node state using dataflow variables and an fby primitive. Assume the original node instantiation is: (o_1, ..., o_k) = f(i_1, ..., i_n). To perform the transformation, we need to explicitly define three Heptagon objects: the type of f's state, denoted f_state, the initial state of f denoted f_init (of type f_state), and the function f_step, whose signature extends that of f with one input and one output of type f_state. This function, called the transition function of f, computes the new state and outputs of f starting from the current state and inputs, but does not need an internal state because the state is explicitly represented with an input and an output. The instantiation of f can then be replaced with:
    (s, o_1, ..., o_k) = f_step(ps, i_1, ..., i_n);
    ps = f_init fby s;
The compilation of various Lustre dialects to sequential code [9] always builds these three objects, but only at C level. What we do is move this transformation up to dataflow level, as a source-to-source transformation. The resulting C code preserves its efficiency.
Transforming all node instantiations of the integration program into function calls has the advantage of exposing all data memory used by the application under the form of dataflow variables and fby primitives. We can then perform memory allocation at the level of the dataflow semantics of Heptagon.
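To make the state-exposure transformation concrete, the following C sketch mimics it for a toy node. It is an illustration under assumptions, not the output of the Heptagon compiler: the accumulator node, the names acc_state, acc_init, acc_step, and run_cycles are all hypothetical, and the loop plays the role of the cyclic execution with ps = acc_init fby s.

```c
#include <assert.h>

/* Hypothetical stateful node: outputs the running sum of its inputs. */
typedef struct { int sum; } acc_state;        /* role of f_state */
static const acc_state acc_init = { 0 };      /* role of f_init  */

/* Transition function (role of f_step): it has no hidden state, the
   state is received as an input and returned as an output. */
acc_state acc_step(acc_state ps, int i, int *o)
{
    acc_state s = { ps.sum + i };
    *o = s.sum;
    return s;
}

/* Executes `cycles` cycles of
       (s, o) = acc_step(ps, i);  ps = acc_init fby s;
   and returns the output of the last cycle (cycles must be >= 1). */
int run_cycles(const int *inputs, int cycles)
{
    assert(cycles >= 1);
    acc_state ps = acc_init; /* first cycle: fby yields the init constant */
    int o = 0;
    for (int n = 0; n < cycles; n++) {
        acc_state s = acc_step(ps, inputs[n], &o);
        ps = s;              /* fby: s of this cycle is ps of the next */
    }
    return o;
}
```

Because the only memory carried across cycles is the variable ps, a tool working at this level can see, and therefore allocate, every piece of state explicitly.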
Hyper-period Expansion and Normal Form.
The Heptagon language provides conditional control constructs allowing the representation of complex execution modes, data-dependent control, and multi-period execution. We assume in this article that multi-period specifications come under the form of Heptagon programs like the one in Figure 4 (left), where (strictly) periodic activation is represented with periodic counters. In this integration program, nodes n and m are, respectively, executed with periods 200 and 100. The period of the topmost node being 100, the multi-period behavior is achieved by using the Boolean variable cb to ensure that n is executed every other cycle. Line 5 ensures that the value of cb alternates between false and true. Expression "x when cb" is present with the value of x only when cb is present and true, so that n receives input and executes only in cycles where cb is present and true. The over-sampling allowing communication of values from n to m is achieved in lines 8 and 9 using operator merge. For more information on the syntax and semantics of the language operators, readers are referred to Refs [9, 29].
Note the high complexity of the sub- and over-sampling operations (even for a Heptagon program defining only two tasks), and the fact that the execution of n is subject to the implicit release date and deadlines imposed by the period of 100, meaning that its execution cannot take more than 100 time units, even though it is executed with a longer period. For these reasons, such a presentation of a multi-period specification may not be a good input for mapping algorithms. This is especially true when scheduling is performed offline and its output is a scheduling table whose length is the hyper-period of the system, i.e., the least common multiple of the node periods. In this case, common in industrial contexts, we believe that a better model of the application can be obtained by "unrolling" the cycles of the integration program over the length of the hyper-period (twice for the example in Figure 4), and then simplifying the expression of the periodic activations and sub- and over-samplings. We call this transformation hyper-period expansion.
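The arithmetic behind the unrolling factor can be sketched as follows. This is only an illustration of the least-common-multiple computation, not the expansion algorithm itself; the function names gcd_u64, hyper_period, and instances_per_hyper_period are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor (Euclid's algorithm). */
uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hyper-period: least common multiple of all node periods. */
uint64_t hyper_period(const uint64_t *periods, size_t count)
{
    assert(count > 0);
    uint64_t h = 1;
    for (size_t k = 0; k < count; k++) {
        assert(periods[k] > 0);
        h = h / gcd_u64(h, periods[k]) * periods[k];
    }
    return h;
}

/* Number of instances of a node of the given period that appear in one
   hyper-period after unrolling. */
uint64_t instances_per_hyper_period(uint64_t hyper, uint64_t period)
{
    assert(period > 0 && hyper % period == 0);
    return hyper / period;
}
```

For the example of Figure 4, periods 200 (node n) and 100 (node m) give a hyper-period of 200, so the expanded program contains one instance of n and two instances of m.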
Hyper-period expansion can also be accompanied by a relaxing of the release date and deadline requirements for nodes with larger periods, according to application-specific rules. For instance, in Figure 4 (right), the release date requirement on node n could be removed (replaced with the implicit one of 0), potentially allowing its computation to take more than 100 time units.
When unrolling creates multiple instances of the same node, as for m in our example, the state of the node must be exposed as explained in Section 3.1.4, to allow its transmission between instances. We actually expose the state of all stateful nodes, which facilitates memory allocation. The result of hyper-period expansion and state exposure is the normalized integration program of Figure 3.
Normalized programs satisfy the properties of SSA form [21]: there are no hidden data dependencies, each variable is assigned exactly once, and every variable is defined before it is used in a cycle. Thus, they are a good starting point for optimized resource allocation (e.g., memory optimization, using register allocation techniques).
The complexity of the normalization step is low: exposing node states is linear in time and space; the ratio between the size of the output of hyper-period expansion and its input is in O(MAF/MIF), where major frame (MAF) and minor frame (MIF) are defined in Section 6; inlining is simpler than classical C macro expansion.
Parallel Execution Platform
In avionics systems, the respect of execution time bounds must be demonstrated for normal conditions, but the system must also be robust to errors. To ensure robustness, we rely on event-driven semaphore-based synchronization. This guarantees the functional semantics in the presence of timing errors. It also ensures some degree of runtime robustness to timing errors through scheduling elasticity. The respect of execution time bounds for normal conditions can then be achieved through tight control of scheduling, memory allocation, and synchronization [12]. This is different from time-triggered approaches [15, 20, 37], which place timing predictability first, potentially at the expense of functional determinism and robustness.
3.2.1 Hardware.
The method we propose can be applied to shared memory architectures satisfying two hypotheses: (H1) processing cores allow the computation of static WCET bounds for code executed in isolation and (H2) WCET bounds in the presence of interferences on shared resources can be computed as the sum of the WCET bound in isolation and a term depending on the number of accesses to the shared resources. This is always possible for architectures that are fully timing compositional [60].
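Hypothesis (H2) has an additive shape that can be illustrated with a small C function. The sketch below is an assumption-laden example and not the timing model of Section 4: it supposes that the interference term is linear in the number of accesses to each memory bank, with a per-bank worst-case delay per access supplied by the caller. The function name and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Additive bound in the spirit of (H2):
       bound = WCET in isolation + sum over banks of
               accesses[b] * delay_per_access[b]
   The linear per-access delay is an assumption made for illustration. */
uint64_t wcet_with_interference(uint64_t wcet_isolation,
                                const uint64_t *accesses,
                                const uint64_t *delay_per_access,
                                size_t banks)
{
    assert(banks == 0 || (accesses != NULL && delay_per_access != NULL));
    uint64_t bound = wcet_isolation;
    for (size_t b = 0; b < banks; b++)
        bound += accesses[b] * delay_per_access[b];
    return bound;
}
```

A bank that no other core can access during the execution of the function would simply be given a delay of 0, so that its accesses do not inflate the bound.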
While the design of such architectures is an active field of research (see, e.g., Refs [12] and [43]), we are aware of only two compliant processors that are available off-the-shelf: the Kalray MPPA many-core series [33] and the Infineon Aurix TC27x tri-core series [31].
We focus in this article on the architecture of a single compute cluster of the Kalray MPPA Bostan many-core. This simplifies the definition of the timing model because: (a) there are no shared caches, (b) L1 caches have an LRU replacement policy and no hardware coherency, (c) the interconnect is a full crossbar with fair arbiters, and (d) without interferences, memory accesses take the same time regardless of core and memory bank (uniform memory access). The timing model of this architecture, defined in Section 4, can be easily extended to cover the non-uniform memory access architecture (NUMA) of the TC27x (which can also be configured to satisfy properties (a)-(c)).
A compute cluster of the Kalray MPPA 256 many-core has 16 computing cores and 2MB of static RAM divided into 16 banks of 128kB each. Each core has 32kB of 8-way set-associative instruction cache and 8kB of 2-way set-associative data cache. Special hardware devices allow hardware synchronization of very low latency between the computing cores (a few clock cycles) without the need to perform memory sampling, as classical multi-core spinlocks do.
Software Organization and API.
To comply with industrial requirements, our tool flow produces statically scheduled, statically allocated, bare metal implementations. The code we produce is showcased in Figures 2 and 5 . Each CPU is assigned one sequential thread-a function that never terminates and that is never preempted. This function consists of an initialization section followed by an infinite loop. A global barrier synchronizes the starts of the loop bodies for all CPU threads, so that execution advances in lockstep on all CPUs. Each iteration of the loop bodies performs one execution cycle of the normalized integration program. In Figure 2 , the global synchronization code of both threads is contained in the yellow box. The initialization section sets the state variables and semaphores to their initial values.
Real time. In Figure 2, the time_wait call before the synchronization barrier of thread 0 is traversed exactly 3,000 time units after either the initialization of the program or the previous call to time_wait. It ensures that the synchronization barrier is traversed with the period prescribed by the integration program of Figure 1. Calls to time_wait can also be used inside the loop body to enforce release date requirements (not present in our example). The compilation process presented in this article ensures that control always reaches a call to time_wait before the specified timeout has elapsed. In other words, no deadline is missed. In contexts where the execution platform cannot provide static timing guarantees or when explicitly required to do so (e.g., for certification purposes), calls to time_wait can be used to enforce deadline requirements of the integration program. In this case, the implementation of time_wait must be extended with code that detects and handles deadline misses.
Locks. Thread synchronization during one loop iteration is performed using a fixed set of locks. They are simplified versions of the POSIX or C++11 mutexes [8] that can be given a time-predictable implementation using specialized hardware. When a thread calls unlock(l), the state of lock l becomes true. A call to lock(l,c) waits until the state of l is true and then changes its state back to false. The c argument identifies the requesting CPU to avoid obtaining it at runtime. As in C++11, behavior is undefined when calling unlock on an already true lock, and the choice of thread to unlock is not specified when two or more lock calls are active on the same lock. Our compilation process will ensure that these two conditions never occur.
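The lock semantics described above amount to a one-bit state machine, which the following single-threaded C model makes explicit. It is a sketch for reasoning about the protocol, not the hardware-backed implementation: lock_model, model_unlock, and model_try_lock are hypothetical names, and the blocking wait of lock(l,c) is modeled by a function that reports whether the call could proceed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one lock: `open` is the state of the lock
   (true after unlock, false after a successful lock). */
typedef struct { bool open; } lock_model;

/* unlock(l): the state becomes true. Unlocking an already true lock is
   undefined behavior in the text; the model flags it with an assertion. */
void model_unlock(lock_model *l)
{
    assert(!l->open);
    l->open = true;
}

/* lock(l,c): the real call waits until the state is true, then resets it
   to false. The model returns false where the real call would still be
   waiting, and true once the lock has been taken. */
bool model_try_lock(lock_model *l)
{
    if (!l->open)
        return false;
    l->open = false;
    return true;
}
```

Each unlock thus enables exactly one subsequent lock, which is what allows a lock to encode a single dependency between two points of different threads.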
Barriers. We do not use synchronization to isolate computation from communication phases, either in the execution of the whole system (as in approaches based on the bulk synchronous parallel (BSP) model [41, 57]) or in that of individual nodes (as in Ref. [49]). Doing so would enforce space/time isolation during computation phases, which greatly facilitates timing analysis. However, on embedded platforms, and on the Kalray many-core in particular, memory is in short supply, and isolation requires that each thread has its own memory space containing a copy of all variables it uses. This would lead to significant memory replication, conflicting with our optimization objectives. Another approach providing the same isolation effect would be to ensure that the execution of each node fits into the processor cache. In this case, memory replication is not needed, and the communication phases consist of cache operations alone [44]. However, when user-specified nodes do not fit into the cache (as in our case studies), heavy modifications are needed in the C compiler to automatically slice the original functions into smaller code intervals. This poses a problem in avionics contexts, because it would require not only the development, but also the qualification of the modified C compiler (a long and costly process). For this reason, we do not do it.
We only use global synchronization barriers to enforce the separation between successive iterations of the control loop, using primitives barrier_reinit and barrier_sync.
Memory coherency. To allow shared memory communication on platforms without hardware coherency, we need to ensure coherency through software. This is done using two functions: dcache_inval invalidates the content of the data cache, so that the following reads take their value from the shared RAM; dcache_flush forces the writing of the write buffer contents into the shared RAM before giving control in sequence. This completes the definition of the platform application programming interface (API), which is formed of seven API primitives: time_wait, dcache_inval, dcache_flush, lock, unlock, barrier_reinit, and barrier_sync.
Memory allocation. It is fully static, specified using linker scripts like those in Figure 5 . The allocation of dataflow functions and variables is an output of the mapping algorithms defined next. The allocation of threads, thread stacks, system code, and pre-allocated (e.g., I/O) data is decided prior to scheduling so as to reduce interferences.
Note in Figure 5 that the code, data, and stack of the two threads are allocated on different memory banks (to reduce interferences). When the number of threads is larger than 12, allocation on separate banks is no longer possible. In this case, multiple threads must be allocated on the same memory bank. 7 Also note that, unlike in other approaches [19, 41, 43] , communication is by shared variables alone, and no explicit memory copy is performed by thread code, for either communication or for fby implementation. This largely reduces the data memory footprint, as well as code size, for generated code. This proved to be very important on our memory-constrained evaluation platform, given the size of the use cases.
TIMING MODEL
The structure of the generated code, exemplified in Figures 2 and 5, has been chosen to allow the computation of tight bounds on the execution time of one cycle of the for loops running in lockstep. Each iteration of these top-level for loops implements one hyper-period of the integration program. Each loop body is formed of the global barrier code (in the yellow box in Figure 2) and of a sequence of snippets. Each snippet contains a call to the C function 8 implementing the dataflow function. Cache operations placed before and after the function call ensure memory coherency. Lock operations are placed at the beginning and at the end of the snippet. They are always paired to enforce order relations between specific points of different threads (the red arrows of Figure 2). One unlock operation marks the dependency start and one lock operation on the same lock marks the dependency end. The code generation process ensures that no other lock operation can access the same lock between the points in time where the unlock and lock calls corresponding to the dependency are executed.
The operations of the snippets (dataflow function calls and API primitive calls) are naturally organized as a directed acyclic graph (DAG). This DAG represents all the dependencies between operations performed during one cycle of the for loops running in lockstep (between two global barriers). The nodes of this DAG are the individual operations. Two operations $o_1$ and $o_2$ are connected by an edge $o_1 \to o_2$ if either: (1) $o_2$ is sequenced immediately after $o_1$ in the body of one of the threads, or (2) $o_1$ and $o_2$ form one unlock/lock pair encoding one of the inter-thread dependencies. We also add to this DAG one special release operation r per release date annotation in the integration program. This operation is connected with a single edge $r \to o$ to the first operation o of the snippet associated with the annotated dataflow node.
The scheduling and code generation processes ensure that this graph is acyclic. Figure 6 presents the DAG associated with the example in Figure 2.
Property 1 (Parallel WCET computation). Assume that for each operation o of this DAG we can compute an upper bound on its duration WCET(o) (for release operations, we consider instead the value of the release annotation), and that B is an upper bound on the duration of the barrier synchronization (including the call to time_wait).
Then, according to Ref. [47], an upper bound on the duration of an iteration of the loops running in lockstep is obtained by adding B to the length of the critical path of the DAG. 9 To allow parallel WCET computation, we still need to compute WCET(o) for every dataflow function call and primitive call o. The remainder of this section details how these values are computed.
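The bound of Property 1 can be sketched as a longest-path computation over the operation DAG. The following is a minimal illustration, not the tool's implementation: the function name, the dictionary-based encoding of the DAG, and the numeric values are all hypothetical. Release operations are handled like any other node, with the release annotation used as their duration.

```python
from graphlib import TopologicalSorter


def parallel_wcet_bound(wcet, edges, barrier_bound):
    """Return B plus the length of the critical path of the DAG.

    wcet: maps each operation to its WCET(o) (or release date for
          release operations).
    edges: iterable of (o1, o2) pairs, one per edge o1 -> o2.
    barrier_bound: the bound B on barrier synchronization.
    """
    preds = {o: set() for o in wcet}
    for src, dst in edges:
        preds[dst].add(src)
    finish = {}
    # static_order() yields predecessors first and raises CycleError
    # if the graph is not acyclic.
    for o in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[o]), default=0)
        finish[o] = start + wcet[o]
    return barrier_bound + max(finish.values(), default=0)


# Two threads (a -> b and c -> d) plus one unlock/lock dependency a -> d.
bound = parallel_wcet_bound(
    {"a": 3, "b": 4, "c": 1, "d": 5},
    [("a", "b"), ("c", "d"), ("a", "d")],
    barrier_bound=2,
)
```

In this example the critical path is a, d with length 8, so the bound is 10.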
Dataflow Function Analysis
To derive the non-functional characterization of the C functions implementing dataflow nodes, we use the WCET analysis tool aiT from AbsInt. The tool is the de facto industry standard for static timing analysis [59] .
We use aiT as a black box. 10 It works on binary code. To obtain the characterization of one function, its code is compiled and linked separately, but using the same conventions as for the final implementation code. To simplify the presentation, we assume, for the scope of this article, that each function inlines all calls to external functions. 11 For a function h, the resulting object code file has exactly one code (.h_text) and one data section (.h_data), which are allocated sequentially in memory (grouping them facilitates memory allocation).
8 In the general case, the call can be guarded by an if statement representing conditional activation. The examples of our article do not feature conditional activation, which simplifies the presentation.
9 From the WCET analysis perspective of Ref. [47], this DAG is the CFG of the application after adding the communication arcs, assuming that each operation has its own basic block. The fundamental difference w.r.t. Ref. [47] is that we cannot build this CFG starting from existing parallel code, because this analysis must be performed incrementally during parallelization.
10 Readers interested in the theory and practice of WCET analysis are referred to Refs [58] and [59].
Characterization. By using aiT, we obtain for each function h a characterization consisting of:
- an upper bound WCET(h) on the worst-case execution time of h;
- an upper bound WCSS(h) on the worst-case size of the stack needed by h;
- upper bounds WCAT(h, r) on the worst-case number of memory accesses to the memory regions r containing the code section, the data section, the stack, and the pre-allocated data; 12
- the sizes CS(h) and DS(h) of the generated code and data sections.
Analysis and mapping conventions. The mapping algorithms perform allocation by choosing the start address of the code section (f_ALLOC in Figure 5(c)). To make sure that the figures we obtain with aiT are worst-case bounds covering all possible mappings, we perform WCET analysis with a fixed stack address computed to match (modulo cache size) the one used at execution time. It is also required that the *_ALLOC values computed by the mapping algorithms are multiples of the instruction cache line size. Furthermore, *_ALLOC must be a multiple of 4 KB 13 if the code or data sections are larger than 4 KB or if pre-allocated data and the stack include addresses that overlap modulo 4 KB.
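The two alignment rules above can be expressed as a simple predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the 64-byte instruction cache line used in the example is an illustrative value not taken from the text, and the 4 KB granularity is interpreted as 4,096 bytes.

```python
PAGE = 4096  # assumption: the 4 KB granularity of the text, in bytes


def alloc_is_valid(alloc, code_size, data_size, icache_line,
                   overlaps_modulo_page=False):
    """Check a candidate *_ALLOC start address against the alignment rules.

    overlaps_modulo_page: True if pre-allocated data and the stack include
    addresses that overlap modulo 4 KB (computed elsewhere).
    """
    # Rule 1: always a multiple of the instruction cache line size.
    if alloc % icache_line != 0:
        return False
    # Rule 2: page alignment when a section exceeds 4 KB or when
    # pre-allocated data and the stack overlap modulo 4 KB.
    needs_page = (code_size > PAGE or data_size > PAGE
                  or overlaps_modulo_page)
    return alloc % PAGE == 0 if needs_page else True
```

For instance, with a 64-byte line, address 0x1040 is acceptable for a 512-byte code section but not for an 8,192-byte one, which needs a start address such as 0x2000.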
Individual Operations in Isolation
Function call operations. In the previous section, we explained how to use aiT to derive the WCET of each dataflow function f called by the threads. But to apply the parallel WCET computation method of Property 1, we need WCET estimations covering the whole cost of each function call operation:
$$\mathrm{WCET}(o) = \mathrm{WCET}(f_o) + \mathrm{call\_WCET}(f_o) + \mathrm{interf}(o)$$
Here, $f_o$ is the dataflow function called by operation o, and $\mathrm{WCET}(f_o)$ is computed using aiT. The term $\mathrm{call\_WCET}(f_o)$ is an upper bound on the duration (in isolation) of the thread code that calls $f_o$: placing the arguments on the stack, obtaining the address of the function, branching to it, and, finally, placing the results (if any) in the locations passed by reference. The term $\mathrm{interf}(o)$ is an upper bound on interferences from other threads, and its computation is covered in the next section.
To allow the computation of $\mathrm{call\_WCET}(f_o)$, the code that calls $f_o$ is analyzed to produce a full non-functional characterization, like that produced using aiT. But aiT can only be applied on binary code, so it cannot be applied on the exact assembly code sequences for function call and return before mapping and code generation take place. For this reason, we rely on manual reasoning to derive the needed upper bounds for the target system's application binary interface (ABI). The analysis starts from a description of the calling convention used by the C compiler. The resulting characterization is a function of the number, type, and order of arguments of a dataflow function. This manual analysis is done at compiler construction time, for each ABI/C compiler pair.
In our example of Figure 1 , calling the C function associated with f requires placing on the stack the value of variable i and the pointer to x. It requires read accesses to the thread code and to the location of i, and a single write access to the location of x, at the end of the function execution.
Primitive call operations. We also need non-functional characterizations for the API primitive call operations. Given that the number of arguments is here fixed, the WCET of a primitive call operation can be computed as:
$$\mathrm{WCET}(o) = \mathrm{WCET}(p_o) + \mathrm{interf}(o)$$
Here, the first term only depends on the primitive type $p_o \in \{\mathrm{lock}, \mathrm{unlock}, \mathrm{inval}, \mathrm{flush}\}$. To derive the characterization of WCET(lock), WCET(unlock), WCET(inval), and WCET(flush), we rely on a manual process (as for $\mathrm{call\_WCET}(f_o)$). For WCET(lock), we assume that the primitive does not need to wait.
Memory Access Interferences
Consider the operation o. Once we have a bound on its worst-case duration in isolation, the only ingredient we need to compute WCET(o) is an upper bound interf (o) on the interferences from other threads. In the absence of shared caches, these interferences come from the interleaving of requests at the level of the multiplexers that guard the access to memory banks.
Consider the memory bank b, and assume that operation o makes $r_o(b)$ read accesses and $w_o(b)$ write accesses to bank b, for a total of $a_o(b) = r_o(b) + w_o(b)$ accesses. We distinguish between read and write accesses, which often have different durations. On our test platform, read accesses transfer an entire cache line, keeping the bank input multiplexer occupied for RD = 8 hardware clock cycles, whereas write accesses only concern one word at a time and last for only WR = 1 clock cycle. Also assume that operation t makes $a_t(b) = r_t(b) + w_t(b)$ accesses to bank b. Then, under the Round Robin arbitration policy, each memory access of o can be delayed by at most one memory access of t. Among these delays, the ones caused by read accesses of t take at most RD = 8 cycles, and there can be a maximum of $dr(o, t, b) = \min(a_o(b), r_t(b))$ delays of this type. The remaining accesses of o can only be delayed by write accesses of t, each for at most WR = 1 cycle. Therefore, an upper bound on the global delay that t may impose on o due to accesses to b is:
$$\mathrm{interf}(o, t, b) = RD \times dr(o, t, b) + WR \times \min\bigl(a_o(b) - dr(o, t, b),\; w_t(b)\bigr)$$
An upper bound on the global delay that t may impose on o is:
$$\mathrm{interf}(o, t) = \sum_{b} \mathrm{interf}(o, t, b)$$
An upper bound on the delay that the code of other threads may impose on o is:
$$\mathrm{interf}(o) = \sum_{t} \mathrm{interf}(o, t)$$
where t ranges over the operations of the other threads.
This formula extends previous work on timing analysis of parallel applications [49, 50] by considering the different contributions of read and write accesses to the interference budget.
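The interference bound can be illustrated with a short computation. This is a sketch rather than the tool's code: the function names and the per-bank dictionary encoding are hypothetical, and the write-delay term (the accesses of o not already delayed by a read of t, capped by the number of writes of t) is an assumption derived from the Round Robin argument above. Only RD = 8 and WR = 1 come from the text.

```python
RD = 8  # cycles a read access occupies the bank multiplexer
WR = 1  # cycles a write access occupies the bank multiplexer


def interf_bank(r_o, w_o, r_t, w_t, rd=RD, wr=WR):
    """Bound on the delay t imposes on o through one memory bank."""
    a_o = r_o + w_o
    dr = min(a_o, r_t)        # accesses of o delayed by a read of t
    dw = min(a_o - dr, w_t)   # remaining accesses delayed by a write of t
    return rd * dr + wr * dw


def interf(o_accesses, other_ops):
    """Sum the per-bank bounds over all banks and all interfering operations.

    o_accesses: maps bank -> (reads, writes) for operation o.
    other_ops: list of such maps, one per interfering operation t.
    """
    total = 0
    for t_accesses in other_ops:
        for bank, (r_o, w_o) in o_accesses.items():
            r_t, w_t = t_accesses.get(bank, (0, 0))
            total += interf_bank(r_o, w_o, r_t, w_t)
    return total
```

For example, if o makes 10 reads and 2 writes to a bank on which t makes 5 reads and 20 writes, the bound is 8 × 5 + 1 × 7 = 47 cycles: five accesses of o may each wait for a read, and the seven remaining ones for a write.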
Note that the synchronizations enforced by lock, unlock, and barrier_sync do not exactly match the frontiers of operations. The computation of $r_o(b)$ and $w_o(b)$ must take this into account.
PARALLEL BACK-END: SCHEDULING AND CODE GENERATION
The parallel back-end of our tool flow has the same general functions as the back-end of a compiler for an imperative language. Starting from the intermediate representation (in our case, the normalized platform-independent program of Section 3.1.5), it performs memory allocation and scheduling, and then generates target-specific code. Like a back-end compiler for a sequential imperative language [35, 54], our back-end uses a static scheduling heuristic based on list scheduling (presented in Section 5.4) for the sake of scalability.
Schedulability vs. Optimization
But there are also major differences with respect to classical compilation. The main performance-related goal here is not to optimize some metric such as speed (throughput), memory footprint, or energy consumption. Instead, it is to produce an implementation that is functionally correct and that respects the non-functional requirements [6].
To provide schedulability guarantees, our parallel back-end must perform a safe accounting of non-functional properties such as real time or memory use. Safety here means that actual resource use in the implementation must never overstep the reservations made by the back-end. To this end, the back-end maintains worst-case bounds on resource use that are updated at each step of the mapping process, and checked against reservation sizes. Computing safe and tight resource use bounds requires not only knowledge of the major mapping decisions (allocation, scheduling) but also tight control over code generation details such as the structure of the thread code, or C compiler optimization choices. By comparison, classical optimizing compilers provide no timing safety guarantees.
Implementation and Abstraction Issues
Like in other static scheduling approaches, resource reservations are organized in a reservation table (also known as scheduling table or timetable). The data structures manipulated by our scheduling algorithms are the following (in OCaml syntax):
A reservation table is defined by its length and by the static resource reservations it makes. In our case, the length is an input to the scheduling routine and must be equal to the period of the integration program. The scheduling table describes how the resources are allocated to the various computations and communications during one generic cycle of the integration program.
We need to allocate two resources: CPU time and memory. CPU time is allocated to snippets (each containing one function call operation and some primitive calls). Recall that one function call operation is needed to implement each function instance of the integration program. Allocation of CPU time to snippets is done using the inst_cpu and inst_time fields of the reservation_table structure. The two fields are maps (partial functions) from function instances to, respectively, CPU identifiers and time intervals. When scheduling succeeds, they associate to each function instance, i.e., to each snippet, exactly one CPU and one time interval [s, e) with 0 ≤ s < e ≤ l, where l is the length of the reservation table. Each dataflow function f is allocated one memory interval, used to store all its code and local data (sections .f_text and .f_data of its compiled version). Each dataflow variable is allocated one memory interval.
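The CPU time part of the reservation table can be sketched as follows. This is an illustrative Python model, not the OCaml definitions used by the tool: only the names `inst_cpu` and `inst_time` and the constraint 0 ≤ s < e ≤ l come from the text, while the class name, the `length` field name, and the `reserve` helper are hypothetical. The example values are the reservations reported for Figure 9.

```python
from dataclasses import dataclass, field


@dataclass
class ReservationTable:
    length: int                                    # l, the table length
    inst_cpu: dict = field(default_factory=dict)   # instance -> CPU id
    inst_time: dict = field(default_factory=dict)  # instance -> (s, e)

    def reserve(self, inst, cpu, start, end):
        """Record the half-open interval [start, end) for inst on cpu."""
        if not (0 <= start < end <= self.length):
            raise ValueError("interval must satisfy 0 <= s < e <= l")
        # Sequential use of each CPU: no overlap with earlier reservations.
        for other, (s, e) in self.inst_time.items():
            if self.inst_cpu[other] == cpu and start < e and s < end:
                raise ValueError(f"overlaps {other} on CPU {cpu}")
        self.inst_cpu[inst] = cpu
        self.inst_time[inst] = (start, end)


rt = ReservationTable(length=3000)
rt.reserve("f", 0, 0, 1600)
rt.reserve("g", 1, 200, 1800)
rt.reserve("h", 0, 1800, 2800)
```

Keeping both fields as partial maps mirrors the description above: an instance that has not yet been scheduled simply has no entry.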
Memory allocation for the boot and thread code, as well as the allocation of stacks is done before the mapping of dataflow function instances, with fixed-size memory reservations done on predefined memory banks, as described in Section 3.2.2 and in Figure 5 .
The reservation table produced by the scheduler must safely abstract the functional and timing behavior of the actual C code. For this to be true, a table and the code associated with it must satisfy a number of well-formedness properties, such as the sequential use of resources 14 or the respect of the control and data dependencies. Most of these properties have been covered in previous literature, e.g., see Refs [12] , [46] , and [53] . We only mention two of them that are important for subsequent developments:
Interf1. The CPU time reservation of a function instance must be longer than the sum of the WCETs of all the operations of the associated snippet (including interferences).
Interf2. When two snippets access at least one common memory bank and have non-overlapping time reservations, their executions must always be sequenced, e.g., by mutex operations.
The second property implies that, in the computation of interf(o) (in Section 4.3), the sum can be restricted to the operations t whose time reservation overlaps with that of o.
Timing Closure
The reservation table of the full application is built incrementally, using the algorithms of Section 5.4. Function instances of the integration program are considered one by one, in an order compatible with the data dependencies. When a function instance is considered, resources are allocated to it, and to all yet unallocated dataflow variables it uses. Once the mapping choices are made for an instance, function, or variable, they are never changed (there is no backtracking).
The main difficulty in this approach is to ensure the respect of property Interf1. When a function instance is mapped, its CPU time reservation must be chosen without knowledge of interferences from function instances yet to be mapped. Given the structure of the generated code, these instances may introduce new memory access interferences, and may change the synchronization code.
To bound the duration of synchronization code in the absence of the full schedule, we assume a particular method for synthesizing it. To each snippet, we associate two synchronization points: one at the beginning (before the inval operation preceding the function call), and one at the end (after the flush operation following the function call). All the synchronization points of all the snippets are fully sequenced using mutex operations, in a way that enforces both the dataflow dependencies between functions and the respect of property Interf2. In the C code, each synchronization point is translated into at most two mutex operations: one lock waiting for the completion of the previous synchronization point, and one unlock to give control to the next point. The code of Figure 2 shows the result of synchronization synthesis for our small example. Some synchronization points require no encoding here (the start of f and g, the end of h). The total order between the remaining synchronization points is represented with the red arrows.
Under this code generation method, the total synchronization overhead of one snippet is bounded from above by 2 × (WCET(lock) + WCET(unlock)). The sequencing of synchronization points must also be taken into account at scheduling time by ensuring that the beginning and end of each snippet reservation are mutually exclusive, for a time span of WCET(lock) + WCET(unlock).
To break out of the circular dependency between scheduling and timing analysis, we set a bound on access interferences as a scheduling constraint called the interference provision budget. It is defined as a percentage of the dataflow function WCET. In the current implementation of the back-end, this percentage is the same for all dataflow functions and is an input to the scheduling algorithms (the provision input in Figure 7). Then, in accordance with Section 4, the minimal length of the time reservation made for a snippet calling function f is:
$$\mathrm{reservation}(f, \mathit{provision}) = \mathrm{WCET}(f) + \mathrm{call\_WCET}(f) + \mathrm{WCET}(f) \times \mathit{provision} + \mathrm{WCET}(\mathrm{inval}) + \mathrm{WCET}(\mathrm{flush}) + 2 \times (\mathrm{WCET}(\mathrm{lock}) + \mathrm{WCET}(\mathrm{unlock}))$$
In this formula, the interference budget for function f is WCET(f) × provision. The choice of provision is currently done using a dichotomic search.
Through the inst_current_wcet field of the scheduling_state data structure, the scheduling routine maintains at all times a safe upper bound on the execution time of all function instances that were already scheduled. These figures include memory access interferences from already scheduled function instances. Whenever a new function instance fi is mapped, inst_current_wcet(fi) is first computed, and inst_current_wcet(gi) is updated to include interferences from operations of fi for all function instances gi that have already been scheduled. It is required that, at all times during scheduling, the scheduling state satisfies inst_current_wcet(fi) ≤ length(rt.inst_time(fi)). Mapping choices not respecting this requirement are rejected.
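The incremental timing closure check can be sketched as below. This is a simplified model and not the algorithm of Figure 8: the function name, the dictionary-based state, and the `interference` callback are hypothetical, and the numeric values are illustrative. Only the field names `inst_time` and `inst_current_wcet` and the acceptance condition inst_current_wcet(fi) ≤ length(rt.inst_time(fi)) come from the text.

```python
def try_map(state, fi, isolated_wcet, interval, interference):
    """Try to add instance fi; return the new state, or None if rejected.

    state: {"inst_time": {inst: (s, e)}, "inst_current_wcet": {inst: wcet}}
    interference(x, y): bound on the delay that y imposes on x; expected
    to return 0 when the reservations of x and y do not overlap.
    """
    inst_time = dict(state["inst_time"])
    wcet = dict(state["inst_current_wcet"])
    wcet[fi] = isolated_wcet
    for gi in inst_time:
        wcet[fi] += interference(fi, gi)  # already-mapped gi delays fi
        wcet[gi] += interference(gi, fi)  # fi delays already-mapped gi
    inst_time[fi] = interval
    # Every current WCET estimate must still fit its time reservation.
    for inst, (start, end) in inst_time.items():
        if wcet[inst] > end - start:
            return None  # mapping choice rejected
    return {"inst_time": inst_time, "inst_current_wcet": wcet}


empty = {"inst_time": {}, "inst_current_wcet": {}}
s1 = try_map(empty, "f", 1000, (0, 1600), lambda x, y: 150)
s2 = try_map(s1, "g", 1000, (200, 1800), lambda x, y: 150)
rejected = try_map(s1, "g", 1000, (200, 1800), lambda x, y: 700)
```

Because the state is copied before being updated, a rejected mapping leaves the previous state untouched, so another date or CPU can be tried from it.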
Scheduling Algorithm
The scheduling algorithm is structured as a classical list scheduling heuristic. The use of list scheduling in real-time and embedded systems is by no means new [5, 12, 36, 55] , and for this reason (as well as for space reasons), we do not present here all subroutines. We focus on the top-level routines that include the major originality points: accounting for time interferences and ensuring timing closure.
With the notations of the previous sections, the scheduling routines are presented in Figures 7 and 8. The first algorithm is the list scheduling driver that makes most mapping decisions. The second algorithm updates the scheduling state based on these decisions and determines if it is well-formed. Thus, it can be seen as a schedulability test.
The scheduling driver consists of an initialization phase, followed by a while loop that schedules at each iteration one function instance of the dataflow. Scheduling will fail if the scheduling of one function instance fails. At each iteration, the function instance to schedule is chosen using function choose among those whose immediate predecessors have all been scheduled. Choice is deadline-driven. For function instances with the same deadline, we choose the one with least laxity. For the example in Figure 1, the instance of f is chosen first due to its deadline (the choice being between f and g). Then, g is scheduled (no choice exists because h depends on g), and, finally, h.
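The selection policy can be sketched as follows. This is a minimal illustration of the ordering only, not the driver of Figure 7: the function names are modeled on the text, while the dependency set, deadlines, and laxities in the example are hypothetical values chosen to reproduce the order f, g, h described above.

```python
def choose(ready, deadline, laxity):
    """Pick the ready instance with the earliest deadline, then least laxity."""
    return min(ready, key=lambda inst: (deadline[inst], laxity[inst]))


def schedule_order(preds, deadline, laxity):
    """Return the order in which a list scheduler would consider instances.

    preds: maps each instance to the set of its immediate predecessors.
    """
    scheduled, order = set(), []
    while len(order) < len(preds):
        # Ready instances: unscheduled, with all predecessors scheduled.
        ready = [i for i in preds
                 if i not in scheduled and preds[i] <= scheduled]
        if not ready:
            raise ValueError("dependency cycle")
        nxt = choose(ready, deadline, laxity)
        scheduled.add(nxt)
        order.append(nxt)
    return order


order = schedule_order(
    preds={"f": set(), "g": set(), "h": {"g"}},
    deadline={"f": 1600, "g": 1800, "h": 3000},  # hypothetical values
    laxity={"f": 0, "g": 0, "h": 0},             # hypothetical values
)
```

Using a (deadline, laxity) tuple as the sort key makes laxity act purely as a tie-breaker, which matches the policy stated in the text.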
Scheduling is performed at the earliest date possible after the release date and after the end of all predecessors. Starting at this date, scheduling will be attempted on every CPU (by the for loop in lines 30-33). If scheduling at a specific date fails on all CPUs, time is advanced (line 37) and scheduling attempted again until a solution is found or scheduling is no longer possible given the duration and deadline of the function instance (line 23). If for a given start date multiple CPUs allow the scheduling of the current function instance, we select one that minimizes a cost function. The selection is performed by function choose_optimal (in line 41), and the cost_function currently used is the end date of the time reservation made for the function instance based on the CPU allocation choice. In Figure 9, we provide the CPU time reservations produced by the scheduling algorithms for the small example in Figure 1, under the following assumptions: 15 The algorithms allocate f and h on CPU0 and g on CPU1. The length of the reservation table is the period of the application (3,000). The reservation made for f spans from date 0 to date 1,600. That of g spans from 200 to 1,800, and that of h from 1,800 to 2,800. Reservation length must cover the WCET of the function (in light blue), that of the function call code (in dark blue), the interference provisions (in purple), the cache operations (in green), and the start and end synchronization points (in orange).
The allocation and scheduling of f is possible on both CPU0 and CPU1, at date 0. The two possible schedules have the same real-time properties (and, thus, the same cost), and allocation on CPU0 is retained. When mapping g, the earliest start date determined by release date requirement and predecessor dependences is computed (in line 20) as sd = 0. However, starting g at date 0 would mean that the start synchronization points of f and g overlap in time, which is not permitted according to Section 5.3. The call to next_sync_free (in line 28) finds the first date after sd allowing the mapping of a synchronization point that is non-overlapping with synchronization points of previously-mapped function instances. Recall that the length of a synchronization point is δ = WCET(lock) + WCET(unlock). Therefore, the mapping of g is first attempted at date δ = 200 on all processors, through calls to function ScheduleBlockAtDateOnCPU.
This function takes the remaining scheduling decisions, builds the new scheduling state, and performs schedulability analysis to determine if the mapping choices allow the respect of real-time requirements. The reservation end date is chosen in line 8. The time interval must be not yet allocated, at least as long as reservation(f, provision) (which is used to initialize time_len), and its end (for a length of δ) must not overlap with synchronization points of previously mapped functions. Function find_free_interval always chooses the smallest end date ensuring this (or fails).
Ensuring that synchronization points do not overlap means that time reservations are often longer than reservation( f , provision). In our example, the end synchronization point of g must not overlap with that of f, meaning that the reservation end is set to 1,800 (hence, the white space of 300 time units inside the reservation made for g). If an end date cannot be found, the function returns NonSchedulable. If a date is found, the scheduling state is updated with the new processor time reservation (in line 10).
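The placement of synchronization points in this example can be reproduced with a small sketch. The name `next_sync_free` comes from the text, but its signature and body here are hypothetical. The sketch assumes that a synchronization point occupies a window of length δ, that the end synchronization point of a reservation occupies its last δ time units, and that g needs at least 1,300 time units (its 1,600-unit reservation minus the 300 units of white space mentioned above).

```python
def next_sync_free(sd, delta, sync_starts):
    """First date >= sd at which a window [date, date + delta) overlaps
    none of the existing synchronization windows [s, s + delta)."""
    date = sd
    # Scanning in increasing order is enough: once the date has moved past
    # a window, it cannot overlap any earlier window.
    for s in sorted(sync_starts):
        if s < date + delta and date < s + delta:
            date = s + delta
    return date


DELTA = 200
# Synchronization points of f: start at 0, end at 1,400 (reservation
# [0, 1600), end point assumed to occupy the last DELTA units).
f_sync = [0, 1400]

g_start = next_sync_free(0, DELTA, f_sync)
# Earliest end synchronization point of g, then pushed past that of f.
g_end_sync = next_sync_free(g_start + 1300 - DELTA, DELTA, f_sync + [g_start])
g_end = g_end_sync + DELTA
```

Under these assumptions, the sketch yields the start date 200 and the end date 1,800 reported for g.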
In building the schedule of Figure 9, we implicitly assumed that the interference budgets are not overstepped during scheduling. If, for instance, the interferences of f on g are larger than 200, then the scheduling of g is not possible before the end of the reservation of f.
Memory allocation is performed (and the scheduling state updated) in line 12. For each function instance, we reserve one memory interval, possibly of size 0, which must fit inside one memory bank. This interval must allow the allocation of the code and local data of the C function (if it has not already been allocated for another instance of the same function) and of all dataflow variables that the instance uses that were not allocated with a previous function instance. Allocation inside this interval is fixed by the call to needed_ram in line 17 of algorithm ListSchedulingDriver. The code and data of the function always come first, according to the rules of Section 4.1. If memory allocation cannot be performed, the function returns NonSchedulable.
The remainder of function ScheduleBlockAtDateOnCPU performs schedulability analysis and updates scheduling state fields that are only used to speed up this analysis (inst_current_wcet and inst_wcat). Field inst_current_wcet maintains along the execution, for each function instance that has already been mapped, an over-estimation of its WCET, including the worst-case interferences from other already-mapped function instances, according to the timing model of Section 4. When ScheduleBlockAtDateOnCPU is called on a function instance fi, the code in lines 16-31 sets up the initial WCET estimate for fi and updates the estimation for all already mapped instances gi, according to the model of Section 4.3. The schedulability test is performed in lines 25 and 29 by checking that the current estimations do not become greater than the reserved time intervals, as explained in Section 5.3.
EXPERIMENTAL RESULTS
We evaluated our compilation flow on two large-scale avionics applications from two major companies, denoted A1 and A2 for confidentiality reasons. They are multi-period applications following a classical MAF/MIF execution pattern: the hyper-period of the application, called the major frame (MAF) (120ms in A1, 240ms in A2), is divided into a sequence of minor frames (MIFs) of equal length (5ms in A1, 15ms in A2). Thus, there are 24 MIFs in A1 and 16 in A2. Each function instance of the normalized integration program is statically confined to one of the MIFs through a combination of release and deadline requirements. A1 has more than 5,124 dataflow nodes and more than 36,000 variables before hyper-period expansion (more than 5 MB of text and data after compilation), and 18,672 function instances after normalization. All these functions are directly instantiated in the integration program, which consists of a single, very large node. This integration program directly exposes dataflow concurrency at the top level, so that further inlining of nested nodes is not necessary. A2 is quite different. It has a very hierarchical structure so that parallelism must be exposed through inlining. It has 4,792 function instances after normalization. The WCET of dataflow functions ranges from 37.5ns to 60.66μs in A1 and from 1 to 994μs in A2. By comparison, the total synchronization and coherency overhead (cf. Section 5.3) is less than 265ns per function instance.
The experiments with A1 and A2 have three objectives: to evaluate the scalability of the compilation toolflow, to evaluate the efficiency of the generated parallel code, and, finally, to validate the correctness of the generated code through execution on actual hardware. We validated the first two objectives on the full applications by virtually ignoring the SRAM limits of the Kalray compute cluster (1.65 MB available with the system configuration used for our experiments, cf. Section 3.2.1). For this purpose, we configured the parallel back-end to assume larger memory banks. We validated the third objective on A1 alone (due to code availability constraints). Since the application is too large to fit inside the SRAM of a Kalray compute cluster, a work-around consisted in parallelizing and compiling each one of the 24 MIFs separately. Their size is already quite significant, with 690 function instances and 687 KB of text and data after compilation, on average; these MIFs, when considered separately as applications, are denoted $A1_j$, with 0 ≤ j < 24. Note that we could also have considered an overlay mechanism to manage code/text memory at runtime. Such an abstraction remains challenging to implement in a hard real-time environment and is not yet available in our framework.
To evaluate scalability, we focused on the performance bottleneck of the compilation flow: the scheduling and allocation algorithms defined in Section 5.4. On a 4-core Intel Core i7 with 16 GB of RAM, the scheduling and allocation of the largest example (A1) on 8 (resp. 16) cores took 29.22s (resp. 42.97s), and that of A2 3.72s (resp. 3.51s). For $A1_0$, we plotted in Figure 10 the compilation speed as a function of the number of target cores (resulting code is unimplementable on the target platform beyond 16 cores). The graph is quasi-linear, as the scheduling algorithm attempts mapping on each core before choosing the best solution.
To evaluate the parallelization efficiency of our compilation flow, we must take into account the MAF/MIF organization of our case studies. For an application a parallelized on c ≥ 2 cores with p interference provisions, guaranteed performance is measured as the minimal MIF duration allowing application scheduling with our tool, denoted min_mif(a, c, p). We denote with op(a, c) the optimal interference provisions, i.e., the value of p that minimizes min_mif(a, c, p) for given a and c, and with min_mif_opt(a, c) = min_mif(a, c, op(a, c) ). For the sequential case (c = 1), min_mif(a, 1) is computed as the maximum of the per-MIF sum of block WCETs. 16 Guaranteed speedup of the parallel code with respect to the sequential reference is then defined as gso(a, c) = min_mif(a, 1)/min_mif_opt(a, c). An upper limit on guaranteed speedup for a given application is obtained using the critical path method [30] . For each MIF m, we determine its critical path cp(a, m) and then the parallelization limit of the application is computed as pl(a) = min_mif(a, 1)/ max m cp(a, m).
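The metrics defined above can be sketched in a few lines. The function names follow the notation of the text, but the signatures and all numeric values are hypothetical; in particular, the sketch takes the values of min_mif(a, c, p) as given for a few candidate provisions p, whereas the tool obtains them by scheduling.

```python
def min_mif_sequential(block_wcets_per_mif):
    """min_mif(a, 1): maximum over MIFs of the sum of block WCETs."""
    return max(sum(wcets) for wcets in block_wcets_per_mif)


def gso(min_mif_seq, min_mif_by_provision):
    """Guaranteed speedup, using the provision p that minimizes
    min_mif(a, c, p) for the given application and core count."""
    return min_mif_seq / min(min_mif_by_provision.values())


def pl(min_mif_seq, critical_paths):
    """Parallelization limit: sequential bound over the longest
    per-MIF critical path."""
    return min_mif_seq / max(critical_paths)


seq = min_mif_sequential([[400, 300, 300], [250, 250]])
speedup = gso(seq, {0.1: 400, 0.2: 250, 0.3: 300})
limit = pl(seq, [200, 100])
```

With these illustrative inputs, the sequential bound is 1,000, the guaranteed speedup is 4.0 (reached for p = 0.2), and the parallelization limit is 5.0.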
The speedup figures gso(A1, c) and gso(A2, c) for c ∈ {2, ..., 16} are provided (in blue) in Figure 11, along with the upper bounds pl(A1) and pl(A2) (in red). As a measure of the efficiency of the resource allocation algorithms (regardless of timing analysis precision), we also provide (in green) the guaranteed speedup figures produced by our tool if we assume that synchronization, cache coherency, and interferences impose no overhead.
16 Including function call overhead, but no synchronization, cache coherency, or interference.
Clearly, the parallelism exposed in A2 is quite limited (pl(A2) = 2.69). Our back-end virtually reaches this limit when mapping to c ≥ 4 processors (difference of less than 1% with respect to pl(A2)). This is made possible by our choice of mapping algorithm, but also by the fact that function WCETs are significantly larger than the various overheads, and that the critical path is significantly longer than other dependency paths.
A1 exposes significantly more parallelism at a fine grain. The guaranteed speedup is very good, coming within 12% of the theoretical limit. However, overheads come to dominate parallelization gains beyond a certain number of cores. To determine the causes of this behavior, it is interesting to consider the parallelization results on A1_j, 0 ≤ j < 24, in Figure 12 (left). Notice that most A1_j parallelize better than the whole A1 does. This is expected, because the results in Figure 11 must consider the worst case among the MIFs, and because during parallelization of A1 the MIFs constrain one another, resulting in lower overall performance. Also note that for all A1_j, performance is lost beyond a certain number of cores. To provide insight into the contribution of the various overheads to this performance loss, we provide in Figure 12 (right) the guaranteed speedup for A1_0 (in blue), as well as the guaranteed speedup obtained if we assume that all overheads are zero (in green) or that interference overheads alone are zero (in red). Note that the green line shows almost perfect parallelization. For the full applications (in Figure 11), parallelization is almost perfect until the parallelization limit is reached. For A1_0, however, this limit is larger than 16, so it is never reached in our experiments. The red line shows that synchronization and coherency overheads have an impact on performance, but performance still increases quasi-linearly with the number of cores.
The performance loss is therefore clearly due to memory access interferences, also known in the literature as saturation of the memory bandwidth [32]. This phenomenon is well-known in single- and multi-core scheduling, and our timing model predicts it, providing support to engineers in platform dimensioning.
To validate the correctness of the toolflow output, we have executed the parallel implementations of A1_j, 0 ≤ j < 24, on the test platform. Execution time measurements were compared with the worst-case bounds provided by our tool, and the measured execution time was always smaller than the worst-case bound. Note that while our real-time parallelization method is of the "correct-by-construction" type, this validation is by no means sufficient to guarantee the correctness of the implementations generated for real-life applications. It is done using only one application, and covers only real-time aspects and the absence of execution-time deadlocks. Significantly more validation work is needed to provide evidence that the resulting implementations can be used in real-life systems.
Fig. 13. Predicted vs. measured performance (average for A1_j, 0 ≤ j < 24).
We have also tried to ascertain the efficiency of the generated code from a purely measurement-based perspective. To obtain a measurement-based counterpart of gs and gso, we have measured the execution time of the parallel code and compared it to the measured execution time of a sequential implementation. We have done this for each of the 24 tasks A1_j, and then taken the average. Results are provided in Figure 13 (in red), along with the statically predicted execution time (strict upper bounds of execution time, in blue). The statistical significance of these results is low because measurements were performed for a single test vector for each A1_j.¹⁷ The performance of the code is lower than the statically predicted one, which is expected because the code was parallelized based on worst-case quantitative data. However, it remains efficient and exhibits the same pattern of performance loss due to memory bandwidth saturation.
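The measurement-based comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the average is taken, so the sketch assumes an arithmetic mean of per-task speedups; the function names and all timing values are hypothetical and are not measurements from the paper.

```python
from statistics import mean
from typing import List, Tuple


def average_measured_speedup(samples: List[Tuple[float, float]]) -> float:
    """Mean of sequential_time / parallel_time over all tasks.

    Assumption: the average is an arithmetic mean of per-task speedups.
    """
    return mean(seq / par for seq, par in samples)


def bounds_hold(measured: List[float], bounds: List[float]) -> bool:
    """True if every measured time is strictly below its static worst-case bound."""
    return all(m < b for m, b in zip(measured, bounds))


# Hypothetical (sequential, parallel) execution times for three tasks.
samples = [(100.0, 50.0), (90.0, 30.0), (80.0, 40.0)]
avg = average_measured_speedup(samples)

# Hypothetical parallel measurements against hypothetical static bounds.
ok = bounds_hold([50.0, 30.0, 40.0], [60.0, 35.0, 45.0])
```

With a single test vector per task, as noted above, such an average gives an indication of the trend rather than a statistically significant estimate.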
CONCLUSION AND FUTURE WORK
We designed and validated the first fully automated code generation flow capable of compiling a real-world control application into a parallel implementation that is both functionally correct and respects non-functional real-time requirements without making simplifying hypotheses concerning the overheads related to parallel/concurrent execution on off-the-shelf hardware. In particular, our flow does not require adding experience-based margins to computed WCET estimates, and thus guarantees the respect of real-time requirements.
One key element of our approach consists in embedding safe and precise timing analyses into the scheduling loop, and conveying precise memory mapping and interference information throughout the compilation and analysis flow. This avoids the pitfalls of methods that first build an implementation and only then perform schedulability analysis. Achieving this requires a tight integration of analysis and synthesis steps: the normalization phase that produces an SSA-like intermediate representation, the code generation steps of the dataflow synchronous program, the back-end C compiler and binary utilities (linker and loader), all the way to the parallel real-time scheduling and timing analysis. Integration ensures global consistency with respect to the timing model of the execution platform. The method provides good results on real-world applications, and is scalable. These results may percolate into industrial processes in the future, yet this may involve evolutions of the certification procedures and will certainly require major efforts in qualifying parallel-capable versions of the leading tools (e.g., Scade Suite and its KCG compiler).
From a scientific perspective, many challenges remain. The experimental evaluation emphasized the importance of evenly distributing the application load among its minor frames. This optimization problem is related to I/O latency requirements specific to each application and industrial workflow, which are generally not exposed in the source model, but rather at more abstract system design levels. In the longer term, our success on one specific shared-memory architecture motivates an extension toward more complex or different hardware platforms. Extension to TC27x [31] requires taking into account the NUMA model. Extension to the full Kalray MPPA many-core and other time-predictable architectures [25, 62] requires taking into account network-on-chip (NoC) interconnects and/or shared caches (e.g., by cache partitioning). On more classical multi-cores (ARM, POWER, x86) featuring speculative out-of-order execution or hardware cache coherence, it does not seem realistic to provide static hard real-time guarantees, and different approaches with lower safety expectations should be considered.
