Abstract: Threads are a wildly non-deterministic model of computation, difficult to analyze in the general case (the wolf of our title). But when the system specification is a deterministic dataflow program written in Lustre, Scade or Simulink, the implementation process should build not just multi-threaded C code, but (first and foremost) a richer model exposing the easy-to-analyze dataflow, race-free organization of the computations performed by the implementation (the titular sheep). We propose a language for such implementation models. It allows the formulation of functional correctness properties the multi-threaded implementation must satisfy for an avionics use case running on a commercial many-core.
I. INTRODUCTION
Mastering concurrency is difficult, and yet hardware design resolutely moves towards increasingly massive parallelism based on the use of chip-multiprocessors. Threads are one of the major programming paradigms for such multi- and many-core systems. They arguably provide the best portability and the finest control over resource allocation, which are both essential in the design of embedded applications that need to get the best guaranteed performance out of resource-constrained hardware.
Such expressiveness comes at a price. As a model of computation, threads are wildly non-deterministic and non-compositional [15], making programming, formal analysis, and implementation difficult [18], [13]. This explains why multi-threaded software is often bug-ridden even in the context of critical systems [14].
But there is also good news: in many industrial contexts (avionics, automotive, etc.) the use of threads is tightly controlled. We consider in this paper the particular case where the functional specification of the system is written in a synchronous dataflow language such as Lustre/Scade [8] or a sub-set of Simulink [22]. In this case, multi-threaded implementations have a particular structure and particular properties: the number of threads is fixed, each one implementing a recurrent task obtained by sequencing a fixed set of dataflow blocks (or parts thereof, obtained by parallelization). When taken individually, such properties largely facilitate the formal analysis of multi-threaded systems. But in many cases, the multi-threaded implementation preserves a fundamentally dataflow structure, with specific rules on the way platform resources (shared memory, semaphores) are used. Such implementations are not only data race free (DRF) in the restricted sense of [13], but also deterministic.

a) Contribution: We propose a language, named InteLus, for the description of such implementations. It is a sub-set of Lustre extended with annotations representing mapping and code generation choices. Such extensions are common in existing literature, but our language and modeling approach go beyond previous work in one fundamental way: implementation models specified in InteLus are strictly richer than the multi-threaded C code we want to generate. InteLus allows the representation of all mapping decisions needed for multi-threaded code generation. InteLus's representation of threads and thread synchronization is a sub-case of the C11/pthread concurrency model [3]. Therefore, C code can be obtained by selectively putting elements of the InteLus program into C and linker script syntax without making any further mapping decision, a process we call pretty-printing, exemplified in Section I-A.
Annotations are covered by the InteLus operational semantics. This allows us to formally define (but not yet prove) the correctness of implementation models. To facilitate the definition of the correctness properties, implementation models are endowed with not one, but two semantics: the synchronous semantics of Lustre (which simply discards mapping annotations) and the machine semantics, defined in Section III-C, which interprets the program and its annotations as a multithreaded imperative program. This dual semantic nature of our implementation models enables us to envision an original approach to proving implementation correctness, based on 3 proof obligations:
1) Refinement: Under synchronous semantics, the Lustre specification and the InteLus implementation model are equivalent modulo a pipelining transformation.
2) Mapping: Executing the implementation model under machine semantics produces the same sequences of values as those produced by the same model under synchronous semantics.
3) Code generation: The machine semantics of the implementation model faithfully describes the behavior of the multi-threaded C code produced by pretty-printing.

Outline: The bulk of this paper is dedicated to building the formal apparatus needed to state the second proof obligation. We start by providing a motivating example, and by reviewing (in Section I-B) previous work. Section II presents the dataflow sub-set of InteLus (its syntax, dataflow semantics, and a more complex implementation model example). Section III defines the mapping annotations and the machine semantics of InteLus. Sections II and III also provide implementation correctness and semantics preservation criteria. Section IV presents our experimental results, and Section V concludes.

A. Motivating example

Fig. 1 provides a simple dataflow program, a very simple C implementation with two threads, and the corresponding implementation model (in the middle column). Mapping is done on a shared memory multi-core satisfying the requirements of Section III-A. Such an implementation can be manually written or automatically synthesized using existing tools [6], [11]. However, in this paper we are not concerned with synthesis issues. Instead, we focus on defining the syntax, semantics, and correctness of implementation models.
The program in Fig. 1 (a) uses a simplified Lustre [8] syntax, presented in Section II, to define a simple producer-consumer application with a single communication variable x. Even for this trivial example, a parallel (two-core) C implementation is already quite complex. To ensure correct operation sequencing and communication on the multi-core, calls to Prod and Cons are surrounded by:
• Mutex operations (lock and unlock API calls) ensuring that production happens before consumption, and that consumption is completed when a communication variable is reused for production.
• Data cache operations (flush and inval API calls) implementing the memory coherency protocol ensuring that the consumer uses the correct data. Explicit cache operations are only needed on platforms without hardware cache coherency such as our test platform. On more classical POWER or ARM multi-cores, they can be simply discarded, as the semantics of lock and unlock (of either the pthread or C11 mutexes) ensures the needed coherency.
The multi-threaded implementation consists not only of C code, but also comprises GCC annotations and the linker script defining memory allocation. Such tightly-controlled mapping is common in critical embedded systems. In avionics applications like our case study, the worst case execution time must be demonstrated for normal conditions, but the application must also be robust to "external factors". The choice of a mutex-synchronized implementation improves robustness by guaranteeing the respect of the functional semantics regardless of timing aspects. Providing execution time guarantees can then be done through tight control of memory allocation and synchronization, and through the use of hardware with good support for timing predictability. These design choices, covered elsewhere [5] , reduce timing variability and facilitate timing analysis [21] .
The implementation model of Fig. 1(b) consists of a dataflow program (in black and blue) extended with annotations defining all the aspects of its mapping (in red). The dataflow program uses some extensions to Lustre (in blue) allowing the description of synchronization. These extensions, defined in Section II, include the synchronization data type event and the wait and done constructs that allow the definition of sequencing constraints not implied by data variables. The dataflow implementation program provides a precise functional model of the execution on platform. For instance, specification variable x is replaced here with three variables x, x_cpu0, and x_cpu1 allowing the representation of the various states of the memory system where the value produced on processor cpu0 has not yet been propagated to the RAM or to the cache of processor cpu1. The implementation model provides dataflow interpretations for the various API calls. For instance, in line 17 the dataflow interpretation of flush ensures that the local value of x_cpu0 has been propagated onto its RAM counterpart x, and in line 18 the dataflow interpretation of unlock produces a token (the special literal top) that can be consumed later by a lock call. The equations in lines 11 and 12 are not part of threads. They provide the semantics of platform mutexes.
Under the application-specific code structuring hypotheses detailed in Section III-A, the mapping information of Fig. 1(b) is exhaustive. It allows the generation of all the C code, GCC annotations, and linking directives of Fig. 1(c) by simple pretty-printing, as no new mapping decisions are needed. For instance, the list of equations of a thread is transformed line-by-line into the sequence of function calls forming the body of the infinite loop of the corresponding C thread.
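The line-by-line nature of pretty-printing can be illustrated with a minimal Python sketch. It is not the authors' tool: the function `pretty_print_thread`, the thread name `thread_cpu0`, and the mutex names `m_empty` and `m_full` are hypothetical, and the initialization code and global barrier are left out.

```python
def pretty_print_thread(name, calls):
    """Render a thread as a C function whose body is an infinite loop.

    `calls` is an ordered list of (function_name, [argument strings]).
    The order is preserved: no rescheduling decision is taken here.
    """
    lines = [f"void {name}(void) {{", "  for (;;) {"]
    for func, args in calls:
        # One equation of the thread becomes exactly one call.
        lines.append(f"    {func}({', '.join(args)});")
    lines += ["  }", "}"]
    return "\n".join(lines)


# Hypothetical producer thread; all names below are illustrative only.
producer = pretty_print_thread(
    "thread_cpu0",
    [
        ("lock", ["&m_empty"]),
        ("Prod", ["&x"]),
        ("flush", ["&x"]),
        ("unlock", ["&m_full"]),
    ],
)
print(producer)
```

The sketch only concatenates text, which is the point: every decision (order of calls, arguments, thread placement) is assumed to be already present in the input.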
B. Related work and originality
Much work exists on parallel application mapping (e.g. [2] , [1] , [24] ), but it involves a non-trivial code generation phase that escapes formal analysis, covering at least some of the aspects we consider here: thread construction, synthesis of synchronization and memory consistency protocols, etc. The language we propose allows formal reasoning on the correctness of all these mapping and code generation decisions. The only step not covered by our correctness formalization is pretty printing, which moves to a C syntax while preserving unchanged the thread structure and the allocation.
Another topic we do not cover in this paper is that of normalizing complex Lustre/Scade specifications to put them in the simplified syntax of InteLus. This aspect is orthogonal to our mapping-oriented perspective, and has been covered in previous work [4] , [19] , [11] .
Our modeling language and method do not cover timing aspects. Thus, they do not allow reasoning about real-time correctness. For time-triggered systems [24] this also means that we cannot reason about functional correctness.
Our results are closely related to previous work on providing (operational) semantics to synchronous languages and to concurrent C. Our machine semantics is close in form to the operational C11 semantics of [13], most notably to the variant without the "promise" rule, which is adapted to DRF programs like the ones we synthesize. Main differences are that we consider a very restricted concurrency model, and that we consider a particular type of shared memory architecture and the associated memory allocation problem (which previous semantics [13], [3] do not cover). From a dataflow language perspective, our paper includes novel operational semantics for Lustre covering the language extensions (triggers and synchronization-only variables) and the machine semantics that formalizes execution on platform at the dataflow level.

Our main contribution is not semantics per se, but the formalization of proof obligations ensuring implementation correctness. This is the first step towards formal proof of correctness for the multi-threaded implementation of Lustre programs. In this sense our work is related to previous results on the formally verified compilation of dataflow synchronous languages [4], [19]. By comparison, our work extends dataflow modeling to cover multi-processor implementation issues (mutex synchronization, memory consistency), but we have not yet proved the correctness of the method.
The objective of reducing the semantic distance between specification and implementation is also covered in [25]. The differences are that we rely on a dataflow model with a simpler control structure, and that we consider aspects such as synchronization, memory consistency, and memory allocation.
From a more classical modeling point of view our work does not aim for the generality of UML/MARTE [17] , but rather to provide a specific solution to the problem of correct multithread implementation of dataflow synchronous specifications. In this, it joins previous work that enriches dataflow languages with annotations describing non-functional requirements [2] , [1] , [24] .
II. MODIFICATIONS TO LUSTRE

This section introduces a synchronous dataflow language for system-level functional specification and for defining the dataflow part of implementation models. We could define this language as a strict extension of Lustre [8] with new constructs for our modeling needs. However, the system-level and mapping-oriented perspective makes some major features of Lustre/Scade unneeded, and including them would only pollute our presentation. For this reason, we remove them. We shall not insist here on the syntax and semantics of Lustre, which has been covered elsewhere [8], [9]. Instead, we focus on the various modifications.
Fig. 2. Lustre language subset (in black). InteLus extensions (blue). Extension with mapping information (red).
We call the new language InteLus, for Integration Lustre, and its syntax is provided in the black and blue parts of Fig. 2 . Unlike Lustre and Scade, which also allow the programming of the sequential tasks of an embedded system, InteLus is only designed to allow the system-level integration of these tasks. An InteLus specification never needs to include/use another (no modularity). For this reason, full-fledged interface definitions are not needed. We only identify input variables, which intuitively correspond to memory-mapped input devices.
InteLus assumes that all sequential tasks have already been built, taking the form of sequential functions called from the InteLus program, like Prod and Cons in Fig. 1(a,b). All memory used by these functions must be exposed to the memory allocation algorithms under the form of dataflow variables.

InteLus incorporates the event type of Signal [16]. Variables of this type carry no information, representing pure synchronization. In Fig. 1(b) and Fig. 3 we use them to provide the dataflow interpretation of mutex operations. We also use them to define control dependencies between equations that do not exchange data. This is done through the use of the novel wait and done constructs. When placed in front of an equation, wait($s_1, \ldots, s_k$) will delay the start of the equation until a top value (a token) can be read on each of the variables $s_1, \ldots, s_k$. When placed in front of an equation, done($s_1, \ldots, s_k$) will write top on each of the variables $s_1, \ldots, s_k$ after the completion of the execution of the equation.
Sequencing dataflow equations to build threads running in an asynchronous environment requires a normalization phase. For each equation of a thread, this phase builds a guard, which is the (possibly empty) cascade of tests needed to determine, at each execution cycle where the control reaches the equation, whether the equation is executed or not, thus allowing control to be passed on in sequence (footnote 6). Guards are placed in front of equations. A guard always starts with a variable of type event (or the literal top) which identifies the trigger event of the cascade of tests. We require that all equations of a thread have the same trigger. It represents the event triggering iterations of the thread body. When the trigger is top, no external event is needed to trigger iterations, and the thread body is an infinite loop (like in our example). Triggers different from top allow the representation of interrupt-driven tasks (not present in our examples).
In each guard, the trigger is followed by a sequence of tests and synchronizations. The test of Boolean variable C is represented with "on C". The constructs wait and done allow the definition of control dependencies at all levels of guard decoding. Consider the following equation:
At each cycle where u is present and x=true, a top value is waited for on variable a after the test on x was performed and before the test on y is executed. During the same cycles, a top value is written on variable b after f is executed, if y=true, or just after the test on y, if y=false. Note how, unlike previous uses of guards in synchronous languages [23] , our guards focus on synchronization, clearly representing the flow of control from trigger to cascade of tests and synchronizations, until passing control in sequence.
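The order of tests and synchronizations described in this paragraph can be replayed with a small Python sketch. It is only a model under assumptions: event variables a and b are represented as token queues, the function name `guarded_cycle` and the trace labels are hypothetical, and a missing token on a is reported instead of actually blocking.

```python
from collections import deque

TOP = "top"  # the single value carried by variables of type event


def guarded_cycle(x, y, a, b, f):
    """Replay one cycle in which trigger u is present.

    a, b: token queues standing for the event variables.
    Returns the ordered list of steps taken during the cycle.
    """
    trace = ["test x"]
    if not x:
        return trace                  # equation skipped, control passes on
    if not a:
        return trace + ["blocked on a"]
    a.popleft()                       # wait(a): consume one top token
    trace.append("wait a")
    trace.append("test y")
    if y:
        f()                           # the guarded function call
        trace.append("f")
    b.append(TOP)                     # done(b): after f, or after the test on y
    trace.append("done b")
    return trace


a, b = deque([TOP, TOP]), deque()
print(guarded_cycle(True, True, a, b, lambda: None))
print(guarded_cycle(True, False, a, b, lambda: None))
```

In both calls a token ends up on b, which matches the statement that done is performed whether or not f is executed, as long as x is true.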
An InteLus program featuring non-trivial guards is provided in Fig. 3 . This program is an optimized, pipelined implementation of the program in Fig. 1(a) . It contains several features not present in the non-optimized version. The Boolean variable c is tested by guards to allow the incremental activation of equations during the prologue of the pipelined implementation. Software pipelining [7] requires some memory replication, to allow Prod and Cons to work in parallel on different copies of variable x. The copy used by Prod is located at address 0x20500, and that used by Cons at address 0x30500. The copy from one location to another is realized at each cycle by the assignment in line 21.
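The effect of the prologue flag and of the per-cycle copy can be sketched in Python as follows. This is a sequential simulation under assumptions, not the program of Fig. 3: the names `x1` and `x2` stand for the two copies of x, `c` plays the role of the Boolean prologue variable, and `prod` and `cons` are hypothetical stand-ins for Prod and Cons.

```python
def sequential(prod, cons, cycles):
    """Reference behavior: Cons consumes in the cycle where Prod produces."""
    return [cons(prod(n)) for n in range(cycles)]


def pipelined(prod, cons, cycles):
    """Two-stage software pipeline over two copies of x."""
    out = []
    x2 = None   # copy read by Cons
    c = False   # False only during the prologue (first cycle)
    for n in range(cycles):
        x1 = prod(n)              # stage 1 works on its own copy
        if c:
            out.append(cons(x2))  # stage 2 works on the previous value
        x2 = x1                   # per-cycle copy between the two locations
        c = True
    return out


prod = lambda n: n * n
cons = lambda v: v + 1
print(sequential(prod, cons, 4))
print(pipelined(prod, cons, 5))
```

Within one iteration the calls to `prod` and `cons` touch different copies, which is what would let them run in parallel on two cores. The pipelined version delivers the same output sequence one cycle later.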
Like its non-optimized counterpart, the InteLus program features equations needed for semantic completeness but generating no code in the C threads. For instance, the equation in line 14 generates no C code, because x2 and x1 share the same memory location.
Notation "_" is an lvalue for equation output values that are not needed. It helps reduce the number of variables, but can only be used on values of type event, which require no allocation.
A. Synchronous semantics principles
Semantics of Lustre have already been defined [8], [9]. Our language extensions (guards) are non-trivial, but the space of this paper is not sufficient to define the full synchronous semantics of InteLus. We provide here its principles (needed later in the paper) and refer the interested reader to [10].
The semantics of statements and programs is described through structural operational semantics (SOS) rules used to derive transitions of the form p
Here, p and p′ are

Footnote 6: The decision not to execute is particularly valuable in an asynchronous environment where absence of an event cannot a priori be detected.
We say that a program is correct if a trace exists for every sequence $I_1, \ldots, I_n$ assigning values different from nil to every input variable. We denote the set of all traces of program p with Traces(p). Given the determinism of the synchronous semantics, Traces(p) can be seen as a sub-set of $R_V^*$, where V is the set of variables of p.
B. Kahnian asynchronous interpretation
The determinism of the InteLus dataflow equations means that we can endow InteLus programs with asynchronous Kahn process networks (KPN) [12] semantics. This allows reasoning about synchronization in an asynchronous parallel environment before considering resource allocation. It enables us to state (in Section II-B4) well-formedness properties ensuring implementability. A strong semantics preservation property, defined in Section II-B3, links the asynchronous interpretation of an InteLus program to its synchronous interpretation. The small-step operational form of the Kahnian semantics also facilitates the link with the machine semantics of Section III-C, and thereby the definition of implementation correctness.
1) Notations:
Under kahnian semantics, program equations are deterministic Kahn processes communicating through infinite lossless FIFO channels corresponding to the variables. The semantics is asynchronous. Absence of a variable value cannot be reacted upon (only presence). Execution traces are represented with histories assigning sequences of values different from nil to each variable identifier. Given a set of variables V , we denote with Hist(V ) the set of finite histories h assigning to each v ∈ V a sequence of values
The asynchronous observation of a synchronous trace discards synchronization, retaining for each variable the sequence of values different from nil. It is defined as $\delta : R_V^* \to \mathrm{Hist}(V)$, where $\delta(t_1 t_2) = \delta(t_1)\,\delta(t_2)$ for all $t_1, t_2 \in R_V^*$ and, for every $r \in R_V$, $\delta(r)(v) = r(v)$ if $r(v) \neq \mathrm{nil}$ and $\delta(r)(v) = \varepsilon$ (the empty word) if $r(v) = \mathrm{nil}$.
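As an illustration, the observation function can be written in a few lines of Python. The encoding is an assumption made for the sketch: a reaction is a dictionary from variable names to values, nil is represented by `None`, and a history is a dictionary from variable names to lists.

```python
NIL = None  # assumed encoding of the absent value nil


def delta(trace, variables):
    """Asynchronous observation of a synchronous trace.

    trace: list of reactions, one per synchronous cycle.
    Returns, for each variable, its non-nil values in order of occurrence.
    """
    history = {v: [] for v in variables}
    for reaction in trace:
        for v in variables:
            value = reaction.get(v, NIL)
            if value is not NIL:      # nil contributes the empty word
                history[v].append(value)
    return history


t1 = [{"x": 1, "c": True}, {"x": NIL, "c": False}]
t2 = [{"x": 3, "c": True}]
print(delta(t1 + t2, ["x", "c"]))
```

Because the function simply appends per variable, observing the concatenation of two traces gives the concatenation of their observations, which is the defining equation on $t_1 t_2$. Note that the Boolean value false is a regular value and is kept.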
2) Program state:
In the classical definition of Kahn process networks [12]

We say that a state $\langle p, h \rangle$ is complete if, for every variable v defined by v = k fby y, all read pointers are equal to len(h(v)) − 1, and for every other variable w the read pointers are equal to len(h(w)). Complete states can be put in direct relation with states of the synchronous semantics. We denote with strip($\langle p, h \rangle$) the synchronous state term obtained from $\langle p, h \rangle$ by removing the history h and the read point annotations.
3) Semantic rules and semantics preservation: Semantics is presented in Fig. 4, under SOS form. Concurrency is by interleaving. At each transition, exactly one equation is executed among those that can be executed (cf. rule (interleave)). The rules for guards (on*) and (trig-*) are optional. Without them, the semantics works for programs without guards. Predicate adv(h, v, n) determines whether we can read the n-th value of variable v in history h. It is used to determine if enough input is available to enable the execution of a transition.
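The interplay between adv and interleaving can be sketched in Python for the simplest kind of equation, a single-output function call. This is an illustrative model, not the full rule set: it assumes adv(h, v, n) holds when len(h(v)) > n, it stores one read pointer per equation, and the class `Call` and function `run` are hypothetical names.

```python
import random


def adv(h, v, n):
    """True when the n-th value of variable v can be read in history h."""
    return len(h[v]) > n


class Call:
    """Equation `out = f(ins...)` with read pointer k on its inputs."""

    def __init__(self, out, f, ins):
        self.out, self.f, self.ins, self.k = out, f, ins, 0

    def enabled(self, h):
        return all(adv(h, y, self.k) for y in self.ins)

    def fire(self, h):
        args = [h[y][self.k] for y in self.ins]
        h[self.out].append(self.f(*args))   # extend the history of `out`
        self.k += 1                         # advance the read pointer


def run(equations, h, rng):
    """Fire exactly one enabled equation per step until none is enabled."""
    while True:
        ready = [e for e in equations if e.enabled(h)]
        if not ready:
            return h
        rng.choice(ready).fire(h)


def example(seed):
    h = {"i": [1, 2, 3], "x": [], "y": []}
    eqs = [Call("x", lambda v: 2 * v, ["i"]), Call("y", lambda v: v + 1, ["x"])]
    return run(eqs, h, random.Random(seed))


print(example(0))
```

The scheduler picks the equation to fire at random, yet every seed yields the same final histories, which is the determinism expected from Kahn processes.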
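$$\frac{\forall i \in 1..m : \mathrm{adv}(h, Y_i, k) \qquad (x_1, \ldots, x_n) = F_{fid}(h_k(Y_1), \ldots, h_k(Y_m))}{\langle (X_1, \ldots, X_n) = fid(Y_1^{k}, \ldots, Y_m^{k}),\ h \rangle \to \langle (X_1, \ldots, X_n) = fid(Y_1^{k+1}, \ldots, Y_m^{k+1}),\ h\,\delta(\langle X_i \mapsto x_i \mid i = 1, \ldots, n \rangle) \rangle} \quad \text{(fcall)}$$

$$\frac{\mathrm{adv}(h, Y, n)}{\langle X = k\ \mathrm{fby}\ Y^{n},\ h \rangle \to \langle X = h_n(Y)\ \mathrm{fby}\ Y^{n+1},\ h\,\delta(\langle X \mapsto h_n(Y) \rangle) \rangle} \quad \text{(fby)}$$

$$\frac{\mathrm{adv}(h, C, m) \qquad h_m(C) = \mathrm{false} \qquad \mathrm{adv}(h, X, m)}{\langle Y = X^{m}\ \mathrm{when}\ C^{m},\ h \rangle \to \langle Y = X^{m+1}\ \mathrm{when}\ C^{m+1},\ h \rangle} \quad \text{(when-)}$$

$$\frac{\mathrm{adv}(h, C, m) \qquad h_m(C) = \mathrm{true} \qquad \mathrm{adv}(h, X, m)}{\langle Y = X^{m}\ \mathrm{when}\ C^{m},\ h \rangle \to \langle Y = X^{m+1}\ \mathrm{when}\ C^{m+1},\ h\,\delta(\langle Y \mapsto h_m(X) \rangle) \rangle} \quad \text{(when+)}$$

$$\frac{\mathrm{adv}(h, C, m) \qquad h_m(C) = \mathrm{true} \qquad \mathrm{adv}(h, X, n)}{\langle Z = \mathrm{merge}\ C^{m}\ X^{n}\ Y^{p},\ h \rangle \to \langle Z = \mathrm{merge}\ C^{m+1}\ X^{n+1}\ Y^{p},\ h\,\delta(\langle Z \mapsto h_n(X) \rangle) \rangle} \quad \text{(merge+)}$$

$$\frac{\mathrm{adv}(h, C, m) \qquad h_m(C) = \mathrm{false} \qquad \mathrm{adv}(h, Y, p)}{\langle Z = \mathrm{merge}\ C^{m}\ X^{n}\ Y^{p},\ h \rangle \to \langle Z = \mathrm{merge}\ C^{m+1}\ X^{n}\ Y^{p+1},\ h\,\delta(\langle Z \mapsto h_p(Y) \rangle) \rangle} \quad \text{(merge-)}$$

$$\frac{\langle p, h \rangle \to \langle p', h\,\delta(x) \rangle \qquad \forall i : \mathrm{adv}(h, S_i, k) \qquad \forall i, j : (x(S_i) = x(T_j) = \mathrm{nil}) \wedge (S_i \neq T_j)}{\langle r(p, k),\ h \rangle \to \langle r(p', k+1),\ h\,\delta(x \cup \langle T_i \mapsto \mathrm{top} \mid i = 1, \ldots, n \rangle) \rangle} \quad \text{(wait-done)}$$

$$\frac{\mathrm{adv}(h, C, k) \qquad h_k(C) = \mathrm{false}}{\langle \mathrm{on}\ C^{k}\ eq,\ h \rangle \to \langle \mathrm{on}\ C^{k+1}\ eq,\ h \rangle} \quad \text{(on-)}$$

$$\frac{\mathrm{adv}(h, C, k) \qquad h_k(C) = \mathrm{true} \qquad \langle eq, h \rangle \to \langle eq', h\,\delta(r) \rangle}{\langle \mathrm{on}\ C^{k}\ eq,\ h \rangle \to \langle \mathrm{on}\ C^{k+1}\ eq',\ h\,\delta(r) \rangle} \quad \text{(on+)}$$

$$\frac{\langle eq, h \rangle \to \langle eq', h' \rangle}{\langle \mathrm{top}\ eq,\ h \rangle \to \langle \mathrm{top}\ eq',\ h' \rangle} \quad \text{(trig-top)}$$

$$\frac{\mathrm{adv}(h, S, k) \qquad \langle eq, h \rangle \to \langle eq', h' \rangle}{\langle S^{k}\ eq,\ h \rangle \to \langle S^{k+1}\ eq',\ h' \rangle} \quad \text{(trig-comp)}$$

$$\frac{\langle eq_i, h \rangle \to \langle eq_i', h' \rangle}{\langle eq_1, \ldots, eq_i, \ldots, eq_n,\ h \rangle \to \langle eq_1, \ldots, eq_i', \ldots, eq_n,\ h' \rangle} \quad \text{(interleave)}$$

Fig. 4. KPN semantics. Predicate adv(h, X, k) is defined as len(h(X)) > k.

The synchronous and Kahnian asynchronous semantics of InteLus are tightly related:

Theorem 1 (Semantics preservation). Let p be an InteLus program that is correct under synchronous semantics. Then: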
Proof sketch: Direct application of Theorem 4.4 of [20]. The tedious part of the proof consists of interpreting equations of p as micro-step state transition systems (μSTS) and proving that the synchronous (resp. asynchronous) composition of the μSTSs associated with equations faithfully represents the synchronous (resp. asynchronous) semantics of p. Once this is done, Theorem 4.4 can be applied, after noting that the synchronous correctness of the program ensures that the asynchronous composition of μSTSs is non-blocking.
4) Properties ensuring implementability:
For an implementation model to allow straightforward C code generation through "pretty printing", its dataflow part must satisfy a number of properties that are naturally expressed under Kahnian semantics. These properties are also amenable to low-complexity verification and synthesis based on sufficient structural properties of the program.

a) Boundedness: When the program of Fig. 1(a) is executed under Kahnian semantics, Prod may be executed an indefinite number of times before Cons is executed once. Such a behavior requires infinite storage, and is not amenable to static resource allocation. We require that our implementation models (like those in Fig. 1(b) and Fig. 3) ensure that for each variable a single value needs to be stored at any time:

Definition (1-boundedness). An InteLus program p is called 1-bounded if, in any reachable state $\langle p', h \rangle$ of the Kahnian semantics, for every variable v that is not an input and for every read point annotation x of v, we have len(h
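The storage problem motivating this requirement can be made concrete with a small Python simulation. It rests on assumptions made for the sketch: the backlog of x is measured as the number of values written but not yet read, the schedule is a string of attempted firings, and the single `empty` token stands for an event variable that Prod waits on and Cons produces.

```python
def max_backlog(schedule, synchronized):
    """Replay a schedule of 'P' (Prod) and 'C' (Cons) firing attempts.

    Returns the largest number of values of x written but not yet read.
    When `synchronized`, Prod needs a token on `empty`, returned by Cons.
    """
    written = read = 0
    empty = 1                      # one token: the single slot is free
    worst = 0
    for step in schedule:
        if step == "P":
            if synchronized:
                if empty == 0:
                    continue       # Prod is not enabled, attempt skipped
                empty -= 1
            written += 1
        elif step == "C" and read < written:
            read += 1
            if synchronized:
                empty += 1
        worst = max(worst, written - read)
    return worst


print(max_backlog("PPPPC", synchronized=False))  # backlog grows with each P
print(max_backlog("PPPPC", synchronized=True))   # backlog stays at one value
```

Without the token, the backlog grows with the number of consecutive Prod firings, so no static allocation is possible. With it, one storage location per variable suffices for every schedule.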
b) Explicit synchronization: Under dataflow semantics, data accesses are synchronizing. Under machine semantics, they are not. Assume that a variable v is used for communication between equations $eq_1$ and $eq_2$ that will be mapped on different threads. Then, variables of type event, later mapped onto semaphore operations, must be used to represent the synchronization associated with these data accesses. The completeness of the synchronization can be checked on the dataflow implementation model. Indeed, if enough synchronization has been added, then the Kahnian semantic transitions of $eq_2$ do not need to perform the synchronization tests on v (the adv conditions in Fig. 4).
c) Implementation of fby with no internal memory: The general translation of fby into C code requires the use of internal storage. However, all memory must be exposed to optimization, so we always impose scheduling constraints (through wait/done dependencies) ensuring that internal storage is not needed.

III. MAPPING ANNOTATIONS AND MACHINE SEMANTICS

Previous sections have introduced the syntax and semantics of the dataflow part of InteLus. We now introduce mapping annotations (the red syntax elements in Figs. 1, 2, and 3) and the machine semantics of full implementation models.
A. Structure of an implementation
We target shared memory multi-core architectures. This definition covers classical multi-core processors, or parts of larger architectures, such as the compute clusters of Kalray MPPA many-core processors (www.kalray.eu), which we use as test platform. To simplify the presentation, we assume that our platforms have a single address space with physical addressing.
On such hardware, we map statically scheduled, bare metal implementations. As in our example of Fig. 1, each CPU is assigned one sequential thread, a function that never terminates. An initialization function is called by one of the threads when execution starts. A global synchronization barrier (function global_barrier) separates initialization from the rest of the execution. We assume the software is already deployed, meaning that we do not cover here boot-related issues. Each thread is loaded at a fixed address along with its local data (if any). The same is true for functions called by threads. For each thread, the stack base is statically defined.
Synchronization between threads is done using mutexes with the classical pthread or C11 semantics. We use two mutex primitives. Primitive unlock(l) should only be called when mutex l is locked (false). Otherwise, the behavior is undefined. It changes the state of l to true. Primitive lock(l) waits until l is unlocked (true) and changes its state to false. If multiple lock(l) statements are waiting when unlock(l) is executed, then only one of them is nondeterministically chosen and executed. Our implementations will ensure that no concurrency exists between lock(l) calls.
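The two primitives can be modeled in Python to make their Boolean reading explicit. This is a sketch under assumptions, not the platform API: the class `PlatformMutex` and the function `handshake` are hypothetical, the undefined case of unlock on an unlocked mutex is turned into an exception, and the producer/consumer handshake with two mutexes is one plausible usage pattern.

```python
import threading


class PlatformMutex:
    """Boolean mutex model: True means unlocked, False means locked."""

    def __init__(self, unlocked):
        self._unlocked = unlocked
        self._cond = threading.Condition()

    def lock(self):
        with self._cond:
            self._cond.wait_for(lambda: self._unlocked)  # wait until true
            self._unlocked = False

    def unlock(self):
        with self._cond:
            if self._unlocked:
                # Undefined in the text; the sketch chooses to reject it.
                raise RuntimeError("unlock called on an unlocked mutex")
            self._unlocked = True
            self._cond.notify()                          # wake one waiter


def handshake(values):
    """Pass `values` from a producer thread to a consumer thread."""
    empty, full = PlatformMutex(True), PlatformMutex(False)
    slot, received = [None], []

    def producer():
        for v in values:
            empty.lock()        # wait until the previous value is consumed
            slot[0] = v
            full.unlock()       # production happens before consumption

    def consumer():
        for _ in values:
            full.lock()
            received.append(slot[0])
            empty.unlock()      # the slot may now be reused

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(handshake([1, 2, 3]))
```

In this pattern each mutex has a single thread calling lock on it, so the nondeterministic choice among several waiting lock calls never arises.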
On platforms without hardware cache coherency, such as our test platform, we do not rely on lock and unlock to ensure memory coherency, using instead explicit cache operations (footnote 8). Primitive inval(addr) invalidates the cache line containing address addr. The next access to an address in this cache line is guaranteed to be a cache miss, which forces loading from the shared RAM. Primitive flush(addr) forces the writing of local modifications to the shared RAM, and enforces a memory barrier afterwards. When the execution of flush completes, all local modifications have already been written to shared RAM.
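A toy Python model helps to see why both primitives are needed. It makes assumptions that go beyond the text: caches are write-back, each variable occupies its own cache line, and the class `SharedMemory` with its `read` and `write` methods is a hypothetical stand-in for ordinary loads and stores.

```python
class SharedMemory:
    """Shared RAM plus one private, non-coherent cache per CPU."""

    def __init__(self, cpus):
        self.ram = {}
        self.cache = {cpu: {} for cpu in cpus}

    def write(self, cpu, addr, value):
        self.cache[cpu][addr] = value            # local modification only

    def read(self, cpu, addr):
        lines = self.cache[cpu]
        if addr not in lines:                    # miss: load from shared RAM
            lines[addr] = self.ram.get(addr)
        return lines[addr]

    def flush(self, cpu, addr):
        if addr in self.cache[cpu]:              # push local value to RAM
            self.ram[addr] = self.cache[cpu][addr]

    def inval(self, cpu, addr):
        self.cache[cpu].pop(addr, None)          # next read will miss


X = 0x20500
mem = SharedMemory(["cpu0", "cpu1"])
mem.ram[X] = 0
stale_before = mem.read("cpu1", X)   # cpu1 caches the old value
mem.write("cpu0", X, 42)
mem.flush("cpu0", X)
stale_after = mem.read("cpu1", X)    # still served from cpu1's cache
mem.inval("cpu1", X)
fresh = mem.read("cpu1", X)          # reloaded from RAM
print(stale_before, stale_after, fresh)
```

The three observed values correspond to the three views of a variable that an implementation model distinguishes: the producer's cached copy, the RAM copy, and the consumer's cached copy.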
Thread synchronization must ensure that the code is data race free (DRF) and deterministic. This amounts to ensuring that any two data accesses of which one is a write are statically ordered. Mutex operations must also ensure that unlock(l) is never called when l is unlocked, and that no two lock operations can be active on the same mutex at the same time.
B. Mapping annotations
This section describes the syntax and intuitive semantics of mapping annotations (in red in Figs. 1, 2, and 3).
Thread annotations identify the sets of dataflow equations that generate code sequenced into C threads (in lines 18-29 in Fig. 3). Code generation preserves the order of equations, performing no rescheduling. Remaining equations require no code inside threads. These equations represent internal thread synchronization made useless by thread sequencing (in lines 12, 13), the behavior of platform mutexes (in lines 16, 17) or fby, sub- and over-sampling equations requiring no code due to memory allocation or scheduling choices of Section II-B4 (lines 14-16).
Memory and mutex allocation annotations define the address of each function, thread, thread stack, and variable of type different from event. For functions and variables, we assume that code and local data are allocated as a contiguous memory region (e.g. code section after data section). For variables, allocation can be either direct (an absolute address) or by alias, i.e. the name of another variable having a direct allocation. Variables associated with mutexes are allocated to platform mutexes.

Footnote 8: To simplify presentation, we assume each variable fits inside a cache line.
Processor allocation annotations define the allocation of threads to processors and the dataflow variables representing the status of CPU caches. In Fig. 1(b), the first thread is allocated on cpu0, and variable x_cpu0 is the value of x as viewed from core cpu0 (the result a read would return). Variables with a memory allocation but without a processor allocation represent the status of the main memory.
API annotations provide the implementation of equations representing mutex and cache operations. In the machine semantics, the dataflow code of these equations is not executed. Instead, the corresponding API primitive is called, with the given arguments.
C. Machine semantics
The machine semantics provides an operational description of the platform execution, once the application has been mapped on it and once initialization has been performed. It is not meant to be a semantics for general multi-threaded implementations: it only covers the very restricted control structures of the implementations we target. Furthermore, when compared to classical cycle-accurate simulation semantics, it already includes elements of abstraction meant to hide timing aspects and to isolate the semantics of the well-formed code synthesized from the implementation model from the semantics of the potentially more general C code of external functions (such as Prod or Cons in our example). This isolation means that we cannot know the exact position of control (the program counter) during execution of a function, nor the way the function manipulates data during computations. For space reasons, this semantics covers only dataflow programs where the guard triggers are top.
1) State representation:
In the platform semantics, states provide an abstract view of the state of the execution platform components (CPUs, memory hierarchy, mutexes). They have the following structure (for clarity, we use an OCaml syntax for its definition): • In some rules the bullet is not visible. In this case, the term containing it is colored purple. In the initial state, in each thread, the bullet is placed before the first equation. The set of all possible such state representations obtained by bullet annotations is denoted ctrl_state.
b) Memory system state: It includes the state of the RAM and that of the CPU data caches. We denote by addr the set of memory addresses that can be used by dataflow variables.
At each instant, a memory location can be either written by some equation (as an output variable), or read by zero or more equations, as an input variable. Performing a read access on a variable that is currently written by another equation, or a write access on a variable that is currently written or read by another equation is an error. In the definition of type mem_loc_state, variant W corresponds to the write state, and posint gives the non-negative number of readers in the read access case.
In the case where a location is read, its value can be undefined (U) or defined (D) with a value of type word. Value U is used when we cannot determine the value of the location.
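The access discipline described above can be pictured as a small state machine per memory location. The following C sketch is only an illustration under stated assumptions: the paper gives the definition in OCaml, and the names `val_tag`, `loc_val`, `mem_loc`, `begin_read`, `end_read`, `begin_write` and `end_write` are hypothetical. Each transition returns false where the semantics would flag an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rendering of the value of a location: U (undefined) or D word. */
typedef enum { VAL_U, VAL_D } val_tag;

typedef struct {
    val_tag  tag;
    uint32_t word;      /* meaningful only when tag == VAL_D */
} loc_val;

/* Hypothetical rendering of the access state: W, or R with a reader count. */
typedef struct {
    bool     writing;   /* true: state W; false: state R */
    unsigned readers;   /* number of current readers, may be zero */
    loc_val  value;     /* value of the location while in state R */
} mem_loc;

/* Reading a location that is currently written is an error. */
bool begin_read(mem_loc *m) {
    assert(m != NULL);
    if (m->writing) return false;
    m->readers++;
    return true;
}

bool end_read(mem_loc *m) {
    assert(m != NULL);
    if (m->writing || m->readers == 0) return false;
    m->readers--;
    return true;
}

/* Writing a location that is currently written or read is an error. */
bool begin_write(mem_loc *m) {
    assert(m != NULL);
    if (m->writing || m->readers > 0) return false;
    m->writing = true;
    return true;
}

/* Assumption: the writer supplies the resulting value (U when unknown). */
bool end_write(mem_loc *m, loc_val v) {
    assert(m != NULL);
    if (!m->writing) return false;
    m->writing = false;
    m->readers = 0;
    m->value = v;
    return true;
}
```

A caller would start from a location in state R with zero readers, and treat any false result as a transition to the error state.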
In a given state, read accesses from different processors to the same memory location may produce different values. To allow reasoning on this memory consistency issue, type loc_value structures the storage of these values. The value stored in record field ram is the one stored in RAM. The value stored in cache(p) is the one an access from CPU core p would return.
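To make the per-core view concrete, here is a minimal C sketch of a loc_value-like record. It is an illustration, not the paper's definition: the core count `NUM_CPUS`, the functions `cpu_read`, `cpu_write`, `cache_flush` and `cache_invalidate`, and the write-back behavior they encode are all assumptions, and the U/D distinction is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CPUS 2   /* hypothetical number of cores */

/* Hypothetical record: one RAM value plus one view per core. */
typedef struct {
    uint32_t ram;               /* value stored in main memory */
    uint32_t cache[NUM_CPUS];   /* cache[p]: value a read from core p returns */
} loc_value;

/* A read from core p returns that core's view, not necessarily the RAM value. */
uint32_t cpu_read(const loc_value *v, int p) {
    assert(v != NULL && 0 <= p && p < NUM_CPUS);
    return v->cache[p];
}

/* Assumed write-back behavior: a write is first visible to core p only. */
void cpu_write(loc_value *v, int p, uint32_t w) {
    assert(v != NULL && 0 <= p && p < NUM_CPUS);
    v->cache[p] = w;
}

/* Assumed cache operation: push core p's view to RAM. */
void cache_flush(loc_value *v, int p) {
    assert(v != NULL && 0 <= p && p < NUM_CPUS);
    v->ram = v->cache[p];
}

/* Assumed cache operation: reload core p's view from RAM. */
void cache_invalidate(loc_value *v, int p) {
    assert(v != NULL && 0 <= p && p < NUM_CPUS);
    v->cache[p] = v->ram;
}
```

Under these assumptions, a value written on one core becomes visible on another only after a flush on the writer's side followed by an invalidate on the reader's side, which is the kind of protocol step the cache-related API annotations are meant to represent.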
The initial memory state is determined by the initial value of fby equations of the program: All locations are in state (R (0,v) 
D. Correctness and semantics preservation
Consider an implementation model whose dataflow part is correct under synchronous semantics and which satisfies the Refinement proof obligation set in the introduction 10 and the correctness properties of Section II-B4. To be considered correct, this implementation model must further satisfy three mapping-related properties: (1) it must allow the generation of executable code, (2) code execution must not lead to error states, and (3) execution must implement the dataflow semantics of the implementation program.
To allow code generation, a number of structural properties must be satisfied, e.g. no two threads may share the same processor, and no two functions may have the same memory allocation. These properties can be statically checked at low cost by the code generation (e.g. compilation) tools.
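The two structural properties given as examples can be checked with simple pairwise tests. The C sketch below is illustrative only: the record types `thread_map` and `func_map` and both function names are hypothetical, and "same memory allocation" is read here as overlapping address ranges, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mapping records extracted from an implementation model. */
typedef struct { const char *name; int cpu; } thread_map;
typedef struct { const char *name; unsigned long addr; unsigned long size; } func_map;

/* True when no two threads are allocated to the same processor. */
bool threads_on_distinct_cpus(const thread_map *t, size_t n) {
    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (t[i].cpu == t[j].cpu) return false;
    return true;
}

/* True when no two functions occupy overlapping address ranges
   (assumed interpretation of "same memory allocation"). */
bool functions_do_not_overlap(const func_map *f, size_t n) {
    assert(f != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (f[i].addr < f[j].addr + f[j].size &&
                f[j].addr < f[i].addr + f[i].size)
                return false;
    return true;
}
```

The quadratic loops keep the sketch short; for models with thousands of nodes a tool would more likely sort the allocations first.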
Assuming proof obligation Code generation is satisfied, proving properties (2) and (3) amounts to proving that under machine semantics execution never enters state Err and that execution under platform semantics preserves the semantics of its dataflow part under Kahnian semantics. Semantics preservation is defined as the preservation of the sequences of values taken as input (as guards or actual equation inputs) and produced as output by the execution of the various equations (which can be represented with histories, introduced in Section II-B2). In practice, ensuring this amounts to ensuring a number of consistency properties that can be checked by static analysis. For instance, structural properties of the synchronization protocol can be used to determine that successive uses of a mutex for different synchronizations do not interfere, and thus satisfy the requirements set in Section III-A on the use of lock and unlock.
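Semantics preservation compares sequences of values, so a finite execution trace can be checked against reference histories. The C sketch below illustrates the idea for testing purposes only; it is not the static analysis mentioned above. The `history` type, both function names, and the choice of a prefix comparison for finite traces are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical finite history: the values one equation consumed or produced. */
typedef struct {
    const uint32_t *vals;
    size_t          len;
} history;

/* True when the observed trace is a prefix of the reference history. */
bool history_is_prefix(history observed, history reference) {
    if (observed.len > reference.len) return false;
    for (size_t i = 0; i < observed.len; i++)
        if (observed.vals[i] != reference.vals[i]) return false;
    return true;
}

/* True when every equation's observed trace agrees with its reference. */
bool histories_preserved(const history *observed, const history *reference,
                         size_t n_equations) {
    assert(observed != NULL && reference != NULL);
    for (size_t i = 0; i < n_equations; i++)
        if (!history_is_prefix(observed[i], reference[i])) return false;
    return true;
}
```

Here the reference histories would come from an execution of the dataflow part under its synchronous or Kahnian semantics, and the observed ones from a run of the generated code.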
IV. EXPERIMENTAL EVALUATION
This paper introduces a language, not a mapping tool. From this perspective, we have already provided the elements showing that we can model implementations featuring optimized resource allocation. As a supplementary element of proof, this section shows that our language allows the representation of highly optimized implementations produced by an automatic mapping tool [11] .
We used this tool to map a complex piece of critical avionics software, comprising more than 5000 unique dataflow nodes and 36000 variables. Its current implementation is sequential and time-triggered. Periodically, a sequence of 24 non-preemptive sequential tasks (built out of nodes) is triggered at fixed time intervals. Our first objective was to parallelize each of the 24 tasks, taken separately, and to demonstrate the correctness of the implementations. Each task was transformed into an InteLus specification, and an automatic mapping tool performed allocation, scheduling, memory allocation, and the synthesis of the synchronization and memory coherency protocols. Implementation programs were synthesized, from which C/ldscript code allowing compilation was generated. Using the results of this paper, the functional correctness of implementations can be stated.
The performance measurements of [11] involved few memory allocation optimizations. We have now applied aggressive optimizations to reduce memory and mutex needs. The following table shows various characteristics of the six largest tasks (in number of nodes) and averages over the 24 tasks. The first important figure concerns memory allocation (Mem), where reuse allows a 71% reduction in memory needs w.r.t. the previous implementation, which used one location per specification variable. We exclude input variables from the statistics, because they are currently excluded from optimization (work in progress). The number of point-to-point thread synchronizations (Sync in the table) is large, because very fine-grain synchronization is used to allow efficient resource allocation. However, the cost of these synchronizations is very low (30 CPU cycles for an unlock/lock pair). Furthermore, only 120 platform mutexes are needed, due to mutex reuse.
Our modeling approach allows the representation of such highly optimized implementations.
V. CONCLUSION
We have provided proof of our initial claim: Dataflow specifications can be given efficient multi-threaded implementations that preserve an internal dataflow structure. The dataflow structure of such implementations can be exposed using new language constructs. Doing so facilitates formulating the correctness of implementations, and also proving it, through the use of results such as Theorem 1, or through the use of analysis techniques specific to the dataflow domain.
Work will continue along two axes. First, we will enrich the implementation modeling language with new features, starting with real-time capabilities and with a wider range of target architectures that include communication devices (DMAs, networks, external RAMs, etc.). Second, we will develop a formally proven translation validation tool covering the transformation of functional specifications into implementation programs. This tool will later need to be complemented with the validation of the translation from implementation programs to code running on the platform, and of the translation from Lustre to InteLus.
