Code generation for multi-phase tasks on a multi-core distributed memory platform by Fort, Frédéric & Forget, Julien
HAL Id: hal-02295835
https://hal.archives-ouvertes.fr/hal-02295835
Submitted on 24 Sep 2019
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destinée au dépôt et à la diffusion de documents
scientifiques de niveau recherche, publiés ou non,
émanant des établissements d’enseignement et de
recherche français ou étrangers, des laboratoires
publics ou privés.
Code generation for multi-phase tasks on a multi-core
distributed memory platform
Frédéric Fort, Julien Forget
To cite this version:
Frédéric Fort, Julien Forget. Code generation for multi-phase tasks on a multi-core distributed memory
platform. 2019 IEEE 25th International Conference on Embedded and Real-Time Computing Systems
and Applications (RTCSA), Aug 2019, Hangzhou, China. ￿hal-02295835￿
Code generation for multi-phase tasks on a
multi-core distributed memory platform
Fre´de´ric Fort
CRIStAL
Univ. Lille, UMR 9189
Lille, F-59000, France
frederic.fort@univ-lille.fr
Julien Forget
CRIStAL
Univ. Lille, UMR 9189
Lille, F-59000, France
julien.forget@univ-lille.fr
Abstract
Ensuring temporal predictability of real-time systems on a multi-core
platform is difficult, mainly due to hard to predict delays related to shared
access to the main memory. Task models where computation phases and
communication phases are separated (such as the PRedictable Execution
Model [23]), have been proposed to both mitigate these delays and make
them easier to analyze.
In this paper we present a compilation process, part of the Prelude
compiler [20], that automatically translates a high-level synchronous data-
flow system specification into a PREM-compliant C program. By au-
tomating the production of the PREM-compliant C code, low-level imple-
mentation concerns related to task communications become the respon-
sibility of the compiler, which saves tedious and error-prone development
efforts.
Index terms - Code generation, PREM, distributed memory
1 Introduction
Multi-core hardware platforms are increasingly being used for the implementa-
tion of embedded systems, due to their potential for increasing system perfor-
mances. However, implementing real-time systems on such platforms remains
complex, mainly because cores share access to a central memory. This leads to
contentions, which cause significant execution delays that are hard to predict,
because they require to finely analyse task codes, task interferences and the
contention resolution mechanisms [24].
To simplify the analysis of task interferences, the PRedictable Execution
Model (PREM) [23] advocates to decouple communication phases from com-
putation phases. For instance, the AER task model [8], a declination of the
Partially funded by the French National Research Agency, Corteva project (ANR-17-CE25-
0003)
1
PREM model, splits each task of the system into three phases. The Acquisition
phase loads task data and instructions from the main memory into the core’s
local memory. Then, the Execution phase performs the task computations us-
ing only local memory. Finally, the Restitution phase copies the results of the
E-phase back into the main memory, for use by other tasks. This simplifies
timing analysis because: 1) communication phases are clearly identified, so the
system scheduler can schedule communications [1, 14] and avoid contentions;
2) worst-case execution time analysis (WCET) of computation phases does not
need to take bus contentions into account [23].
Manually implementing a PREM-compliant program is tedious, unintuitive
and error-prone. Another solution is to rely on a compiler that automates
phases separation. For instance, PREM-compliant compilation for the LLVM
framework has been proposed in [11, 17]. Our approach also tackles PREM-
compliant C code generation but starts from a higher level of abstraction than
previous approaches.
Contribution We present an extension to an existing compiler, which pro-
duces PREM-compliant code. The input of the compiler is a Prelude syn-
chronous data-flow program [20]. The output of the compiler is concurrent
multi-task PREM-compliant C code. The synchronous semantics is close to
the PREM model, making the translation into PREM natural. We target a
multi-core platform with distributed memory: one shared main memory plus
one private scratchpad memory (SPM) for each core. According to a predefined
distribution of tasks onto cores, the compiler generates a separate C code for
each core. The code includes mechanisms to execute tasks periodically, syn-
chronise task communications across cores and perform data transfers between
local memories and the main memory. The main advantage of our approach
is to simplify the development process, by automating the translation from the
high-level specification in Prelude to the low-level implementation in C. In
particular, concerns related to task communications become the responsibility
of the compiler.
Validation We experiment with the Rosace case study [22], running on an
FPGA platform with two NIOS-II Altera processors, using the Erika Real-Time
Operating System by Evidence [9]. Because we rely on a compiler to produce
the C code, we can easily compare PREM and non-PREM implementations
(we simply need to recompile the Prelude program with different options).
Because we rely on reconfigurable hardware (the FPGA), we can easily compare
architectures with scratchpad memory or with cache memory.
2 Related works
Decoupling communications from computations, so as to improve timing pre-
dictability, was first proposed in the PRedictable Execution Model (PREM)
approach [23]. PREM was first designed for improving timing predictability of
2
I/O peripheral accesses for a single core hardware, but was then extensively
studied in the context of multi-core hardware where cores contend for shared
resources.
The majority of works on PREM task models concerns timing analysis.
In [6], authors have demonstrated the benefit of using PREM to reduce the
pessimism of WCET analysis. The problem of designing scheduling algorithms
and schedulability analysis for PREM tasks has drawn a lot of attention [1–3,
5,14,15,18,25,30–32]. Schedulability analysis is out of the scope of the present
paper. Our work relies on classic partitioned Deadline-Monotonic scheduling,
using semaphores to implement inter-task synchronisations related to data-
communications.
Other works have focused on the implementation of PREM-compliant ap-
plications. OS-level support or hardware drivers for the execution of PREM
tasks have been proposed in [7, 27–29]. Converting legacy code into PREM-
compliant code is a non-trivial task, which requires a very good understanding
of the code to convert. Therefore, solutions have been proposed to automate
this conversion. Light-PREM [16] is a software refactoring approach to pro-
duce PREM-compliant code from legacy code, based on memory profiling tools.
In [26], authors proposed a compilation technique that produces code that exe-
cutes memory access phases in parallel with computation phases. The authors
of [11] proposed a technique for compiling a GPU kernel into PREM-compliant
code. In [17], authors present a compiler based on the LLVM infrastructure
that refactors legacy code into PREM code.
Our work is orthogonal to these approaches, in that we start from a high-
abstraction language that naturally fits with the hypotheses of the PREM
model. The Prelude language [20], belongs to the Synchronous Languages
family. Compilation of synchronous languages for distributed hardware plat-
forms was studied in [4,12,13], but with a single execution thread per CPU. Com-
pilation into multi-thread/multi-task code was proposed for Prelude in [20],
then for control-flow synchronous languages in [33, 34] and for Scade in [19].
Unlike Prelude, [19, 33, 34] do not target systems with multiple periodicity
constraints. In [21], a first version of PREM-compliant code generation for
Prelude was proposed. However, it relies on a non-preemptive schedule com-
puted off-line, and executed in bare metal. In comparison, the present work
relies on preemptive on-line scheduling using the Erika OS.
3 Model
3.1 Hardware model
We consider a multi-core architecture with distributed memory. Each processor
ρi ∈ Π has access to a global shared memory MG and to a private memory
Mi. We assume a static allocation of code and data to SPMs. Dynamic SPM
allocation, where SPM address ranges can be shared between elements with
non-overlapping lifespans is not considered here (unlike e.g. [26]). Compared to
3
a cache-based architecture, in our case distributed memory is apparent in the
program code (local memory is explicitly addressable). Thus, memory transfers
between private and global memories are handled by the Prelude compiler.
This implies more predictable memory accesses without overburdening the pro-
grammer.
3.2 Scheduling
We assume a fixed-priority partitioned scheduler (Deadline-Monotonic in our
Erika implementation). Task partitioning and schedulability analysis are out
of the scope of this paper, so priority assignments and allocation of tasks to
processors are considered to be inputs of our model. We use semaphores to
implement task synchronisations required for data-communications purposes.
We do not allow simultaneous execution of communication and computation
phases on the same processor (unlike e.g. [26]).
3.3 Prelude
Prelude is a synchronous data-flow programming language. In comparison
to more traditional synchronous languages, it adds primitives dedicated to the
specification of real-time constraints and targets compilation into multi-task
code. A simple Prelude program, which we will use as a running example
for the rest of the paper, is provided in Figure 1. The program first declares
imported nodes, sensors and actuators, whose behaviour is programmed outside
Prelude as C functions. The main node M details the data-flows between the
previous nodes. For instance, C produces value tmp, which is used by actuator
D. Periods are specified on the inputs of the main node, (e.g. A has period 5 and
offset 0). The Prelude compiler deduces the periods of node calls from the
period of their inputs (hence, the data-flow nature of the language). In addition,
it requires the inputs and outputs of an imported node call to be synchronous,
i.e. to have the same rate. Since the inputs/outputs of node C initially have
different rates, we use rate transition operators to make them synchronous: A/^2
produces a flow twice slower than A. Similarly tmp*^2 produces a flow twice
faster than tmp, so the compiler infers that D has rate (5, 0).
The synchronous semantics is a natural match with the PREM model, be-
cause it assumes that programs and functions repeatedly execute the following
sequence: 1) acquire all their inputs simultaneously; 2) perform computations
with no side-effect on variables other than their outputs; 3) produce all their
outputs simultaneously.
3.4 Task graph
The translation of a Prelude program into C code consists of two main steps.
First, the Prelude program is translated into a task graph. Then, the task
graph is translated into C code. The present work did not require any modifi-
cation on the first step, the reader is referred to [10] for details.
4
imported node C(i,j: int) returns (o:int) wcet 2;
sensor A wcet 1; sensor B wcet 1; actuator D wcet 1;
node M(A: int rate (5,0); B: rate (6,0)) returns (D: int)
var tmp;
let
tmp=C(A/^2, B*^3/^5);
D=tmp*^2;
tel
Figure 1: Prelude running example.
The task graph is a directed acyclic graph (T ,D). Each task τi ∈ T releases
a sequence of non-overlapping periodic jobs with a period Ti, where τ
n
i denotes
the n-th job of τi. Since we assume partitioned scheduling, each task τi ∈ T
is assigned to a processor ρi ∈ Π and may only execute on that processor. We
denote pii for that processor.
The set D defines the data-dependencies between jobs, using a model derived
from [10], which allows for data-dependencies between tasks of different periods.
We denote τni → τmj when (τni , τmj ) ∈ D, which means that τni produces data
that is used by τmj . Let Hi,j = lcm(Ti, Tj) be the hyperperiod of τi and τj . We
impose that communications follow a pattern that repeats each Hi,j , which is
commonly the case in real-time applications [10]. A dependency between two
tasks τi and τj is defined by a set of job pairs denoted Di,j , where by definition:
∀(τni , τmj ) ∈ Di,j : n < Hi,jTi ∧m <
Hi,j
Tj
. Then, we obtain D by unrolling all
Di,j across multiple hyperperiods:
D = {(τpi , τ qj ) |∃k ∈ N, τi, τj ∈ T , (τni , τmj ) ∈ Di,j ,
(p, q) = (n,m) + (k
Hi,j
Ti
, k
Hi,j
Tj
)}
The task graph for our running example is depicted Figure 2. Each two
successive jobs of τD depend on the same job of τC . One out of two successive
values produced by task τA is used by task τC . τB and τC have non-harmonic
periods: τ0C , τ
1
C , τ
2
C depend on data produced by respectively τ
0
B , τ
1
B and τ
3
B ;
this pattern repeats every HB,C .
4 Multi-phase communications
In this section, we focus on task data-dependencies, detail their semantics, and
how they translate into the PREM model. In this paper, we use the conventions
of the AER task model [8], a declination of the PREM model.
We assume that task communications follow a causal semantics. For a
given execution schedule, let begin(τni ) denote the date τ
n
i starts executing
and end(τni ) denote the date at which it completes in this schedule. Causal
communication semantics impose that:
5
AB
C D
ρ1, T=5
ρ1, T=6
ρ0, T=10 ρ0, T=5
DA,C = {(τ0A, τ0C)}
DB,C = {(τ0B , τ0C), (τ1B , τ1C), (τ3B , τ2C)}
DC,D = {(τ0C , τ0D), (τ0C , τ1D)}
Figure 2: Running example
Phase Dependency Phase Dependency
EA RA DA,C
EB RB DB,C
AC DB,C , DA,C EC DC,D
ED DC,D
Table 1: Phases and related data-dependencies
• Any job τni produces all its output data at end(τni );
• Any job τni acquires all its input data at begin(τni );
• For all τni → τmj , we must have: end(τni ) < begin(τmj )
4.1 AER phases
Let us now detail how causal communications translate into the AER model
of [8]. Each task τi is divided into three phases. During the Acquisition phase
(Ai), data is copied from MG into Mi. The Execution phase (Ei) then exe-
cutes using only Mi. Finally, in the Restitution phase (Ri), the results of the
Execution phase are copied back from Mi into MG.
In our implementation, not all tasks have A- and E- and R-phases. First,
tasks without any incoming data-dependencies, have no A-phase. Tasks without
outgoing data-dependencies, have no R-phase. Second, we say that a data-
dependency τni → τmj is local iff pii = pij , else it is distant. We also say that
a task τj is colocated with τi, iff pii = pij . If a task has only local incoming
data-dependencies, it has no A-phase (no need to use MG). The same applies
for outgoing data-dependencies and R-phases.
Phases for our running example are detailed in Table 1. For instance, τC
copies both its inputs during AC . Since τC and τD are colocated, their data-
dependencies are directly handled by EC and ED.
6
4.2 Precedence constraints
To respect the causal semantics of data-dependencies, we impose precedence
constraints between phases. Some data-dependencies impose redundant prece-
dence constraints that can safely be removed (e.g. τ0C → τ1D because τC is twice
slower than τD):
relevant(τni → τmj ) ⇔@τni → τm
′
j ,m
′ < m∧
@τn
′
i → τmj , n < n′
Let Xni → Y mj denote a phase precedence constraint (which imposes that
end(Xni ) < begin(Y
m
j )). By definition, we have:
∀Ani , Eni , Rni : Ani → Eni → Rni (1)
∀τni , τmj , relevant(τni → τmj ), pii 6= pij : Rni → Amj (2)
∀τni , τmj , relevant(τni → τmj ), pii = pij : Eni → Emj (3)
In our running example, precedence constraints between τC and τD stem
from Equation 3, and others from Equation 2.
5 AER code generation
An overview of the compilation of a Prelude program is provided in Figure 3.
The Prelude program is compiled into one C file per CPU and one C file for the
global memory MG. Each CPU code contains one function per phase allocated
to that partition, and related communication and synchronisation code (see
Figure 4 for instance). The MG code contains data shared for inter-processor
communication purposes. In addition the C application contains code not gen-
erated by Prelude: 1) for each task, a user-provided imported function, to be
executed by jobs of the corresponding E-phase; 2) the OS specific code that
integrates the generated files into the final application.
In Erika one processor assumes the master role, which is in charge of declar-
ing and initialising shared data (in our case, of handling the MG data). The
compilation of the C code produces one binary per processor ρi, to be stored
in Mi, which contains the instructions and local data of ρi. The binary of the
master processor also contains shared communication data, which the master
processor copies at boot-time into MG.
In our running example, processor ρ0 is the master processor. As an example,
the code generated for tasks τA, τC is provided in Figure 4. It is detailed in the
next sections.
5.1 Communication buffers
For each input or output of each task, the compiler allocates a working variable
in Mi that is only accessed by the phases of that task (variables suffixed by _loc
7
Prelude program
Generated
files (C)
enerated
files ( )
Generated
files (C)
ρ0
ρ1
. . .
OS +
integration (C)
Imported
functions (C)
I ported
functions ( )
Imported
functions (C)
ρ0
ρ1
. . .
BinariesinariesBinaries
ρ0
ρ1
. . .
Figure 3: Overview of the Prelude compilation chain
1 // CPU 0
2 void C_A() {
3 wait_sem(sem_A_C);
4 if
(must_wait_B_C())↪→
5
wait_sem(sem_B_C);↪→
6
7 a_loc = read_val(
8 A_C_buff,
A_C_idx);↪→
9 b_loc = read_val(
10 B_C_buff,
B_C_idx);↪→
11
12 A_C_idx += 1;
13 if
(must_change_B_C())↪→
14 B_C_idx += 1;
15 }
16 void C_E() {
17 c_out = C(a_loc,
b_loc);↪→
18 C_D_buff = c_out;
19 post_sem(sem_C_D);
20 }
// CPU 1
void A_E() {
a_loc = A();
}
void A_R() {
if
(must_write_A_C())↪→
write_val(
A_C_buff,
a_loc);↪→
if
(must_post_A_C())↪→
post_sem(sem_A_C);↪→
}
Figure 4: Generated task code of τA and τC
or _out in Figure 4). It allocates a communication buffer for each Di,j (variables
suffixed by _buff in Figure 4). If τi and τj are colocated, the buffer resides in
Mi (e.g. C_D_buff in M0), otherwise it resides in MG (e.g. A_B_buff).
In the E-phase code, the call to the imported function only operates on
working variables (e.g. Figure 4 Line 17). Before this call, we must copy input
data from communication buffers into working variables. After this call we
must copy output data from working variables into communication buffers. For
colocated communications, the copies are directly performed by the E-phase
(Line 18). For distant communication, they are performed by the A/R-phases
(Lines 7 Column 1, and 8 Column 2). We use the OS-specific functions read val
and wrive val to perform copies between Mi and MG.
8
5.2 Multi-rate communications
The Prelude compiler determines for each Di,j (see [20] for more details):
• The size of the communication buffer i_j_buff (e.g. C_D_buff is of size
2);
• A function must_change_i_j, which tells when to change the cell of
i_j_buff each τmj reads from (e.g. must_change_C_D always returns
true);
• A function must_write_i_j, which tells for each τni if it must write in
i_j_buff (e.g. must_write_A_C tells that only one out of two successive
jobs of τA writes in the buffer);
• A function must_wait_i_j, which tells if τmj must wait on the communi-
cation semaphore;
• A function must_post_i_j, which tells if τni must post on the communi-
cation semaphore.
In the Erika implementation, each data-dependency Di,j is associated with
a semaphore. Before copying the input for τj , if must_wait_i_j returns true we
wait on the associated semaphore (Figure 4 Line 4). After copying the output
for τi, if must_post_i_j returns true we post on the associated semaphore
(Line 11). This ensures that phases execution respect the precedence constraints
defined in Section 4.2.
6 Validation
We validate our work by implementing the Rosace [22] case study on an FPGA
board with two softcores, using the Erika OSEK-compliant RTOS.
6.1 Hardware platform
In order to allow the comparison between different hardware architectures, we
rely on an FPGA development board, a Cyclone III by Altera with two NIOS-II
softcores, depicted in Figure 5. The data and instruction master ports con-
nect the processor to the Avalon Interconnect Fabric, a partial crossbar with
a master/slave behaviour, which serves as a hub to access shared resources of
the board. In our case, only NIOS-II processors are masters. Each master is
only connected to a subset of slaves. Accesses from two masters to two differ-
ent slaves can execute simultaneously. Simultaneous accesses to the same slave
are arbitrated with a round-robin policy. Tightly-coupled data and instruction
ports offer private contention-free access to processor-dedicated on-chip mem-
ories. Each processor has access to a tightly-coupled memory for data and to
another for instructions. These memories serve as scratchpad memory (Mi).
9
Table 2: Size of memories for the experiments
Memory (SPM architecture) Size
Data SPM ρ0: 5kB, ρ1: 4kB
Instruction SPM ρ0: 12kB, ρ1: 8kB
Main 2kB
Memory (cache architecture) Size
Data cache 2kB
Instruction cache 4kB
Main 29kB
NIOS
CPU 0
Instr.
scratchpad 0
Data
scratchpad 0
NIOS
CPU 1
Instr.
scratchpad 1
Data
scratchpad 1
Avalon Interconnect Fabric
Timers
CPU 0
Timers
CPU 1 Shared RAM Mutex
IO
Figure 5: The hardware design.
Processors share access to an on-chip shared RAM (MG). On a real embed-
ded board, the shared memory would be an external component (e.g. SDRAM,
SRAM), with typically longer access time. Therefore, our shared RAM is con-
trolled by an artificially slower clock which mimics these slower accesses. Finally,
processors are also connected to an on-chip mutex, on-board IOs and timers,
through the master ports. We use the mutex to implement synchronisations,
because the processors do not have dedicated built-in primitives.
In addition to the scratchpad architecture we just detailed, we implement a
cache-based architecture. It features a cache on each master port, with access
performances similar to the scratchpads. The FPGA has tight space limitations
(Erika is not available on more recent FPGA boards), memory sizes are re-
ported in Table 2. Space reserved for SPM in the scratchpad-based architecture
is instead reserved for the main memory in the cache-based architecture.
6.2 OSEK-compliant code
Erika is an OSEK-compliant RTOS, so tasks must all be declared statically in
an OIL configuration file, which is generated by the Prelude compiler in our
case. The OIL file is divided into several sections, in particular:
10
• CPU_DATA sections, which describe the hardware processors (identifier,
source files, Hardware Abstraction Layer, stack address space, ...);
• TASK sections define the task set (CPU allocation, events to handle syn-
chronisations, stack size, ...). In our case, each phase is declared as a
separate TASK;
• EVENT sections which enable us to implement binary semaphores.
6.3 Rosace case study
We use the Rosace [22] case study, a longitudinal flight controller, to validate
our work. It measures the airspeed, vertical speed and altitude of the aircraft,
and controls the aircraft accordingly. We simplified parts dedicated to environ-
ment simulation (which are not meant to be embedded), so that corresponding
tasks return dummy values.
The main benefit of the case study is to demonstrate that the Prelude com-
piler can automatically generate the Rosace PREM-compliant C code for our
distributed memory platform. This also enables us to compare different hard-
ware architectures as shown in Figure 6. We use the response time of each task
for the comparison. The figure shows the speedup of PREM code on the SPM-
based architecture with respect to non-PREM on the cache-based architecture
(e.g. speedup of 2 means that SPM+PREM is twice as fast as cache+non-
PREM). We provide results for different RAM clock speeds: either the same
as the global clock (red), 4 times slower (green) or 8 times slower (blue, which
corresponds to observed latencies on an external SRAM on similar boards). We
provide mean results for 20 executions for each configuration (variance is very
low).
The observed speedup is proportional to the RAM clock. When the shared
RAM is the slowest, the average speedup is 6.29 with a standard deviation of
2.19. When the shared RAM has the same clock as the global clock, the SPM
implementation barely outperforms the cache one. The average speedup is 1.09
with a standard deviation of 0.31. This is likely due to the OS overheads of
the PREM-compliant implementation, since each phase is implemented as a
separate task.
7 Conclusion
We presented a method to translate synchronous data-flow programs into multi-
task PREM-compliant C code, targeted for execution on a multi-core platform
with distributed memory using an industrial RTOS. We validated our approach
by implementing the Rosace case study on an FPGA platform. We compared
the performance of a scratchpad-based architecture with PREM-compliant code,
with a cache-based architecture with non-PREM code. Switching between both
versions, only requires to change the compilation options. Schedulability anal-
11
2 4 6 8 10 12
engine
delta th c
delta e c
Va control 25
Vz control 25
aircraft dynamics
Vz filter
Va filter
altitude hold
Va c
h c
h filter
q filter
az filter
elevator
Same clock
Clock divided by 4
Clock divided by 8
Figure 6: Observed speedup (higher is better)
12
ysis, which requires to consider multi-rate precedence constraints in multi-core,
is left for future works.
References
[1] A. Alhammad and R. Pellizzoni. Schedulability analysis of global memory-
predictable scheduling. In Proceedings of the 14th International Conference
on Embedded Software. ACM, 2014.
[2] A. Alhammad and R. Pellizzoni. Time-predictable execution of multi-
threaded applications on multicore systems. In 2014 Design, Automation
& Test in Europe Conference & Exhibition (DATE). IEEE, 2014.
[3] A. Alhammad, S. Wasly, and R. Pellizzoni. Memory efficient global schedul-
ing of real-time tasks. In 21st IEEE Real-Time and Embedded Technology
and Applications Symposium. IEEE, 2015.
[4] P. Aubry, P. Le Guernic, and S. Machard. Synchronous distribution of
signal programs. In System Sciences, 1996., Proceedings of the Twenty-
Ninth Hawaii International Conference on,. IEEE, 1996.
[5] M. Becker, D. Dasari, B. Nicolic, B. Akesson, V. Ne´lis, and T. Nolte.
Contention-free execution of automotive applications on a clustered many-
core platform. In 2016 28th Euromicro Conference on Real-Time Systems
(ECRTS). IEEE, 2016.
[6] P. Burgio, A. Marongiu, P. Valente, and M. Bertogna. A memory-centric
approach to enable timing-predictability within embedded many-core ac-
celerators. In 2015 CSI Symposium on Real-Time and Embedded Systems
and Technologies (RTEST). IEEE, 2015.
[7] N. Capodieci, R. Cavicchioli, P. Valente, and M. Bertogna. Sigamma:
Server based integrated gpu arbitration mechanism for memory accesses.
In Proceedings of the 25th International Conference on Real-Time Networks
and Systems. ACM, 2017.
[8] G. Durrieu, M. Faugere, S. Girbal, D. G. Pe´rez, C. Pagetti, and
W. Puffitsch. Predictable flight management system implementation on
a multicore processor. In Embedded Real Time Software (ERTS’14), 2014.
[9] Erika. Erika enterprise. http://erika.tuxfamily.org/drupal/.
[10] J. Forget, F. Boniol, E. Grolleau, D. Lesens, and C. Pagetti. Scheduling
Dependent Periodic Tasks Without Synchronization Mechanisms. In 16th
IEEE Real-Time and Embedded Technology and Applications Symposium,
Stockholm, Sweden, Apr. 2010.
13
[11] B. O. Forsberg, L. Benini, and A. Marongiu. Heprem: Enabling predictable
gpu execution on heterogeneous soc. 2018 Design, Automation & Test in
Europe Conference & Exhibition (DATE), 2018.
[12] A. Girault, X. Nicollin, and M. Pouzet. Automatic rate desynchronization
of embedded reactive programs. ACM Transactions on Embedded Comput-
ing Systems (TECS), 5(3):687–717, 2006.
[13] T. Grandpierre, C. Lavarenne, and Y. Sorel. Optimized rapid prototyping
for real-time embedded heterogeneous multiprocessors. In Proceedings of
the seventh international workshop on Hardware/software codesign. ACM,
1999.
[14] C. Maia, G. Nelissen, L. Nogueira, L. M. Pinho, and D. G. Pe´rez. Schedu-
lability analysis for global fixed-priority scheduling of the 3-phase task
model. In Embedded and Real-Time Computing Systems and Applications
(RTCSA). IEEE, 2017.
[15] C. Maia, L. Nogueira, L. M. Pinho, and D. G. Pe´rez. A closer look into the
aer model. In Emerging Technologies and Factory Automation (ETFA),
2016 IEEE 21st International Conference on. IEEE, 2016.
[16] R. Mancuso, R. Dudko, and M. Caccamo. Light-prem: Automated soft-
ware refactoring for predictable execution on COTS embedded systems. In
RTCSA. IEEE Computer Society, 2014.
[17] J. Mateˇjka, B. Forsberg, M. Sojka, Z. Hanza´lek, L. Benini, and
A. Marongiu. Combining prem compilation and ilp scheduling for high-
performance and predictable mpsoc execution. In Proceedings of the 9th
International Workshop on Programming Models and Applications for Mul-
ticores and Manycores, 2018.
[18] A. Melani, M. Bertogna, V. Bonifaci, A. Marchetti-Spaccamela, and
G. Buttazzo. Memory-processor co-scheduling in fixed priority systems.
In International Conference on Real Time and Networks Systems (RTNS),
Lille, France, 2015.
[19] B. Pagano, C. Pasteur, G. Siegel, and R. Knizek. A model based safety crit-
ical flow for the aurix multi-core platform. Proceedings ERTS2, Toulouse,
France, 2018.
[20] C. Pagetti, J. Forget, F. Boniol, M. Cordovilla, and D. Lesens. Multi-task
implementation of multi-periodic synchronous programs. Discrete Event
Dynamic Systems, 21(3):307–338, 2011.
[21] C. Pagetti, J. Forget, H. Falk, D. Oehlert, and A. Luppold. Automated
generation of time-predictable executables on multicore. In Proceedings of
the 26th International Conference on Real-Time Networks and Systems,
2018.
14
[22] C. Pagetti, D. Saussie´, R. Gratia, E. Noulard, and P. Siron. The rosace
case study: From simulink specification to multi/many-core execution. In
2014 IEEE 19th Real-Time and Embedded Technology and Applications
Symposium (RTAS). IEEE, 2014.
[23] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and
R. Kegley. A predictable execution model for cots-based embedded sys-
tems. In Real-Time and Embedded Technology and Applications Symposium
(RTAS). IEEE, 2011.
[24] R. Pellizzoni and M. Caccamo. Impact of peripheral-processor interference
on wcet analysis of real-time embedded systems. IEEE Transactions on
Computers, 59(3):400–415, 2010.
[25] B. Rouxel, S. Derrien, and I. Puaut. Tightening Contention Delays While
Scheduling Parallel Applications on Multi-core Architectures. ACM Trans-
actions on Embedded Computing Systems (TECS), 16(5s), Oct. 2017.
[26] M. R. Soliman and R. Pellizzoni. WCET-Driven Dynamic Data Scratch-
pad Management With Compiler-Directed Prefetching. In 29th Euromicro
Conference on Real-Time Systems (ECRTS 2017), 2017.
[27] R. Tabish, R. Mancuso, S. Wasly, A. Alhammad, S. S. Phatak, R. Pelliz-
zoni, and M. Caccamo. A real-time scratchpad-centric os for multi-core
embedded systems. In Real-Time and Embedded Technology and Applica-
tions Symposium (RTAS), 2016 IEEE. IEEE, 2016.
[28] R. Tabish, R. Mancuso, S. Wasly, S. S. Phatak, R. Pellizzoni, and M. Cac-
camo. A reliable and predictable scratchpad-centric OS for multi-core em-
bedded systems. In RTAS. IEEE Computer Society, 2017.
[29] S. Wasly and R. Pellizzoni. A dynamic scratchpad memory unit for pre-
dictable real-time embedded systems. In ECRTS. IEEE Computer Society,
2013.
[30] S. Wasly and R. Pellizzoni. Hiding memory latency using fixed priority
scheduling. 2014 IEEE 19th Real-Time and Embedded Technology and Ap-
plications Symposium (RTAS), 2014.
[31] G. Yao, R. Pellizzoni, S. Bak, E. Betti, and M. Caccamo. Memory-
centric scheduling for multicore hard real-time systems. Real-Time Sys-
tems, 48(6):681–715, 2012.
[32] G. Yao, R. Pellizzoni, S. Bak, H. Yun, and M. Caccamo. Global real-time
memory-centric scheduling for multicore systems. IEEE Trans. Comput.,
65(9), Sept. 2016.
[33] E. Yip, A. Girault, P. S. Roop, and M. Biglari-Abhari. The forec syn-
chronous deterministic parallel programming language for multicores. In
2016 IEEE 10th International Symposium on Embedded Multicore/Many-
core Systems-on-Chip (MCSOC). IEEE, 2016.
15
[34] S. Yuan, L. H. Yoong, and P. S. Roop. Compiling esterel for multi-core
execution. In 2011 14th Euromicro Conference on Digital System Design.
IEEE, 2011.
16
