The AXIOM platform for next-generation cyber physical systems by Theodoropoulos, Dimitris et al.
ARTICLE IN PRESS
JID: MICPRO [m5G; July 7, 2017;14:21 ]
Microprocessors and Microsystems 0 0 0 (2017) 1–16
Contents lists available at ScienceDirect 
Microprocessors and Microsystems
journal homepage: www.elsevier.com/locate/micpro 
The AXIOM platform for next-generation cyber physical systems
Dimitris Theodoropoulos a , ∗, Somnath Mazumdar b , Eduard Ayguade c , f , Nicola Bettin d ,
Javier Bueno c , Sara Ermini e , Antonio Filgueras c , Daniel Jiménez-González c , f , Carlos Álvarez 
Martínez c , f , Xavier Martorell c , f , Francesco Montefoschi e , David Oro g , 
Dionisis Pnevmatikatos a , h , Antonio Rizzo e , Paolo Gai i , Stefano Garzarella i , Bruno Morelli i , 
Alberto Pomella d , Roberto Giorgi b 
a Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH) - Crete, Greece
b Dipartimento di Ingegneria dell’Informazione e Scienze Matematiche, Università degli Studi di Siena, Italy
c Barcelona Supercomputing Center (BSC), Barcelona, Spain
d VIMAR SpA Marostica, Italy
e Dipartimento di Scienze Sociali, Politiche e Cognitive, Università degli Studi di Siena, Italy
f Computer Architecture Department, Universitat Politècnica de Catalunya, Barcelona, Spain
g Herta Security Barcelona, Spain
h School of ECE, Technical University of Crete, Chania, Greece
i Evidence Srl
Article info
Article history:
Received 28 December 2016
Revised 24 May 2017
Accepted 29 May 2017
Available online xxx
Keywords:
Cyber-physical systems
Distributed shared memory
Programming model
Performance evaluation
Reconﬁgurable
Smart video surveillance
Smart home living
Abstract
Cyber-Physical Systems (CPSs) are widely used in many applications that require interactions between humans and their physical environment. These systems usually integrate a set of hardware-software components for optimal application execution in terms of performance and energy consumption. The AXIOM project (Agile, eXtensible, fast I/O Module), presented in this paper, proposes a hardware-software platform for CPS coupled with an easy parallel programming model and sufficient connectivity so that performance can scale up by adding multiple boards. AXIOM supports a task-based programming model based on OmpSs and leverages a high-speed, inexpensive communication interface called AXIOM-Link. The board also tightly couples the CPU with reconfigurable resources to accelerate portions of the applications. As case studies, AXIOM uses smart video surveillance and smart home living applications.
© 2017 Elsevier B.V. All rights reserved.
© 2017. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
∗ Corresponding author.
E-mail addresses: dtheodor@ics.forth.gr (D. Theodoropoulos), mazumdar@dii.unisi.it (S. Mazumdar), eduard.ayguade@bsc.es (E. Ayguade), nicola.bettin@vimar.com (N. Bettin), javier.bueno@bsc.es (J. Bueno), sara.ermini@unisi.it (S. Ermini), antonio.filgueras@bsc.es (A. Filgueras), daniel.jimenez-gonzalez@bsc.es (D. Jiménez-González), aarlos.alvarezmartinez@bsc.es (C. Álvarez Martínez), xavier.martorell@bsc.es (X. Martorell), francesco.montefoschi@unisi.it (F. Montefoschi), david.oro@hertasecurity.com (D. Oro), pnevmati@ics.forth.gr (D. Pnevmatikatos), antonio.rizzo@unisi.it (A. Rizzo), pj@evidence.eu.com (P. Gai), s.garzarella@evidence.eu.com (S. Garzarella), bruno@evidence.eu.com (B. Morelli), alberto.pomella@vimar.com (A. Pomella), giorgi@dii.unisi.it (R. Giorgi).
http://dx.doi.org/10.1016/j.micpro.2017.05.018
0141-9331/© 2017 Elsevier B.V. All rights reserved.
Please cite this article as: D. Theodoropoulos et al., The AXIOM platform for next-generation cyber physical systems, Microprocessors and Microsystems (2017), http://dx.doi.org/10.1016/j.micpro.2017.05.018
1. Introduction
“Cyber-physical systems (CPSs) integrate computation, communication, sensing, and actuation with physical systems to fulfill time-sensitive functions with varying degrees of interaction with the environment, including human interaction.” [1]. A similar definition for CPS is “an integrated framework of a network of information processing, sensors and actuators” [2,3]. Such systems allow close interaction not only from system to system, but also from human to system and vice versa, and are becoming ever more pervasive in many daily life activities [4–6]. The CPS domain includes the Internet of Things (IoT), smart homes, smart cities, and the smart grid. Everyday life is becoming increasingly dependent on CPS (e.g., smart video surveillance). Since 2008, CPS has been a high-priority research topic [7]. The noted challenges in designing a CPS architecture are infrastructural challenges, time management, data management (the data workflow), proper software-hardware integration (implementation challenges) and compliance with standards.
The AXIOM project (Agile, eXtensible, fast I/O Module) provides a general framework focusing on easily mapping applications to multi-board processing platforms [8,9]. Unlike other research efforts (such as CONTREX [10], DREAMS [11], EMC² [12],
MultiPARTES [13]) that focus mainly on mixed-criticality applications, AXIOM provides a generic platform with a complete application development suite. Despite the existence of many FPGA-based boards, to the best of our knowledge our approach is the first that combines all the features (especially parallel programmability, connectivity and scalability). To illustrate this, we compared more than twenty boards needed for modern CPS applications, most of which come from crowd-funding initiatives (some of which met our targets, while others did not), and present the comparison in Table 1.
Fig. 1. Proposed software stack and overview of the OmpSs support for AXIOM.
In this paper, we describe the key features of AXIOM and its progress to date. Our contributions are:
• We detail the software stack and programming model support for AXIOM, based on the OmpSs programming model [14].
• We illustrate in detail the low-level, inexpensive, high-speed AXIOM-Link and the supporting OS drivers.
• We discuss the results of our design space exploration, based on the execution traces generated by OmpSs. OmpSs now supports instrumentation with Extrae [15] to generate Paraver [16] traces for cluster and FPGA executions, for further execution analysis.
• We provide a first set of results from the project hardware prototypes.
The rest of the paper is organized as follows: in Section 2, we explain how the support for threads is provided using the AXIOM stack and the OmpSs programming model, followed by the profiling support in Section 3; in Sections 4 and 5 we illustrate the high-speed AXIOM-Link and describe the corresponding OS drivers. In Section 6 we discuss our evaluation platform, while in Sections 7 and 8 we present our application scenarios and our experimental results. We discuss related work in Section 9 and, finally, we conclude the paper.
2. Programming model of AXIOM
The AXIOM software stack is depicted in Fig. 1(a). In this section, we briefly describe the OmpSs programming model, the extensions planned for OmpSs to spawn tasks in the FPGA device, and the extensions needed to support the cluster version of AXIOM.
2.1. Introduction to OmpSs programming model
The OmpSs programming model supports the execution of heterogeneous tasks written in OpenCL, CUDA, or a high-level C or C++ language that can be converted to the machine language used in GPUs or to the bitstream used to program FPGAs. Also, the runtime supports the communications within a cluster of distributed memory machines. OmpSs can target tasks to the different nodes of the cluster. From the programmer perspective, the annotations required for the cluster support are exactly equivalent to those for symmetric multiprocessing (SMP). Currently, both the OpenCL and CUDA options require the programmer to provide the OpenCL or CUDA code and use the OmpSs target clauses (similar to the OpenMP target clauses) to move the data to the associated accelerator. In the AXIOM project, we use the same technique to spawn tasks to the FPGA, provided that there is either a compiler to generate the FPGA bitstream implementing the task from C or C++ code, or a bitstream available with a known interface to access the data.
For executing tasks in the cluster version, the programmer needs to specify the task as plain C or C++ code. Execution on the OmpSs@cluster version automatically allows the runtime system to spawn tasks to remote nodes. The programming model allows parallelizing applications on the AXIOM cluster and spawning tasks on the FPGAs available on each board. Using OmpSs@cluster with FPGA support, programmers express two levels of parallelism. The first level of parallelism targets the AXIOM cores, i.e. the cores that are available on the AXIOM board (e.g., the ARM-A9 cores in the case of a Xilinx Zynq SoC). Tasks at this level are spread across the AXIOM boards as if they were executed on an SMP machine. The second level of task parallelism is expressed through the OmpSs extensions targeting the FPGAs (see below, Section 2.1.1). The OmpSs programming model is based on two main components and some additional tools. They are:
• The Mercurium compiler [17], which takes the source code and interprets the OmpSs directives to transform the code to run on heterogeneous platforms, including OpenCL and CUDA accelerators. For AXIOM, the compiler has been extended to generate and support FPGA-based accelerators.
• The Nanos++ runtime system [18], which is responsible for managing and scheduling parallel tasks, respecting their dependencies, transferring the data to/from the accelerators when needed, and handling the lower-level interactions.
• Additionally, OmpSs can use the Extrae tool [15] to generate execution traces that can later be visualized with the Paraver tool [16] to analyze the execution behavior.
2.1.1. OmpSs extensions for FPGAs
OmpSs was extended to support the Zynq chip with the FPGA selected in the AXIOM project. The main extension to the OmpSs programming model to provide support for these chips in the Mercurium compiler is a new target device named fpga, in addition to the current smp, cuda and opencl devices.
 
Table 1. Comparison of recent FPGA-based boards (AXIOM related aspects are in bold).
Boards | FPGA/CPU | RAM | Standalone | ATMEL Based Arduino | Connectivity | Programmability
LOGi FPGA | Spartan6 LX9 | 256 MB | No | Pinout | SATA, Raspy/Beagle, SPI | IDE, GUI
MiniSpartan6 + | Spartan6 LX9/25 | 32 MB | Yes | – | I/O ports, DAC, ADC | IDE, GUI
Papilio DUO | Spartan6 LX9 | 2 MB | Yes | ATmega32U4, Mega Pinout | I/O ports | IDE, GUI
MOJO | Spartan6 LX9 | – | Yes | ATmega32U4, Custom Pinout | – | IDE, GUI
SmartZynq | Zynq 7010/7020 | 1 GB | No | – | Fast network and board2board | Hard
Parallella | Zynq 7010/7020, 16-core Epiphany | 1 GB | Yes | – | Gigabit ethernet, four high speed connectors | Standard tools
aijuboard | Zynq 7015 | 1 GB | Yes | – | SATA, Gigabit ethernet | Standard tools
RED PITAYA | Zynq 7010 | 512 MB | Yes | – | Gigabit ethernet, 4 fast analog inputs | Standard tools
OHO | Spartan 3E | – | Yes | – | I/O ports | Xilinx ISE only
RetroCade Synth | Spartan 3E or LX9 | 4 MB | No | – | Analog and digital inputs, MIDI, audio jacks | –
PAPILIO | Spartan 3E | 8 MB | Yes | – | I/O ports | No SDK
TRIFDEV | Lattice MACHXO2-1200 | – | Yes | Partial pinout, I2C | I/O ports | No SDK
owlBoard | Spartan6 LX9 | – | Yes | – | I/O ports | No SDK
Alan | Spartan6 LX45 | – | Yes | ATmega32U4 and Pinout | I/O ports | Arduino IDE, Xilinx ISE
4CH sig.gen. | Spartan6 LX9 | – | Yes | – | 4 DAC | None
Logitraxx | Spartan6 LX9 | 64 MB | Yes | Shield Compatible | I/O ports | No SDK
KromaLights | Spartan6 LX9, Cortex-M3 | 256 MB | Yes | Arduino Due (SAM3X) | LVDS, CAN, USART | SDK (Arduino IDE)
CrystalBoard | Spartan6 LX9, 4-core Cortex-A9 | 2 GB | Yes | Atmega328, UNO Pinout | Ethernet, WiFi | –
PSHDL board | Actel A3PN250 | – | Yes | Atmel XMega32 (to program FPGA) | UART, I/O ports | Simplified VHDL
Helix-4 | Altera Cyclone 4 (22k) | 4 MB | Yes | Arduino UNO shield | I/O ports | Altera Quartus II IDE
ZynqBerry | Zynq 7010 | 128 MB | Yes | – | Ethernet | Xilinx SDK
Z-turn Board | Zynq 7010/7020 | 1 GB | Yes | – | CAN, Ethernet | Xilinx SDK
The fpga device causes the Mercurium compiler to understand that the annotated function is to be compiled with the Xilinx Vivado HLS compiler for the FPGA in order to generate the bitstream. Other extensions may be necessary, such as the number of accelerators for each accelerator type.
Fig. 1(b) shows the main phases of the bitstream generation and compilation of the OmpSs code. With this extension, the compiler generates the code for the runtime system specifying the tasks that should be run in the FPGA device. The code is compiled with a back-end compiler (e.g., gcc) and will be executed on the Zynq ARM cores. This binary code (OmpSs.elf in Fig. 1(b)) will call the Nanos++ runtime with FPGA execution support. This support is based on the DMA library and the FPGA-DMA driver in the system. The tasks with target device(fpga) are extracted and modified, and the bitstream with the FPGA accelerators is generated automatically using the Xilinx toolchain, transparently to the programmer.
2.2. Runtime support
The runtime support has two parts: (i) the first part is responsible for FPGA-based execution, and (ii) the second part for the cluster environment.
2.2.1. FPGA runtime support
The Nanos++ runtime system has also been extended in the following ways:
• Support to spawn tasks in the FPGA device.
• Support for the target clauses related to data transfers. Data-copy clauses (copy_in, copy_out, copy_inout) trigger the transfer of the specified data to/from the FPGA device. Also, dependence clauses trigger data transfers to the device by default.
• Support for data transfers to/from the FPGA. The Nanos++ runtime now invokes the services of the DMA library developed to transfer data in the FPGA environment.
• Inclusion of the FPGA device in the support of the implements clause, to allow several implementations of a task to be scheduled on the available processors/devices.
In terms of FPGA support, the DMA library interface provides the means to interact with the Linux driver supporting the FPGA device. In the current prototype, when data is transferred to the FPGA hardware, the IP kernel is initiated automatically. The computation on the data proceeds to the end and, after finishing, the results can be read back to the host from the FPGA.
The main DMA library primitives allow getting the number of IP accelerators present in the FPGA device, and the handles to operate with them. For each IP accelerator, the library allows opening input and output DMA channels to send/receive data to/from it. The library allows allocating special memory buffers in kernel space to exchange data between the Linux kernel and the FPGA hardware. Kernel buffers are pinned to physical memory to avoid swapping them out while a DMA transfer is in progress. Buffers can be submitted for a DMA transfer to/from the specified device. Data transfers can be monitored to determine whether they are in progress, have finished, or a transfer error has occurred. This interface is used by the Nanos++ runtime system to drive the work of the IP accelerators in the FPGA.
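The transfer lifecycle described above (send input, kernel starts automatically, poll the status, read results back) can be illustrated with a small mock. This is only a sketch in Python: the class, method and status names are hypothetical and do not correspond to the actual AXIOM DMA library interface, and the "kernel" is an ordinary Python function rather than an IP core.

```python
from enum import Enum


class TransferStatus(Enum):
    # The three outcomes the text says a transfer can be monitored for.
    IN_PROGRESS = "in progress"
    FINISHED = "finished"
    ERROR = "error"


class MockAccelerator:
    """Hypothetical stand-in for one IP accelerator and its DMA channels."""

    def __init__(self, kernel):
        self.kernel = kernel   # Python function playing the role of the IP kernel
        self.status = None     # no transfer submitted yet
        self._output = None

    def submit_input(self, buffer):
        # As in the prototype described above, delivering the input data
        # starts the kernel; there is no separate "start" call.
        self.status = TransferStatus.IN_PROGRESS
        try:
            self._output = self.kernel(list(buffer))
            self.status = TransferStatus.FINISHED
        except Exception:
            self.status = TransferStatus.ERROR

    def read_output(self):
        if self.status is not TransferStatus.FINISHED:
            raise RuntimeError("no finished transfer to read back")
        return self._output


def run_task(accelerators, index, data):
    """Pick an accelerator by handle (here: a list index), send data, read results."""
    acc = accelerators[index]
    acc.submit_input(data)
    return acc.read_output()
```

For example, with `accs = [MockAccelerator(lambda buf: [2 * x for x in buf])]`, calling `run_task(accs, 0, [1, 2, 3])` doubles each element. A real runtime would poll the status asynchronously instead of running the kernel inline.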
2.2.2. Cluster runtime support
The OmpSs@Cluster [19] approach uses a communication layer to launch tasks to remote nodes. Task descriptors and data travel on the communication layer. In our current implementation, this layer is GASNet [20], usually running on top of MPI [21] through an Ethernet link. The next step is to provide the runtime with a communication layer that can exploit the high-speed dedicated interconnection AXIOM-Link (see Fig. 1(a)) using the AXIOM network interface explained in Section 4.
Fig. 2. OmpSs directives on matrix multiplication.
2.3. OmpSs coding example
Fig. 2 shows an example of matrix multiplication that has been annotated with OmpSs directives. Note that this code is independent of the execution platform (i.e., cluster, nodes with FPGAs, nodes with GPUs); the runtime is responsible for scheduling the tasks to the devices or nodes of the cluster, transparently to the programmer. In particular, this code shows a parallel tiled matrix multiply where each of the tiles (BS × BS sub-matrix) is a task. A, B and C are NB × NB matrices of pointers to BS × BS sub-matrices.
Each of those tasks has two input dependencies and an output dependence that will be managed at runtime by Nanos++. Those tasks can be scheduled/fired to an SMP core or an FPGA, as annotated in the target device directive, depending on resource availability. The copy_deps clause associated with the target directive hints the Nanos++ runtime to copy the data related to the input and output dependencies to/from the device when necessary.
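To make the task decomposition concrete, the following Python sketch models the tiled multiply as a list of tile tasks, each recording two input tiles and one output tile. It is not OmpSs code and does not reproduce Fig. 2: the `TileTask` record and the helper names are illustrative assumptions, and the output tile is treated as an accumulator, so tasks that share a C tile must run in order.

```python
from collections import namedtuple

# Illustrative task record: indices plus the tiles the task reads and writes.
TileTask = namedtuple("TileTask", "i j k inputs output")


def make_tasks(nb):
    """One task per (i, j, k): C[i][j] += A[i][k] * B[k][j] on BS x BS tiles."""
    tasks = []
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                tasks.append(TileTask(i, j, k,
                                      inputs=(("A", i, k), ("B", k, j)),
                                      output=("C", i, j)))
    return tasks


def tile_multiply(a, b, c, bs):
    """Body of one task: accumulate the product of two tiles into c."""
    for r in range(bs):
        for col in range(bs):
            acc = c[r][col]
            for t in range(bs):
                acc += a[r][t] * b[t][col]
            c[r][col] = acc


def run(tasks, A, B, C, bs):
    # Program order trivially respects every dependence; a task runtime is
    # free to reorder or parallelize tasks whose output tiles differ.
    for t in tasks:
        tile_multiply(A[t.i][t.k], B[t.k][t.j], C[t.i][t.j], bs)
```

With NB = 2 this produces eight tasks. The four groups of tasks that write different C tiles are mutually independent, which is the parallelism a dependence-aware runtime can exploit.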
3. Profiling support
The current implementation provides support to profile and trace cluster execution. In addition, a new hardware tracing mechanism makes it possible to profile and trace basic information from fpga tasks. Traces are automatically generated and translated to Paraver traces if specified at execution time. Those traces include both application and OmpSs runtime execution state information, so that the programmer can analyze the parallel execution behavior to detect potential performance bottlenecks. In Section 8 some trace results are presented and discussed. Those results uncover the need for hardware profiling support.
3.1. Hardware profiling support
New hardware support for FPGA profiling and tracing (from inside the FPGA) for high-level languages has been introduced. This feature is, to the best of our knowledge, novel for task-based parallel heterogeneous programming. The support is in the process of being integrated into the FPGA-task acceleration in OmpSs and is transparent to the programmer. The first profiling and tracing objective is to obtain input and output memory transfer and computation information from inside the OmpSs fpga task execution. With this aim, the idea is to:
• Create a hardware platform that integrates hardware profiling counters that can be read from both SMP cores and fpga accelerated tasks, transparently to the programmer.
• Create hardware counters that do not affect the performance of the fpga tasks.
• Make the fpga tasks return the profiling information as part of their outputs, transparently to the programmer.
• Interpret the profiling information in the device-dependent layer of the OmpSs runtime, transparently to the programmer.
• Include the profiling information in the automatically generated Paraver trace.
Our implementation uses the OMPT API [22] to generate the execution traces using the Extrae instrumentation tool. The OMPT API helps to integrate profiling of different accelerators/devices and CPUs using the same API, which can be supported by different instrumentation tools.
4. The AXIOM network interface
4.1. The network interconnect controller
After an initial exploration of the 32-bit Xilinx Zynq platform, the AXIOM platform is now designed around the Xilinx Zynq UltraScale+ SoC, which features a quad-core ARM A53 Application Processing Unit (APU) tightly coupled with FPGA fabric. AXIOM is designed to be modular at the next level, allowing the formation of more efficient processing systems through a low-cost but scalable high-speed interconnect. The interconnect will utilize the integrated gigabit-rate transceivers with relatively low-cost USB-C connectors to interconnect multiple boards. Such connectivity will allow users to build (or upgrade at a later moment) flexible and low-cost systems by cascading more AXIOM boards, without the need for costly specialized connectors and cables. AXIOM boards will feature two or four bi-directional links, so that the nodes can be connected in many different ways, such as ring and 2D-mesh/torus.
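As a small illustration of these topologies, the sketch below computes which boards a given board is wired to in a ring (two links) and in a 2D mesh or torus (four links). The row-major node numbering and the function names are assumptions made for this example; the text does not specify how AXIOM numbers its nodes.

```python
def ring_neighbors(node, n):
    """Neighbors of `node` in a ring of n boards (two links per board)."""
    return sorted({(node - 1) % n, (node + 1) % n} - {node})


def grid_neighbors(node, rows, cols, wrap):
    """Neighbors in a rows x cols grid (up to four links per board).

    wrap=True models a 2D torus, wrap=False a 2D mesh.
    Nodes are numbered row by row, starting at 0 (an assumption).
    """
    r, c = divmod(node, cols)
    result = set()
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if wrap:
            nr, nc = nr % rows, nc % cols
        elif not (0 <= nr < rows and 0 <= nc < cols):
            continue  # mesh: edge boards simply leave that link unused
        result.add(nr * cols + nc)
    result.discard(node)
    return sorted(result)
```

In a 3 × 3 mesh the corner board 0 uses only two of its four links, whereas in the torus variant every board uses all four.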
Fig. 3 illustrates the network interface (NI) architecture, originally introduced in [9], which implements remote direct memory access (RDMA) and remote write operations (raw data) as basic communication primitives visible at the application level. It consists of a queue set for RDMA and raw data messages, a set of hardware counters, control/status registers, a DMA engine, a low-level packet router, and two internal controllers for transmitting and receiving messages.
As depicted in Fig. 4, the message descriptor types handled by the NI can be divided into two main categories: raw messages and RDMA transaction messages. Raw messages are messages for which the network interface provides message buffers directly in the FPGA memory region. Their length is up to 128 bytes and they are used to direct a message either to a specific node (using the node ID) or to a neighbor interface (using the interface ID), in order to provide a way to implement a discovery algorithm in the Linux OS. RDMA transactions can be of three kinds: RDMA read requests (where a node asks for a copy of the memory on a remote node), RDMA write requests (where a node writes to the memory of a remote node), and LONG messages (which are RDMA write requests for which the destination remote address is decided by the remote node by taking it from a set of preallocated buffers). All messages have a port specification, which is used in the software stack to provide separate reception queues.
Fig. 3. The AXIOM network interface architecture.
Fig. 4. Supported message descriptor types by the AXIOM interconnect.
Fig. 5. The AXIOM router pipelined architecture.
Fig. 6. AXIOM boards interconnected in 2D-mesh.
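The message taxonomy can be summarized as a small data structure. The sketch below is a Python model under stated assumptions: the field names, the single `RAW` type covering both addressing modes, and the `deliver` helper are hypothetical and do not reflect the actual descriptor layout of Fig. 4. Only the 128-byte raw limit, the node-ID versus interface-ID addressing, and the per-port reception queues come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional

RAW_MAX_PAYLOAD = 128  # bytes, the raw-message limit stated in the text


class MsgType(Enum):
    RAW = "raw"
    RDMA_READ = "rdma_read"
    RDMA_WRITE = "rdma_write"
    LONG = "long"


@dataclass
class Descriptor:
    msg_type: MsgType
    port: int                           # selects the reception queue
    payload: bytes = b""
    node_id: Optional[int] = None       # destination node
    interface_id: Optional[int] = None  # or a neighbor interface (raw only)

    def validate(self):
        if self.msg_type is MsgType.RAW:
            if len(self.payload) > RAW_MAX_PAYLOAD:
                raise ValueError("raw messages carry at most 128 bytes")
            if (self.node_id is None) == (self.interface_id is None):
                raise ValueError("raw message needs exactly one of node_id / interface_id")
        elif self.node_id is None:
            raise ValueError("RDMA transactions address a remote node")
        return self


def deliver(queues, descriptor):
    """Append a validated descriptor to the reception queue of its port."""
    queues[descriptor.port].append(descriptor.validate())
```

A receiver would then keep `queues = defaultdict(list)` and call `deliver(queues, d)` for each incoming descriptor, so that consumers bound to different ports never see each other's messages.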
To send a new message, the OS posts its descriptor to the NI queues. The local hardware counters are used to register the progress of RDMA requests (described in Section 4.2). The memory-mapped control/status registers can be used by the OS to configure notification parameters (e.g., acks at OS level upon successful packet transmission and IRQs) and to monitor the progress of RDMA requests, respectively. The DMA engine is used for loading and storing data from/to the local SDRAM. The local APU communicates with its local NI via the Master High-Performance Port 0 (MHP0) and Slave High-Performance Coherent Port 0 (SHPC0) interfaces. MHP0 is used to access the control/status registers, while SHPC0 allows fast and coherent data storage to the local external memory.
The router module shown in Fig. 5 implements the routing and network discovery processes. The AXIOM routing algorithm will feature store-and-forward packet transmission with virtual circuits (VCs), and the network discovery process will be initiated at boot time by the master node of the network. After the process completes, every node will have its ID and local routing table, based on which all packets will be forwarded to output links.
The core router components can be outlined as: i) input buffering, ii) control, and iii) crossbar and link traversal. The input buffering module consists of four link controllers (LC), where each link employs queues to implement three VCs that store packets of different priorities. The router uses an Xon/Xoff strategy for notifying adjacent nodes of VC input buffer availability. If a VC queue reaches a predefined threshold, the router instantly transmits an Xoff packet to the link's adjacent node to block further packet transmission. Similarly, when the VC fullness drops below a certain level, the router instantly transmits an Xon packet to the link's adjacent node to resume packet transmission via this particular link.
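The Xon/Xoff behavior of a single VC input queue can be sketched as follows. This is an illustrative Python model, not the router's implementation: the class name, the use of two separate thresholds, and the threshold values are assumptions, since the text only states that Xoff is sent when a threshold is reached and Xon when fullness drops below a certain level.

```python
from collections import deque


class VCQueue:
    """One virtual-circuit input queue with Xon/Xoff flow control."""

    def __init__(self, capacity, xoff_threshold, xon_threshold):
        if not (0 < xon_threshold < xoff_threshold <= capacity):
            raise ValueError("need 0 < xon_threshold < xoff_threshold <= capacity")
        self.capacity = capacity
        self.xoff_threshold = xoff_threshold
        self.xon_threshold = xon_threshold
        self.items = deque()
        self.sender_blocked = False  # True after Xoff, until the next Xon

    def push(self, packet):
        """Store a packet; return "XOFF" if the neighbor must now be blocked."""
        if len(self.items) >= self.capacity:
            raise OverflowError("VC queue overflow")
        self.items.append(packet)
        if not self.sender_blocked and len(self.items) >= self.xoff_threshold:
            self.sender_blocked = True
            return "XOFF"
        return None

    def pop(self):
        """Remove the oldest packet; also return "XON" if the neighbor may resume."""
        packet = self.items.popleft()
        if self.sender_blocked and len(self.items) < self.xon_threshold:
            self.sender_blocked = False
            return packet, "XON"
        return packet, None
```

Keeping the Xon level below the Xoff level gives hysteresis, so the queue does not flood the link with control packets when its occupancy hovers around a single threshold.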
The route calculation (RC) finds the required output interface for a packet, based on the routing table and destination node, starting from the highest VC. If the VC number of the output link is enabled, then the packet is forwarded to the corresponding VC allocation (VCA). For each input link, the VCA always attempts to serve the VC with the highest priority, unless its destination node input VC buffer is blocked. In that case, it falls back to the next lower input VC.
During the switch allocation process, the packets from each buffer request a crossbar (Xbar) output. The switch allocation pairs the Xbar inputs to the Xbar outputs as efficiently as possible, trying not to leave an output link idle. If more than one packet requests the same output link, the grant policy decides according to:
• Priority (Xon/Xoff > VC2 > VC1 > VC0).
• If packets are of the same priority (e.g., both VC2), it chooses one (in a round-robin fashion) to grant an output port, while at the same time it looks for available packets of lower priority (VC1 or VC0) on the same input link that require a different port.
• Repeat until all packets are served.
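One arbitration cycle of such a grant policy might look like the sketch below. It is a simplified Python model written for illustration: the request format, the per-output round-robin pointer, and the rule that each input link receives at most one grant per cycle are assumptions, not details taken from the AXIOM router.

```python
PRIORITIES = ["XONXOFF", "VC2", "VC1", "VC0"]  # highest priority first


def allocate(requests, rr_pointer, num_inputs):
    """Grant crossbar outputs for one cycle.

    requests:   {input_link: {priority_class: requested_output_port}},
                describing the head packet of each class on each input.
    rr_pointer: {output_port: input_link to favour next}; updated in place.
    Returns     {output_port: (input_link, priority_class)}.
    """
    grants = {}
    busy_inputs = set()
    for prio in PRIORITIES:
        # Collect the inputs still competing for each free output at this level.
        contenders = {}
        for inp, by_prio in requests.items():
            out = by_prio.get(prio)
            if out is None or inp in busy_inputs or out in grants:
                continue
            contenders.setdefault(out, []).append(inp)
        # Resolve ties between equal-priority packets in round-robin order.
        for out, inputs in contenders.items():
            start = rr_pointer.get(out, 0)
            winner = min(inputs, key=lambda i: (i - start) % num_inputs)
            grants[out] = (winner, prio)
            busy_inputs.add(winner)
            rr_pointer[out] = (winner + 1) % num_inputs
    return grants
```

An input that loses a VC2 contention remains eligible at the VC1 and VC0 levels, so a lower-priority packet headed for a different, still-free port can be granted in the same cycle. Calling the function once per cycle with the remaining requests corresponds to the "repeat until all packets are served" step.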
The use of VCs with priorities ensures that we avoid protocol deadlocks in the network. As we use low priorities for requests, medium priorities for responses and a top priority for acknowledgments, there is no possibility for high-priority packets to clog the network, as their priority will be less than or equal to the requests that were accepted by the network. Thus, in the case of high network congestion, acknowledgments will exit the network first, then responses will be sent, and then more requests will be accepted. The crossbar module is responsible for forwarding all available packets to their output links. All packets then traverse the physical link to the neighbor node and are stored in the corresponding VC input queue. Finally, Fig. 6 shows how AXIOM boards can be interconnected in a 2D mesh.
 
Fig. 7. Neighbor/raw data messages flow.
Fig. 8. Long/RDMA write messages flow.
Fig. 9. RDMA read messages flow.
Fig. 10. The AXIOM network stack architecture.
4.2. Packet flow
Fig. 7 illustrates the flow of transmitting raw/neighbor messages. The OS hosted on node N0 posts a Tx neighbor/raw descriptor to the NIC queues for transmission. The NIC controller pops the descriptor, creates a raw packet and transmits it via VC1 to N1. When the N1 NIC receives the packet, it posts an Rx neighbor/raw descriptor, which the N1 OS reads to extract the payload.
Fig. 8 depicts the procedure of transmitting long/RDMA write messages between nodes N0 and N1. The N0 OS posts a Tx long/RDMA write descriptor to the NIC queue. The NI internal controller parses the descriptor and transmits to N1, via VC1, an init packet that designates the data payload size, followed by long/RDMA write packets, each carrying a subset of the requested data. On the N1 side, when the init packet is received, the local NI associates a local hardware counter with the corresponding message and initializes its value to the number of expected long/RDMA write packets. As soon as all packets are received, the NI posts an Rx long/RDMA write descriptor that the N1 OS can parse to retrieve the requested data. Moreover, the N1 NIC sends an ack packet via VC2 to N0 with the total number of bytes received. When the N0 NIC receives it, if so configured by the OS, it can post an Rx descriptor to signal the N0 OS that the long/RDMA write was successfully transmitted.
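The counter-based completion tracking in this flow can be sketched in a few lines of Python. The packet tuples, the fixed `chunk` size per data packet, and the class name are assumptions made for this example; the text specifies only that the init packet carries the payload size, that the counter starts at the number of expected packets, and that the ack reports the total bytes received. The sketch also assumes a non-empty payload.

```python
def make_packets(payload, chunk):
    """Sender side: an init packet with the payload size, then the data packets."""
    chunks = [payload[i:i + chunk] for i in range(0, len(payload), chunk)]
    return [("init", len(payload))] + [("data", c) for c in chunks]


class Receiver:
    """Receiver side: count down the expected packets, then acknowledge."""

    def __init__(self, chunk):
        self.chunk = chunk
        self.counter = 0          # models the hardware counter for this message
        self.buffer = bytearray()
        self.rx_descriptors = []  # completed messages made visible to the OS

    def on_packet(self, packet):
        kind, body = packet
        if kind == "init":
            # Number of data packets to expect: ceil(payload_size / chunk).
            self.counter = -(-body // self.chunk)
            self.buffer = bytearray()
            return None
        self.buffer += body
        self.counter -= 1
        if self.counter == 0:
            self.rx_descriptors.append(bytes(self.buffer))
            return ("ack", len(self.buffer))  # total number of bytes received
        return None
```

Feeding the output of `make_packets(b"x" * 10, 4)` into a `Receiver(4)` yields no response for the init packet and the first two data packets, then an ack for 10 bytes on the last one.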
Finally, Fig. 9 shows the transmission procedure of RDMA read messages between nodes N0 and N1. The N0 OS posts a Tx RDMA read descriptor to the NIC queue. The NI controller parses the descriptor and transmits via VC0 an RDMA read request packet. On the N1 side, when the RDMA read packet is received, the local NI essentially follows the RDMA write procedure described above in order to transmit all requested data to N0. N0 associates a local hardware counter with the corresponding message and initializes its value to the number of expected long/RDMA write packets. Again, when all packets are received, the N0 NI transmits an ack packet to N1 with the number of bytes that were received, and also posts an Rx RDMA read descriptor to signal the local OS that all requested data have arrived at the designated address.
5. The AXIOM network drivers
The AXIOM network stack is a software stack developed on top of Linux, which is meant to provide efficient access to the AXIOM NIC features, including remote DMA transfers.
The network stack (see Fig. 10) is composed of a Linux kernel driver that provides a proper interface for the user libraries, exposing high-level constructs that are then mapped onto the NI registers. An additional kernel driver takes care of memory allocation (see Section 5.2). In user space, a set of libraries and daemons (see Section 5.3) is used to provide user-space services to the AXIOM applications.
.1. Network interface kernel drivers 
The architecture of the kernel drivers handling the network is
epicted in Fig. 11 . The main components of the stack are the fol-
owing: 
• A set of software queues (one per port) for the small messages;
• A set of software queues (one per port) for the descriptors of
the long messages;
• An RDMA queue to store the descriptors of the RDMA requests;
• A pool of descriptors that are pre-allocated by the driver to
be used to automatically allocate long messages upon their
arrival;
• A set of kernel threads that are responsible for polling the
incoming message queues, demultiplexing their content into local
kernel-level buffers, and for filling the long message descriptor
FIFOs.
All these components export custom IOCTL commands to
userspace applications (see Section 5.3).
Fig. 11. The AXIOM network stack architecture.
Fig. 12. The three levels of the AXIOM Allocator.
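The demultiplexing step performed by the polling kernel threads of Section 5.1 can be pictured as sorting incoming messages into one software queue per port. The following is a minimal user-level sketch, not the driver code: the port count, the `(port, payload)` message shape and the drop policy for unknown ports are all assumptions made for illustration.

```python
from collections import deque

NUM_PORTS = 4  # hypothetical; the paper does not state the port count


def demultiplex(incoming, num_ports=NUM_PORTS):
    """Sort (port, payload) pairs from one incoming queue into per-port queues."""
    per_port = [deque() for _ in range(num_ports)]
    dropped = 0
    for port, payload in incoming:
        if 0 <= port < num_ports:
            per_port[port].append(payload)
        else:
            dropped += 1          # no software queue exists for this port
    return per_port, dropped


incoming = [(0, b"ping"), (2, b"barrier"), (0, b"pong"), (7, b"stray")]
queues, dropped = demultiplex(incoming)
print([list(q) for q in queues], dropped)
```

Keeping one FIFO per port preserves the arrival order seen by each consumer, which is the property a userspace reader of a given port would rely on.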
An additional kernel driver is dedicated to memory management.
The main idea of the driver is that each board dedicates
a contiguous physical memory range to RDMA transactions. That
memory range is handled by the AXIOM memory driver, which is
responsible for assigning subsets of that memory range to the
various user processes. These per-process memory assignments are
used by the AXIOM memory allocator to provide dedicated and
shared memory across the AXIOM cluster.
5.2. Memory allocator
The AXIOM memory allocator is responsible for the memory
subsystem used by the AXIOM drivers and by the AXIOM
applications.
The starting point is the physical memory available to all nodes.
We assume that each node has its physical RAM
mapped at similar addresses (this is true in the case of AXIOM,
where the cluster is composed of homogeneous nodes). The
available physical memory is at least partly RDMA-addressable.
The main idea behind the AXIOM allocator is to handle a part
of the memory of each node in the AXIOM cluster in a way
compatible with the RDMA support of the AXIOM NIC described in
Section 4.1. This is achieved on each node by reserving a dedicated
range of contiguous physical memory; that reserved memory is
outside the memory range directly managed by the Linux kernel,
and is instead managed by the AXIOM allocator, which is
responsible for mapping memory regions to the various processes
composing an AXIOM application.
The memory that can be allocated by the AXIOM allocator
is either private or shared. Allocating private
memory guarantees unique address ranges only on the node
requesting it (that is, two nodes may end up allocating private
memory at the same virtual address). Allocating shared memory
guarantees that the allocated range of memory is unique across
the whole AXIOM cluster. Note that the allocator provides guarantees
on the uniqueness of the address in the cluster, but not on its
coherency or synchronization, which are guaranteed by the higher
software layers based on DF-threads or OmpSs/GASNet.
The AXIOM allocator is internally composed of three levels (see
Fig. 12):
Level 1 is responsible for reserving regions of memory at cluster
level. The idea is that this reservation is only requested at the start/end
of an application to reserve the maximum (shared or private)
memory used by the application. This level is implemented partly
inside the device driver (to enforce memory mapping and correct
memory addressing), and partly in the axiom-init application (on the
master node of the cluster, see later).
Level 2 is responsible for allocating macro-blocks of shared
memory to specific nodes. In other words, this coarse-grained allocator
is responsible for guaranteeing that the allocation of shared
memory will return unique addresses at cluster level. This level is
implemented inside the axiom-run application (on the application
master node, see later).
Level 3 is finally responsible for each single allocation of
private/shared memory. Various allocators can be supported at this
stage, such as LMM [23,24] and in the future TLSF [25]. These
fine-grained allocators work locally on each node, providing quick
and efficient allocation of memory that has been reserved by Levels
1 and 2.
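The division of labour between the allocator levels can be sketched as follows. This is an illustrative model only: the class names, base address and sizes are made up, Level 1 is reduced to a single pre-reserved range, and the Level 3 stand-in is a plain bump allocator rather than LMM or TLSF.

```python
class MacroBlockAllocator:
    """Level 2 (hypothetical model): hands out disjoint macro-blocks of the
    shared region reserved at Level 1, so addresses are unique cluster-wide."""

    def __init__(self, base, size, block_size):
        self.next_free = base
        self.limit = base + size      # end of the Level 1 reservation
        self.block_size = block_size

    def grant(self, node_id):
        if self.next_free + self.block_size > self.limit:
            raise MemoryError("Level 1 reservation exhausted")
        start = self.next_free
        self.next_free += self.block_size
        return (node_id, start, start + self.block_size)


class NodeAllocator:
    """Level 3 stand-in: a bump allocator working locally inside one
    macro-block. The paper uses LMM here; this is NOT LMM."""

    def __init__(self, block):
        _, self.cursor, self.end = block

    def alloc(self, size):
        if self.cursor + size > self.end:
            raise MemoryError("macro-block exhausted")
        addr = self.cursor
        self.cursor += size
        return addr


# Level 1: one cluster-wide reservation made at application start
# (base address and sizes are made-up values).
level2 = MacroBlockAllocator(base=0x4000_0000, size=4 << 20, block_size=1 << 20)
node0 = NodeAllocator(level2.grant(node_id=0))
node1 = NodeAllocator(level2.grant(node_id=1))
a = [node0.alloc(4096), node0.alloc(4096)]
b = [node1.alloc(4096)]
print([hex(x) for x in a + b])
```

Only the macro-block grant needs cluster-wide coordination in this model; every later allocation is a purely local operation, which is why the fine-grained level can be fast.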
5.3. User space applications and service libraries
The AXIOM network drivers have been developed together with
a set of applications that complement the driver functionality.
The first application we describe is the axiom-init Linux
daemon. The idea is that axiom-init implements in user space
some of the services that conventional networks (like TCP/IP) include
in their kernel layers. This approach has the advantage of limiting
the size of the AXIOM drivers, leaving everything that is
configuration dependent in user space (thus allowing easier changes).
axiom-init is responsible for the following services:
• It handles the initialization part of the cluster. In particular, it
is responsible for running the AXIOM discovery algorithm (used
to discover the topology of the network and set up the IDs of
the nodes), for computing the node routing tables and setting
them on each node of the cluster.
• It is responsible for a set of diagnostic protocols that are
handled with a set of separate applications like axiom-ping (note
that in TCP/IP implementations these services are typically
implemented in the kernel driver; in AXIOM, axiom-init
provides the same kind of support but in user space).
• It handles the cluster level synchronization needed by the Level
1 of the AXIOM allocator. In particular, it stores the data
structures that keep track of the various allocations in the first node
of the cluster (named master node of the cluster).
• It provides a set of services for starting applications, like
providing the application ID, as well as a service to spawn
processes in the cluster.
Table 2
Comparison of COTSon with other DSE environments.
Features                 Sniper  Graphite  Gem5  MARSx86  COTSon
Timing directed          No      No        Yes   No       No
Functional directed      Yes     Yes       No    Yes      Yes
User level               Yes     Yes       Yes   No       No
Full system simulation   No      No        Yes   Yes      Yes
Parallel (In node)       Yes     Yes       No    No       No
Parallel (Multi-node)    No      No        No    No       Yes
Shared cache             Yes     No        Yes   Yes      Yes
Another set of applications is then provided to implement a
simple messaging and diagnostic interface. In this list, we include: 
• axiom-info, which is used to provide information on the
node ID, on the routing table on the local node, and on the set
of interfaces available on the node.
• axiom-traceroute, axiom-netperf, axiom-ping,
which are used to provide services similar to their Unix
counterparts.
• axiom-send, axiom-recv, axiom-rdma, which are used to
send and receive long, raw messages, and to trigger test RDMA
operations.
Finally, another fundamental application in the AXIOM stack is
axiom-run. axiom-run is used to provide a set of services to
AXIOM applications:
• An AXIOM application running on the cluster is composed of a
single executable which is run once for each node of the
cluster. axiom-run provides the AXIOM application startup
procedure, allowing a process to be spawned on a subset of
the nodes of the cluster. In order to do that, it uses the spawn
service provided by axiom-init to start a process on a single
node.
• It provides the support for application termination. In
particular, in case one of the processes spawned on a node
terminates, it is responsible for terminating all the other processes on
the nodes of the cluster.
• It provides a standard output redirection service for the
spawned processes. In other words, the standard output
of the spawned process is captured by slave executions of
axiom-run on each node, and redirected to the master
application node (that is, the node from where the user started the
application initially).
• It provides additional services to the AXIOM applications
such as a synchronization barrier for all nodes running the
application, and the Level 2 of the AXIOM allocator (which
requires synchronization only at application level). These
additional services are located in the first node used by the
application, which is named application master node.
axiom-run and axiom-init are provided together with
their respective user libraries. These user level libraries are used
to let third party applications interact with them in a simpler way.
6. AXIOM Evaluation Platform (AEP)
Design space exploration (DSE) and its automation are an
important part of our current performance evaluation and power
estimation methodologies [26–28]. The method proposed in AXIOM
requires first exploring and modeling parts on the simulator and
then, once the DSE is completed, implementing them on the
FPGA-based prototypes. This has the considerable advantage of
allowing the software stack to be developed early. AEP is made of
two important tools: the HP-Labs COTSon simulator [29] and the
Xilinx Zynq based platform. Given the goals of this project, we also
needed a more flexible platform for the DSE. The simulation
platform is used to better understand bottlenecks (e.g., the congestion
on a bus, cache size), which are not trivial to track on the FPGA
prototyping platform. COTSon also includes an interface to the HP
McPAT tool [30] for estimating the power consumption. Table 2
presents some advantages of using COTSon for our purpose.
COTSon uses the so-called “functional-directed” approach. The
simulator permitted us to execute full-system simulations. The
“mediator” of COTSon represents the model of a switch, and our
aim is to modify it to model the behavior of our custom
interconnects. The motivation for multiple interconnects derives from the
AXIOM project design, which aims to separate the traffic for building
a multi-board system from the traffic for the internet related
connection. With the COTSon mediator, we can model both cases.
SimNow is the virtual machine (VM), which models all details of
a computer. AMD also provides a separate SDK to model any
particular board that has to be plugged in (such as a network card
or a GPU).
6.1. Thread support
Synchronization and distribution of data can be managed
efficiently by reorganizing the execution in such a way that the
threads follow more closely the data flow of the program (such as
with DF-Threads [31]). DF-Threads can be efficiently implemented
by a distributed hardware thread scheduler which supports fault
tolerance at the hardware level and efficient fine-grain dataflow
thread distribution [32]. To reduce the thread management
overhead, the scheduling needs to be accelerated in hardware, by
mapping its structure into the FPGA. A DF-Thread is defined as a
function that expects no parameters and returns no parameters. The
body of this function can refer to data which reside at the
memory location for which it has got the pointer. The DF-Thread API
is summarized below [33]:
• void *DF_TSCHEDULE(bool cnd, void *ip,
uint64_t sc): Allocates the resources (a DF-frame of
size sc words and a corresponding entry in the distributed
thread scheduler, or DTS) for a new DF-Thread and returns
a frame pointer fp. The ip is the instruction pointer of the
DF-Thread. The allocated DF-Thread is not executed until its sc
reaches 0 and the boolean condition cnd is also satisfied.
• void DF_DESTROY(): Releases the allocated resources held by
the current DF-Thread.
• uint64_t DF_TREAD(uint64_t offset): Loads the data
indexed by offset from the DF-frame of the current thread.
• void DF_TWRITE(uint64_t val, void *fp, uint64_t off):
Stores the data val into the
DF-frame pointed to by fp at the specified offset off.
• void *DF_TALLOC(uint64_t size, uint8_t type):
Allocates a block of memory of size words and returns the
pointer (or null), while type specifies the special purpose
memory type.
• void DF_TFREE(void *p): Frees memory pointed to by p.
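To make the firing rule of the calls listed above concrete, the following toy model mimics them on a single node in software. It is a sketch, not the AXIOM scheduler: the class and method names are hypothetical, frames are Python lists, and it assumes that each write into a frame decrements that frame's synchronization count (the list above only states that the thread runs once sc reaches 0).

```python
class ToyDFScheduler:
    """Single-node software model of the DF-Thread calls (hypothetical)."""

    def __init__(self):
        self.frames = {}       # frame pointer -> frame record
        self.ready = []        # frame pointers whose sc has reached 0
        self.next_fp = 0
        self.current = None    # frame pointer of the running DF-Thread

    def tschedule(self, cnd, body, sc):
        fp = self.next_fp
        self.next_fp += 1
        self.frames[fp] = {"body": body, "slots": [0] * sc, "sc": sc, "cnd": cnd}
        if sc == 0 and cnd:
            self.ready.append(fp)
        return fp

    def twrite(self, val, fp, off):
        frame = self.frames[fp]
        frame["slots"][off] = val
        frame["sc"] -= 1                       # assumption: one write, one count
        if frame["sc"] == 0 and frame["cnd"]:
            self.ready.append(fp)

    def tread(self, offset):
        return self.frames[self.current]["slots"][offset]

    def destroy(self):
        del self.frames[self.current]

    def run(self):
        while self.ready:
            self.current = self.ready.pop(0)
            self.frames[self.current]["body"](self)


results = []


def adder(dts):
    results.append(dts.tread(0) + dts.tread(1))
    dts.destroy()


def producer(dts):
    fp = dts.tschedule(True, adder, 2)   # adder waits for two inputs
    dts.twrite(40, fp, 0)
    dts.twrite(2, fp, 1)                 # second write makes adder ready
    dts.destroy()


dts = ToyDFScheduler()
dts.tschedule(True, producer, 0)
dts.run()
print(results)
```

The consumer thread never polls for its inputs: it becomes runnable only when the last producer write arrives, which is the data-driven behaviour that a hardware scheduler can exploit.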
7. Application scenarios of AXIOM
Smart video surveillance and smart home applications are now
a hot topic in CPS and we have customized these two scenarios for
our AXIOM platform.
A. Smart Video Surveillance (SVS)
For the SVS case study, we selected an automated smart marketing
scenario involving real-time face detection in crowds while
performing demographics estimation (e.g., age, gender and ethnicity).
The SVS scenario will employ state-of-the-art cognitive computer
vision techniques based on models built from a boosted cascade
of classifiers combined with deep convolutional neural networks.
A low-power high-performance inference engine for such models
will be implemented in the reconfigurable logic of the SoC using
the OmpSs programming model. Since this scenario will analyze
high-definition (HD) video feeds, other computational challenges
related to video processing must also be addressed. HD video
stream decoding (i.e., format parsing, codec implementation,
demuxing and color space conversion) will be performed by relying
on a heterogeneous computing approach combining single
instruction, multiple data (SIMD) instructions with on-die logic blocks.
B. Smart Home Living (SHL)
Regarding the SHL case study, a solution to enhance the
security level of the house and to increase the comfort of the smart
home has been implemented. The solution consists of
a system that is able to analyze multimedia streams captured at
specific points inside and outside the house. Fig. 13 shows an
overview of the SHL scenario. The system receives the multimedia
streams from the networks, splits and decodes the audio and the
video streams, and analyses the raw data using machine learning
algorithms to extract information from the two media components.
The information extracted from the media data is correlated to
define the feedback that will be sent to the user of the house. The
main goal of this project is to achieve a high level of automation
and to allow a natural interaction between the user and the house.
To achieve this goal, the time needed to analyze the data
and to produce the feedback must not interrupt the flow of the
user's actions, which represents a strict timing constraint for our system.
To take advantage of the heterogeneity and the cluster
architecture of the AXIOM system, OmpSs directives will be introduced
in the code of the SHL application. Different solutions will be
explored for the purpose of defining the tasks that can be
concurrently executed and defining their granularity to satisfy the
requirements of the SHL application. OmpSs@FPGA directives will
be used to efficiently synthesize the most time-consuming sections
of the algorithms on FPGA resources, and OmpSs@Cluster will be
used to split the execution of the application across different nodes of
the cluster to meet the timing constraints and at the same time to
minimize the hardware resources and energy consumption.
Fig. 13. The AXIOM smart home living (SHL) scenario.
Table 3
OmpSs experimental results: we use as many worker threads as number of cores.
Machine                    w/o NEON            w/ NEON             Speed-up
                           Time (s)  GFLOPS    Time (s)  GFLOPS
UDOO: 1 core (1 node)      7.6       0.28      2.90      0.74      2.6×
UDOO: 4 cores (1 node)     1.9       1.13      0.96      2.20      7.9×
UDOO: 8 cores (2 nodes)    1.3       1.61      0.75      2.84      10.1×
Zynq 706 board (FPGA)      Time = 0.5 s, GFLOPS = 4.06             15.3×
Fig. 14. Execution time for the matrix multiply of size 2048 × 2048 and blocks of
128 × 128 using NEON SIMD instructions.
8. Evaluations and results
In this section, we present some preliminary results for some
software and hardware prototypes the AXIOM project is designing
and implementing.
8.1. OmpSs timing results
Table 3 shows the execution time and GFLOPS of the matrix
multiplication of Fig. 2 for different execution environments. Those
environments are: i) one core of the UDOO x86 cluster [34,35],
with and without NEON SIMD instructions, ii) all cores (four) of the
same node of the UDOO cluster, with and without NEON
instructions, iii) all cores of the two-node UDOO cluster, with and without
NEON instructions and iv) a Zynq ZC706 SoC using the FPGA to
accelerate the matrix multiply tiles. All the results are for a tiled
matrix multiply with BS = 128 and 1024x1024 matrices. The
number of worker threads performing the computation is the same
as the number of cores used. Speed-up results are obtained by
comparing each environment result, with NEON instructions or FPGA,
to the UDOO 1 core environment without NEON instructions. On the
one hand, the use of NEON SIMD instructions significantly
improves the application performance and, in general, there seems
to be good scalability inside one node. On the other hand, the Zynq
ZC706 board result, using FPGA accelerators for the matrix
multiply, shows much better performance than the UDOO cluster.
Based on the speed-ups in Table 3 (15.3× versus 10.1×), the AXIOM
platform can be expected to outperform the two-node UDOO
cluster (using NEON instructions) by about 1.5×, if only the FPGA is used
in each node.
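The speed-up column of Table 3 can be re-derived from the reported execution times, since every speed-up is taken relative to the single-core UDOO run without NEON (7.6 s). The short check below uses only numbers from Table 3; the function and variable names are ours.

```python
BASELINE_S = 7.6  # UDOO, 1 core, without NEON (Table 3)

# Reported time in seconds and reported speed-up, both taken from Table 3.
reported = {
    "UDOO 1 core + NEON": (2.90, 2.6),
    "UDOO 4 cores + NEON": (0.96, 7.9),
    "UDOO 8 cores + NEON": (0.75, 10.1),
    "Zynq 706 (FPGA)": (0.5, 15.3),
}


def speedup(time_s, baseline_s=BASELINE_S):
    """Speed-up of a run relative to the baseline execution time."""
    return baseline_s / time_s


for name, (time_s, table_value) in reported.items():
    print(f"{name}: {speedup(time_s):.1f}x (table: {table_value}x)")

# Ratio behind the "about 1.5x" statement: FPGA versus two-node UDOO with NEON.
fpga_vs_cluster = 15.3 / 10.1
print(f"FPGA vs. two-node UDOO with NEON: {fpga_vs_cluster:.2f}x")
```

The Zynq entry recomputes to 15.2× rather than 15.3×, which is consistent with its time being reported to a single significant digit.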
Fig. 14 shows the execution time of a matrix multiply with
BS = 128 and 2048x2048 size for 1 and 2 UDOO nodes and 1
to 4 worker threads per node, using NEON SIMD
instructions. With only one node and 4 threads, the OmpSs matrix
multiply already achieves a speed-up of 3.8× compared to the case
of 1 node and 1 thread, showing that OmpSs@Cluster scales well
inside one node. However, it seems that there are some
overheads that reduce the scalability when using the two nodes of the
cluster. One possible reason is the use of one helper thread
to do the communication management, which is only necessary when
there is more than one node in the cluster. That causes
oversubscription when using 4 worker threads plus the communication helper
thread in the 2-node case (with only 4 cores per node). Although
for large systems this is not an issue, improvements on this may
benefit applications using the full system resources. Furthermore,
the connection over Ethernet may also introduce synchronization
and communication overheads. Therefore, the use of the high-speed
dedicated interconnection AXIOM-Link should help to reduce
those overheads and improve the scalability of the OmpSs
applications.
Fig. 15. Execution time of the OmpSs N-body using 1 and 2 UDOO nodes, with up
to 4 threads per node.
Fig. 16. Paraver trace of the OmpSs M × M using 2 nodes UDOO x86, with 4 threads
per node.
Fig. 17. Paraver trace of the OmpSs MxM using 1 SMP (top) and 1 helper thread
(bottom) for two FPGA accelerators.
Fig. 18. Paraver trace of the OmpSs M × M using a master thread (top) to submit
tasks to two FPGA accelerators (the two on the bottom).
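Two quantities behind the single-node versus two-node discussion of the matrix multiply results are easy to make explicit: the parallel efficiency implied by the 3.8× speed-up on 4 threads, and the oversubscription condition created by the communication helper thread. The helper names below are ours, and the check is plain arithmetic rather than a performance model.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear speed-up that was actually achieved."""
    return speedup / threads


def oversubscribed(worker_threads, helper_threads, cores):
    """True when more software threads than cores compete on one node."""
    return worker_threads + helper_threads > cores


# One node, 4 worker threads: 3.8x speed-up reported in the text.
print(parallel_efficiency(3.8, 4))

# Single node: no communication helper thread is needed.
print(oversubscribed(worker_threads=4, helper_threads=0, cores=4))
# Two-node run: each 4-core node hosts 4 workers plus 1 helper thread.
print(oversubscribed(worker_threads=4, helper_threads=1, cores=4))
```

Under this reading, running 3 workers plus the helper on each node would avoid oversubscription at the cost of one compute thread per node; whether that trade-off pays off is not something the reported results settle.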
Fig. 15 shows the execution of the OmpSs N-Body with target
device(smp) for 1 and 2 nodes and block size BS = 128, and
from 1 to 4 threads per node. The number of iterations done in
the N-Body is 100 and the number of particles is 8K. In this case,
the block size determines how many particle forces or
updated particle positions are computed by one task. Results
show, as for the matrix multiply, that OmpSs@Cluster scales
well inside one node, whereas the scalability when
using the two nodes is not ideal. The reason seems to be the same
as before: the overhead of communication and the oversubscription
due to the communication helper thread. Again, the use
of the AXIOM-Link interconnection should help.
8.2. Profiling and tracing results
In this sub-section, profiling and tracing results are presented.
Cluster profiling results have been obtained using a cluster of
UDOO x86 boards, whereas the one-node traces with fpga task
executions were obtained on a Zynq 706 board.
8.2.1. Cluster profiling
Fig. 16 shows the execution of the OmpSs matrix multiply
(BS = 128) in Fig. 2 with target device(smp). The cluster has
two nodes with four threads per node, each of them executing smp
tasks. The Paraver trace has as many horizontal lines as threads
running OmpSs tasks; therefore, there are eight horizontal lines
(one per thread). The different colors denote different thread
states along the execution time of the application. Green flags indicate
trace events (e.g., start/end of a task). The main area colors in the trace
have the following meaning: pink areas correspond to the task
creation on the master thread (top), yellow areas correspond to
smp tasks running in the SMP, light red in the master thread
(first horizontal line) corresponds to a global task synchronization,
and dark red corresponds to the idle state, where those threads are
doing nothing. The trace shows that tasks have been evenly
distributed among the two UDOO nodes, achieving a promising
performance result. In this Paraver trace, the dependences between
tasks are not shown for clarity.
8.2.2. One node profiling
In this case, and for the purpose of presenting an execution
trace that helps to detect a performance bottleneck, we have
selected a sub-optimal hardware/software co-design of the
parameters and target devices of the tasks of an OmpSs
application running in one node and accelerating tasks in the FPGA.
Fig. 17 shows a Paraver trace of the parallel execution of the OmpSs
matrix multiply (BS = 128) in Fig. 2, running on a Zynq machine
and using one thread for smp task executions and one thread for
fpga task submissions to two accelerators. In this Paraver trace,
there is one thread (top) running tasks in the SMP and one thread
(helper thread) submitting tasks to two MxM accelerators in the
FPGA (bottom). Green flags indicate trace events (e.g., start/end of a
task) and yellow lines between events/states indicate task
dependences. The main color areas in the trace have the following meaning:
pink areas correspond to the task creation on the master thread
(top), yellow areas correspond to smp tasks running in the SMP
and purple areas correspond to the submission of fpga tasks to one
of the FPGA accelerators. Light green in the helper thread
corresponds to the thread waiting for more tasks to be submitted to the
FPGA.
On the one hand, this execution trace shows significant load
imbalance between the two threads. The reason is the decision to
execute tasks in the SMP when the FPGA, at the same task granularity,
is much faster than the SMP. The programmer could decide
to specify only fpga tasks and/or change the task granularity in the
SMP. In fact, for the same task scheduling policy, when the
programmer decides to specify only target device(fpga) for the
MxM task, the performance is much better. Fig. 18 shows an
execution trace for this scenario at the same scale as the previous
execution trace, achieving a speed-up of more than 2× (considering
the matrix multiply part of the execution trace).
On the other hand, those traces do not give much
information about the memory transfers (DMA) from/to Host/FPGA, the possible
overlapping of memory transfers and FPGA acceleration, and the FPGA
computation time in the two accelerators. For this reason, it is
important to have hardware profiling support to provide useful FPGA
profiling information to the programmer, from inside the FPGA.
Fig. 19 shows a zoomed-in view of an execution trace that includes
the DMA transfers (input - DMA_in and output
- DMA_out) and the FPGA acceleration computation when using two
FPGA accelerators. This helps to deepen the performance analysis of the
application. For instance, it is possible to see that DMA_in,
DMA_out and the FPGA computations of the two different
accelerators may be overlapped during the execution, and that the DMA_in
execution time is around 3 times that of DMA_out, and very similar
to the FPGA acceleration computation time. With that, the programmer
and the programming model developer can detect bottlenecks in the
applications or the task execution model.
Fig. 19. Partial view of a Paraver trace of the OmpSs M × M showing the
DMA transfers (DMA_in and DMA_out) and computation time (FPGA acc) for two
accelerators.
Fig. 20. Instruction count normalized to the matrix size 256 (n = 256) and b is the
block-size.
Fig. 21. Speed-up of user cycles count normalized to the matrix size 256 (n = 256)
and b is the block-size.
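The benefit of overlapping DMA_in, FPGA computation and DMA_out can be estimated with a back-of-envelope pipeline model. Everything in this sketch is an assumption except the ratios: the stage times use an arbitrary unit in which DMA_out takes 1 and DMA_in and the FPGA computation take 3 each (the proportions read from the trace), the task count is made up, and a perfect three-stage pipeline is an idealization of what the hardware can actually overlap.

```python
def serial_time(n_tasks, stages):
    """Total time when each task runs all its stages with no overlap."""
    return n_tasks * sum(stages)


def pipelined_time(n_tasks, stages):
    """Ideal pipeline bound: fill once, then one task per slowest stage."""
    return sum(stages) + (n_tasks - 1) * max(stages)


# (DMA_in, FPGA compute, DMA_out) in arbitrary units, ratio 3 : 3 : 1.
stages = (3.0, 3.0, 1.0)
n_tasks = 8  # hypothetical number of accelerator tasks

print(serial_time(n_tasks, stages))
print(pipelined_time(n_tasks, stages))
print(serial_time(n_tasks, stages) / pipelined_time(n_tasks, stages))
```

With these ratios the input transfer is as long as the computation, so in this idealized model the pipeline is limited equally by DMA_in and by the accelerator, and shortening DMA_out alone would barely change the total.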
8.3. DF-Threads initial results
In this sub-section we report our experimental results
when the AXIOM platform consists of 1, 2 or 4 nodes. In this case, the
execution model is based on DF-Threads and the
methodology illustrated in Section 6. For simplicity, we use a well-known
benchmark, the blocked matrix multiplication (see Fig. 2).
The parallelization is based on the ratio between the matrix size
n and the block size b (i.e., the expected number of DF-Threads is
n/b). In our experiment, we consider three matrix sizes, n = 256,
512, 1024, while the block size is fixed to b = 4, and we report
the results in Fig. 20 and in Fig. 21. In such cases, the number
of DF-Threads is respectively 64, 128 and 256. The interesting
result is related to the total number of instructions. As we can see
from Fig. 20, for each matrix size the instruction count has almost
the same value when we vary the number of nodes from 1 to 4 (three
overlapping lines). The reason is the small overhead
of managing DF-Threads across nodes. Moreover, the number of
instructions closely follows the theoretical increase of a classical
blocked matrix multiplication (i.e., the number of
instructions increases as O(n^3)). We normalized the total number of
instructions for each curve to the case of matrix size n = 256 to
compare the three experimental cases and the theoretical O(n^3) line in
Fig. 20.
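The thread counts and the theoretical reference line used in this comparison follow directly from n and b. The snippet below reproduces that arithmetic; the function names are ours, and the cubic curve is only the O(n^3) reference normalized to n = 256, not measured data.

```python
BLOCK_SIZE = 4   # b, fixed in the experiment
BASE_N = 256     # matrix size used for normalization


def expected_df_threads(n, b=BLOCK_SIZE):
    """Expected number of DF-Threads, n/b."""
    return n // b


def normalized_cubic(n, base=BASE_N):
    """Theoretical O(n^3) instruction count, normalized to n = base."""
    return (n / base) ** 3


for n in (256, 512, 1024):
    print(n, expected_df_threads(n), normalized_cubic(n))
```

Doubling n therefore doubles the number of DF-Threads but multiplies the theoretical instruction count by 8, so the work per thread grows by a factor of 4 at each step.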
As we can see from Fig. 21, the scalability improves
significantly when we have a larger number of threads. In the case of
n = 1024, b = 4 the speed-up is almost ideal (for four nodes the
speed-up is almost 4). We do not report here the effect of different
block sizes, but for smaller block sizes we typically achieve better
scalability [36]. What has to be stressed here is the possibility to
scale performance across nodes that have separate address spaces.
9. Related works
In the last few years the CPS domain has gained tremendous
importance, and as a consequence a lot of work has been done by
academia as well as industry. We refer readers to the general
survey on CPS [37] for more information. There are many
similar/relevant completed (such as CONTREX [10], MultiPARTES [13],
ASAM [38], SCUBA [39]) and on-going European projects (such as
DREAMS [11], EMC2 [12]) which focus on many different
aspects of CPS (mostly on the mixed-criticality application domain).
We discuss the few of them that are most relevant to
our work.
The CONTREX project mainly focused on developing energy
efficient and low-cost hardware designs for embedded mixed-criticality
system based applications (such as automotive, aeronautics and
telecommunications). The primary aim of CONTREX is to enable
energy and cost efficiency through the analysis and optimisation
of real-time, power, and temperature aspects according to the demands
of the different criticality levels. The main objective of MultiPARTES is to
provide an execution environment and tools to support the
development of mixed-criticality applications. It offers a multicore platform
based virtualisation layer over partitioned embedded platforms to
separate the execution environments in multicore systems.
Completed EU projects such as SCUBA also developed a CPS
architecture for self-organizing, cooperative and robust building
automation systems (BAS) [39]. Similarly, automatic architecture synthesis
and application mapping (ASAM) targeted a uniform process for
heterogeneous multicore embedded systems based on application
specific instruction-set processors. It aims at defining a new design
environment by providing a unified design methodology and a set of
tools to allow rapid exploration of algorithms and architecture design
spaces, as well as system-level synthesis.
Ongoing EU projects include DREAMS, which focuses on a cross-domain
architecture based on open-source (XtratuM) virtualization
and design tools for supporting the execution of mixed-criticality
applications on networked multicore chips, and EMC2, which provides a
flexible MPSoC architecture that can be tailored by middleware for
executing real-time and mixed-criticality applications. These
projects are highly focused on (mixed) critical applications
and evaluate their platforms mostly on avionics and wind power
application domains. In contrast, AXIOM provides a generic
programming model which can work with its high-speed interconnect
subsystem on multiple platforms, together with its full software
stack as well as proper hardware support.
0. Conclusions 
In this paper we presented the AXIOM platform, which provides an integrated approach including a heterogeneous SoC (currently with an FPGA) board, a new high-performance connection link to form clusters of processing nodes, and a task-based programming model that supports single- and multiple-node heterogeneous parallel execution transparently to the programmer.
To evaluate the AXIOM platform, we ran two well-established micro-benchmarks, namely matrix multiplication and N-body simulation, on the project software and hardware platforms. Results show that performance scales well with the number of deployed processing nodes, while the development effort stays low for application programmers.
Acknowledgment 
This work is partially supported by the European Union H2020 program through the AXIOM project (grant ICT-01-2014 GA 645496) and HiPEAC (GA 687698), by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project, and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). We also thank the Xilinx University Program for its hardware and software donations.
References 
[1] Cyber Physical Systems Public Working Group and others, Framework for cyber-physical systems, Preliminary Discussion Draft, Release 0.8. 2015.
[2] J. Sztipanovits, S. Ying, Strategic R&D Opportunities for 21st Century Cyber-Physical Systems, Technical Report for Steering Committee for Foundation in Innovation for Cyber-Physical Systems: Chicago, IL, USA, 13 March, 2012 Technical report.
[3] E. Geisberger, M. Broy, Living in a networked world: Integrated research agenda cyber-Physical systems (agendaCPS), Herbert Utz Verlag, 2015.
[4] E.A. Lee, Cyber physical systems: design challenges, in: Object Oriented Real-Time Distributed Computing (ISORC), 2008 11th IEEE International Symposium on, IEEE, 2008, pp. 363–369.
[5] R. Baheti, H. Gill, Cyber-physical systems, Impact Control Technol. 12 (2011) 161–166.
[6] T. Sanislav, L. Miclea, Cyber-physical systems-concept, challenges and research areas, J. Control Eng. Appl. Inform. 14 (2) (2012) 28–33.
[7] President’s Council of Advisors on Science and Technology (U.S.) and Marburger, J.H. and Kvamme, E.F. and United States. Executive Office of the President, Leadership under challenge: Information technology R&D in a competitive world. An assessment of the federal networking and information technology R&D program, Technical Report, DTIC Document. 2007.
[8] D. Theodoropoulos, D. Pnevmatikatos, C. Alvarez, E. Ayguade, J. Bueno, A. Filgueras, D. Jimenez-Gonzalez, X. Martorell, N. Navarro, C. Segura, C. Fernandez, C. Fernandez, J.R. Saeta, P. Gai, A. Rizzo, R. Giorgi, The AXIOM project (agile, extensible, fast i/o module), in: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, IEEE, 2015, pp. 262–269.
[9] C. Alvarez, E. Ayguade, J. Bueno, A. Filgueras, D. Jimenez-Gonzalez, X. Martorell, N. Navarro, D. Theodoropoulos, D.N. Pnevmatikatos, D. Catani, C. Scordino, P. Gai, C. Segura, C. Fernandez, D. Oro, J.R. Saeta, P. Passera, A. Pomella, A. Rizzo, R. Giorgi, The AXIOM software layers, in: Digital System Design (DSD), 2015 Euromicro Conference on, IEEE, 2015, pp. 117–124.
[10] K. Grüttner, R. Görgen, S. Schreiner, F. Herrera, P. Peñil, J. Medina, E. Villar, G. Palermo, W. Fornaciari, C. Brandolese, D. Gadioli, E. Vitali, D. Zoni, S. Bocchio, L. Ceva, P. Azzoni, M. Poncino, S. Vinco, E. Macii, S. Cusenza, J. Favaro, R. Valencia, I. Sander, K. Rosvall, N. Khalilzad, D. Quaglia, CONTREX: design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties, in: 2016 Euromicro Conference on Digital System Design (DSD), 2016, pp. 286–293.
[11] R. Obermaisser, Z. Owda, M. Abuteir, End-to-end real-time communication in mixed-criticality systems based on networked multicore chips, in: Euromicro Conference on Digital Systems Design, 2015, pp. 293–302.
[12] W. Weber, A. Hoess, F. Oppenheimer, Emc2 a platform project on embedded microcontrollers in applications of mobility, industry and the internet of things, in: Digital System Design (DSD), 2015 Euromicro Conference on, IEEE, 2015, pp. 125–130.
[13] S. Trujillo, A. Crespo, A. Alonso, J. Pérez, Multipartes: Multi-core partitioning and virtualization for easing the certification of mixed-criticality systems, Microprocess. Microsyst. 38 (8) (2014) 921–932.
[14] A. Duran, E. Ayguadé, R.M. Badia, J. Labarta, X.M. Luis Martinell, J. Planas, OMPSS: a proposal for programming heterogeneous multi-core architectures, Parallel Process. Lett. 21 (02) (2011) 173–193.
[15] Barcelona Supercomputing Center, Extrae instrumentation library, (accessed June 16, 2016) [Online]. Available: https://tools.bsc.es/extrae. 2016.
[16] V. Pillet, J. Labarta, T. Cortes, S. Girona, Paraver: A tool to visualize and analyze parallel code, in: Proceedings of WoTUG-18: Transputer and Occam Developments, volume 44, 1995, pp. 17–31.
[17] J. Balart, A. Duran, M. Gon, X. Martorell, E. Ayguadé, J. Labarta, Nanos mercurium: a research compiler for openMP, in: Proceedings of the European Workshop on OpenMP, volume 8, 2004, p. 56.
[18] A. Duran, R. Ferrer, E. Ayguadé, R.M. Badia, J. Labarta, in: A Proposal to Extend the openMP Tasking Model with Dependent Tasks, 37, 2009, pp. 292–305.
[19] J. Bueno, X. Martorell, R.M. Badia, E. Ayguadé, J. Labarta, Implementing OMPSs support for regions of data in architectures with multiple address spaces, in: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ACM, 2013, pp. 359–368.
[20] D. Bonachea, Gasnet specification, v1.1, 2002.
[21] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.0., 2016 (accessed June 16, 2016). [Online]. Available: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf. 2016.
[22] A.E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, D. Lorenz, OMPT: An openMP tools application programming interface for performance analysis, in: OpenMP in the Era of Low Power Devices and Accelerators, Springer, 2013, pp. 171–185.
[23] B. Ford, G. Back, G. Benson, J. Lepreau, A. Lin, O. Shivers, The flux OSKit: A substrate for OS and language research, in: Proceedings of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, October 1997, pp. 38–51.
[24] FLUX Research Group, OSKit List Memory Manager, (accessed May 17, 2017). [Online]. Available: https://www.cs.utah.edu/flux/oskit/html/oskit-wwwch25.html. (2012).
[25] M. Masmano, I. Ripoll, P. Balbastre, A. Crespo, A constant-time dynamic storage allocator for real-time systems, Real Time Syst. 40 (2) (2008) 149–179.
[26] C. Silvano, W. Fornaciari, G. Palermo, V. Zaccaria, F. Castro, M. Martinez, S. Bocchio, R. Zafalon, P. Avasare, G. Vanmeerbeeck, C. Ykman-Couvreur, M. Wouters, C. Kavka, L. Onesti, A. Turco, U. Bondik, G. Mariani, H. Posadas, E. Villar, C. Wu, F. Dongrui, Z. Hao, T. Shibin, Multicube: Multi-objective design space exploration of multi-core architectures, in: VLSI 2010 Annual Symposium, Springer, 2011, pp. 47–63.
[27] R. Giorgi, R.M. Badia, F. Bodin, A. Cohen, P. Evripidou, P. Faraboschi, B. Fechner, G.R. Gao, A. Garbade, R. Gayatri, S. Girbal, D. Goodman, B. Khan, S. Koliaï, J. Landwehr, N.M. Lê, F. Li, M. Lujànj, A. Mendelson, L. Morin, N. Navarro, T. Patejko, A. Pop, P. Trancoso, T. Ungerer, I. Watson, S. Weis, S. Zuckerman, M. Valero, TERAFLUX: harnessing dataflow in next generation teradevices, ELSEVIER Microprocess. Microsyst. 38 (8, Part B) (2014) 976–990.
[28] R. Giorgi, Exploring future many-core architectures: The TERAFLUX evaluation framework, Advances in Computers, Elsevier, 2016.
[29] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, D. Ortega, COTSOn: infrastructure for full system simulation, ACM SIGOPS Oper. Syst. Rev. 43 (1) (2009) 52–61.
[30] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, N.P. Jouppi, McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, 2009, pp. 469–480.
[31] R. Giorgi, P. Faraboschi, An introduction to DF-threads and their execution model, in: Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on, IEEE, 2014, pp. 60–65.
[32] R. Giorgi, A. Scionti, A scalable thread scheduling co-processor based on data-flow principles, Fut. Gener. Comput. Syst. 53 (2015) 100–108.
[33] R. Giorgi, Teraflux: exploiting dataflow parallelism in teradevices, in: Proceedings of the 9th conference on Computing Frontiers, ACM, 2012, pp. 303–304.
[34] UDOO, UDOOX86: The Most Powerful Maker Board Ever, (accessed May 17, 2017). [Online]. Available: https://www.kickstarter . (2016).
[35] A. Rizzo, G. Burresi, F. Montefoschi, R. Giorgi, Making iot with UDOO, Interact. Des. Arch. (2016) 95–112.
[36] R. Giorgi, Scalable embedded systems: Towards the convergence of high-performance and embedded computing, in: Embedded and Ubiquitous Computing (EUC), 2015 IEEE 13th International Conference on, IEEE, 2015, pp. 148–153.
[37] J. Shi, J. Wan, H. Yan, H. Yan, A survey of cyber-physical systems, in: Wireless Communications and Signal Processing (WCSP), 2011 International Conference on, IEEE, 2011, pp. 1–6.
[38] L. Jozwiak, M. Lindwer, R. Corvino, P. Meloni, L. Micconi, J. Madsen, E. Diken, D. Gangadharan, R. Jordans, S. Pomata, P. Pop, G. Tuveri, L. Raffo, G. Notarangelo, ASAM: Automatic architecture synthesis and application mapping, Microprocess. Microsyst. 37 (8) (2013) 1002–1019.
[39] F. Bernier, J. Ploennigs, D. Pesch, S. Lesecq, T. Basten, M. Boubekeur, D. Denteneer, F. Oltmanns, F. Bonnard, M. Lehmann, T.L. Mai, A.M. Gibney, S. Rea, F.A.o. Pacull, C. Guyon-Gardeux, L.-F. Ducreux, S. Thior, J. Verriet, M. Hendriks, S. Fedor, Architecture for self-organizing, co-operative and robust building automation systems, in: Industrial Electronics Society, IECON 2013-39th Annual Conference of the IEEE, IEEE, 2013, pp. 7708–7713.
 
Dimitris Theodoropoulos obtained his Diploma (5-year degree) and M.Sc degree respectively from the Electronic and Computer Engineering department at the Technical University of Crete, Greece. In 2007, he joined the Computer Engineering department of the Delft University of Technology, the Netherlands, where he received his PhD. In 2011, he joined the Computer Architecture and VLSI Systems group at the Foundation for Research and Technology - Hellas (FORTH) in Greece, where he is working as a post-doc researcher for national and international research projects. His research interests are in the domains of Embedded Systems, Computer Architecture, and Reconfigurable computing.
Somnath Mazumdar is a PhD student at the Department of Information Engineering, University of Siena, Italy. His main research interests are Dataflow-based computing, (heterogeneous) computer architectures (mainly processor architectures, interconnects), Cloud computing, resource allocation, and performance analysis. In the past, he consulted on a variety of projects, involving qualitative and quantitative analysis of Cloud, Grid computing.
Eduard Ayguadé received the Engineering degree in Telecommunications in 1986 and the Ph.D. degree in Computer Science in 1989, both from the Universitat Politècnica de Catalunya (UPC), Spain. Since 1987 Prof. Ayguadé has been lecturing at the Computer Science School (FIB) and Telecommunications Engineering (ETSETB) both in Barcelona. Currently, and since 1997, he is full professor of the Computer Architecture Department at UPC. Prof. Ayguadé has lectured a number of (undergraduate and graduate) courses related with computer organization and architecture, parallel programming models and their implementation. Prof. Ayguadé is also involved in the Computer Architecture and Technology PhD Program at UPC, where he has (co-)advised more than 20 PhD thesis, in topics related with his research interests: multicore architectures, parallel programming models and their architectural support and compilers for HPC architectures. In these research topics, Prof. Ayguadé has published more than 300 papers and participated in several research projects in the framework of the European Union and research collaborations with companies related with HPC technologies (IBM, Intel, Nvidia, Microsoft and Samsung). Currently Prof. Ayguadé is associated director for research on the Computer Sciences Department at the Barcelona Supercomputing Center (BSC-CNS), the National Center for Supercomputing in Spain located in Barcelona.
Nicola Bettin earned his B.S degree in Electronic Engineering at University of Padua and in 2011 he obtained his M.S degree in Electronic Engineering at University of Bologna. In 2012 he joined the Technology Transfer Team T3LAB, in Bologna, and co-founded the FPGA Group. He did research in the design of a standard HW/SW architecture for machine vision and developed commercial solutions for processing multimedia data stream in embedded systems. His main interests were FPGA solutions and heterogenic multi-core system-on-chip solutions. He joined the electronic R&D dept. at Vimar Group in 2015 and his research activity is mainly focused on human interaction with smart home systems.
Javier Bueno Hedo holds a PhD. degree in Computer Science from the Technical University of Catalonia (UPC). He became involved in research in 2004, when he started as a part-time student in the European Center of Parallelism of Barcelona (CEPBA) working with Software Distributed Memory Systems. In 2006 he became a full-time junior researcher at the Barcelona Supercomputing Center (BSC) and continued his work on distributed systems. From 2010 to 2015 he worked on his thesis, which provided the OmpSs programming model with support for clusters of multi-cores and clusters of GPUs. This work has also been applied to different research projects such as the Mont-Blanc2 project. His current research aims to produce new programming models and tools to ease the complexity of developing applications for modern HPC systems.
Sara Ermini has a Master’s degree in Persuasive Communication and New Media at the University of Siena (Italy) where she is now holding a research position in Scenario Based Design for the H2020 European AXIOM Project. Her research interests include IXD, UxD and Interactive Machine Learning.
Antonio Filgueras received a degree in computer science at Universitat Politècnica de Catalunya - BarcelonaTech (UPC) in 2012. He is currently working at the Programming Models group of Barcelona Supercomputing Center and participating in the AXIOM European project. His research interests are focused on heterogeneous and reconfigurable solutions for high performance computing and programmability of those.
Dr. Daniel Jiménez-González received the M.S. and Ph.D. degrees in Computer Science from the Technical University of Catalunya (UPC) in 1997 and 2004, respectively. He currently holds a position as Tenured Assistant Professor in the Computer Architecture Department at UPC, BarcelonaTech, and is an associated researcher at the Computer Sciences-Programming Models Department at BSC-CNS. His research interests cover the areas of parallel architectures, runtime systems, compilers and reconfigurable solutions for high-performance multiprocessor systems. Dr. Jiménez-González has coauthored more than 40 publications in international journals and conferences. He is currently co-advising 1 PhD student and has co-advised 2 PhD students. He has been participating in the HiPEAC Network of Excellence and in the SARC, ACOTES, TERAFLUX, AXIOM and PRACE European projects.
Carlos Álvarez received the M.S. and Ph.D. degrees in Computer Science from the Technical University of Catalunya (UPC) in 1998 and 2007, respectively. He currently holds a position as Tenured Assistant Professor in the Computer Architecture Department at UPC, BarcelonaTech, and is an associated researcher at the Computer Sciences-Programming Models Department at BSC-CNS. His research interests cover the areas of parallel architectures, runtime systems and reconfigurable solutions for high-performance multiprocessor systems. He has coauthored more than 40 publications in international journals and conferences. He is currently advising 1 PhD student and has co-advised 2 PhD theses. He has been participating in the HiPEAC Network of Excellence and in the TERAFLUX and AXIOM European projects.
Xavier Martorell received the M.S. and Ph.D. degrees in Computer Science from the Technical University of Catalunya (UPC) in 1991 and 1999, respectively. Since 1992 he has lectured on operating systems, parallel runtime systems and OS administration. He has been an associate professor in the Computer Architecture Department at UPC since 2001. His research interests cover the areas of operating systems, runtime systems, compilers and applications for high-performance multiprocessor systems. Dr. Martorell has participated in several long-term research projects with other universities and industries, primarily in the framework of the European Union ESPRIT, IST and FET programs. He spent one year working with the BG/L team in the IBM Watson Research Center. He has coauthored more than 60 publications in international journals and conferences. He has co-advised three Ph.D. theses and he is currently advising 3 PhD students. He is currently the Manager of the Parallel Programming Models team at the Barcelona Supercomputing Center. He has been participating in the HiPEAC Network of Excellence and in the SARC, ACOTES, and Intone, POP, ENCORE, MontBlanc (I and II), DEEP/DEEP-ER and the AXIOM European projects.
Francesco Montefoschi obtained his B.S. degree in Computer Science Engineering at Università degli Studi di Siena in 2015. He worked as software developer in the passenger transportation domain at Allbus from 2009 to 2015. He currently holds a research position at Università degli Studi di Siena on the H2020 AXIOM Project. He is also part of the UDOO Team, porting and optimizing the supported operating systems (Ubuntu and Android) for the UDOO boards. His interests include embedded systems, machine learning, hardware hacking, do-it-yourself, open source software, web and mobile application development.
David Oro received the B.S. and M.S. degrees in Computer Science from Universitat Politècnica de Catalunya (UPC) in 2006, and an M.S. degree in Computer Architecture in 2011, also from UPC. He started his professional career in 2005 working as a consultant in performance monitoring solutions. In 2009, he joined the Barcelona Digital Technology Centre where he held a research position on online banking cybercrime mitigation for CaixaBank. Currently, he works for Herta Security leading the GPU parallelization of several products. He has published several papers in international peer-reviewed conferences and holds two patents. His research interests include computer architecture, GPU computing and malware analysis.
Dionisios Pnevmatikatos is a Professor and former Chair of the Electronic and Computing Engineering Department, Technical University of Crete and a Researcher at the Computer Architecture and VLSI Systems (CARV) Laboratory of the Institute of Computer Science, FORTH in Greece. He received his B.Sc. degree in Computer Science from the Department of Computer Science, University of Crete in 1989 and M.Sc. and Ph.D. degrees in Computer Science from the Department of Computer Science, University of Wisconsin-Madison in 1991 and 1995 respectively. His research interests are in the broader area of Computer Architecture, where he investigates the Design and Implementation of High-Performance and Cost-Effective Systems, Reliable System Design, and Reconfigurable Computing.
Antonio Rizzo, Full Professor of Interaction Design, Università di Siena and Co-founder UDOO (Present). Director Academy of Digital Arts and Science - ArsNova (2000–2009). Chair of the European Association of Cognitive Ergonomics (2000–2006). Member of WG30 NATO Human Factors and Human Reliability Group (1999–2002). Member of the Programme Incitatif de Recherche sur l’Education et la Formation (PIREF) of the French Government (2002–2003). Head of the Human Factor Group of the Italian National Railways (1996–1999). Liaison for Apple Inc. for the Apple Design Project (1996–1997).
Dr. Paolo Gai, CEO, graduated (cum laude) in Computer Engineering at University of Pisa in 2000 with a graduation thesis developed at the ReTiS Laboratory of the Scuola Superiore Sant’Anna on the development of the modular real-time kernel SHaRK. He obtained the PhD from Scuola Superiore Sant’Anna in 2004. In 2000, he founded the ERIKA Enterprise project, an open-source RTOS which recently reached the OSEK/VDX certification, and which is currently used by various industries and universities. Since 2002 he is CEO and founder of Evidence Srl, a SME working on operating systems and code generation for Linux- and ERIKA-based industrial products in the automotive and white goods market. Since 2011 he is President and founder of SSG Srl, providing hardware turnkey solutions for the white goods market. His research interests include development of hard real-time architectures for embedded control systems, multi-processor systems, object-oriented programming, real-time operating systems, scheduling algorithms and multimedia applications.
Stefano Garzarella is a Software Engineer with a strong experience in kernel and user space programming, especially in the virtualization and networking fields. He has been a FreeBSD and Linux developer since 2013. He graduated (summa cum laude) in Computer Engineering at the University of Pisa in 2014 for a thesis entitled “Optimization of the FreeBSD network stack for high-speed networks”. After the thesis, he collaborated with Prof. Luigi Rizzo for developing a virtual netmap passthrough for VMs (ptnetmap) and other improvements for netmap (a framework for high-speed packet I/O). He successfully participated at the Google Summer of Code 2015 to develop a FreeBSD support for ptnetmap. Currently, he works for Evidence SRL on Linux kernel, networking, and virtualization in embedded systems.
Dr. Bruno Morelli received a Master Degree in Computer Engineering at University of Pisa in 2007. He is an expert of the whole Linux ecosystem, from the development of Linux kernel drivers to the integration and modification of libraries and development tools. He has a good knowledge of C/C++ languages, bash scripting, ARM and x86 assembler. He has worked as Linux Software Engineer at Evidence Srl since 2007.
Alberto A. Pomella, Electronics & Software R&D Manager of Vimar S.p.A., Standalone and Home and Building Automation products (present). R&D Director at CRS (2001–2003), Home automation Products; UX and embedded PC development group at SELCA S.p.A. (1992–2001); Project Validation Group for consumer PC at ASEM (1991–1992). Degree in Electronic Engineering from Politecnico of Torino in 1990, with specialization in software development and industrial automation.
Roberto Giorgi is an Associate Professor at Department of Information Engineering, University of Siena, Italy. He was Research Associate at the University of Alabama in Huntsville, USA. He received his PhD in Computer Engineering and his Master in Electronics Engineering, Summa cum Laude both from University of Pisa, Italy. He is the coordinator of the European Project AXIOM. He coordinated the TERAFLUX project in the area of Future and Emerging Technologies for Teradevice Computing. He is participating in the European projects HiPEAC (High Performance Embedded-system Architecture and Compiler), ERA (Embedded Reconfigurable Architectures). He contributed to SARC (Scalable ARChitectures), ChARM (performance evaluation of ARM-processor based embedded systems). His current interests include Computer Architecture themes such as Embedded Systems, Multiprocessors, Memory System Performance, Workload Characterization.
