The AXIOM Software Layers by Alvarez, C. et al.
Microprocessors and Microsystems 47 (2016) 262–277 
The AXIOM software layers 
Carlos Álvarez a,∗, Eduard Ayguadé a, Jaume Bosch a, Javier Bueno a, Artem Cherkashin a,
Antonio Filgueras a, Daniel Jiménez-González a, Xavier Martorell a, Nacho Navarro a,
Miquel Vidal a, Dimitris Theodoropoulos b, Dionisios N. Pnevmatikatos b, Davide Catani c,
David Oro d, Carles Fernández d, Carlos Segura d, Javier Rodríguez d, Javier Hernando e,
Claudio Scordino f, Paolo Gai f, Pierluigi Passera g, Alberto Pomella g, Nicola Bettin g,
Antonio Rizzo h, Roberto Giorgi h
a Barcelona Supercomputing Center and Computer Architecture Dept., Universitat Politecnica de Catalunya, Barcelona, Spain 
b FORTH-ICS, Greece 
c SECO, Arezzo, Italy 
d Herta Security, Barcelona, Spain 
e Universitat Politecnica de Catalunya, Barcelona, Spain 
f Evidence Srl, Pisa, Italy 
g VIMAR SpA, Marostica, Italy 
h University of Siena, Siena, Italy 
Article info
Article history: 
Received 18 January 2016 
Revised 1 June 2016 
Accepted 7 July 2016 
Available online 9 July 2016 
Keywords: 
Cyber-physical systems 
OmpSs
Cluster programming 
FPGA Programming 
Distributed shared memory 
Smart home 
Smart video-surveillance 
Abstract
People and objects will soon share the same digital network for information exchange, in what has been named the age of cyber-physical systems. The general expectation is that people and systems will interact in real time. This puts pressure on system design to support increasing demands on computational power while keeping a low power envelope. Additionally, modular scaling and easy programmability are important to ensure that these systems become widespread. The whole set of expectations imposes scientific and technological challenges that need to be properly addressed.
The AXIOM project (Agile, eXtensible, fast I/O Module) will research new hardware/software architectures for cyber-physical systems to meet such expectations. The technical approach aims at solving fundamental problems to enable easy programmability of heterogeneous multi-core multi-board systems. AXIOM proposes the use of the task-based OmpSs programming model, leveraging low-level communication interfaces provided by the hardware. Modular scalability will be possible thanks to a fast interconnect embedded into each module. To this aim, an innovative ARM and FPGA-based board will be designed, with enhanced capabilities for interfacing with the physical world. Its effectiveness will be demonstrated with key scenarios such as Smart Video-Surveillance and Smart Living/Home (domotics).
© 2016 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
∗ Corresponding author.
E-mail addresses: carlos.alvarez@bsc.es (C. Álvarez), eduard.ayguade@bsc.es (E. Ayguadé), jaume.bosch@bsc.es (J. Bosch), javier.bueno@bsc.es (J. Bueno), artem.cherkashin@bsc.es (A. Cherkashin), antonio.filgueras@bsc.es (A. Filgueras), daniel.jimenez@bsc.es (D. Jiménez-González), xavier.martorell@bsc.es (X. Martorell), nacho.navarro@bsc.es (N. Navarro), miquel.vidal@bsc.es (M. Vidal), dtheodor@ics.forth.gr (D. Theodoropoulos), pnevmati@ics.forth.gr (D.N. Pnevmatikatos), davide.catani@seco.com (D. Catani), david.oro@hertasecurity.com (D. Oro), carles.fernandez@hertasecurity.com (C. Fernández), cseguramail@gmail.com (C. Segura), javier.rodriguez@hertasecurity.com (J. Rodríguez), javier.hernando@upc.edu (J. Hernando), claudio@evidence.eu.com (C. Scordino), pj@evidence.eu.com (P. Gai), pierluigi.passera@vimar.com (P. Passera), alberto.pomella@vimar.com (A. Pomella), nicola.bettin@vimar.com (N. Bettin), antonioriz@gmail.com (A. Rizzo), giorgi@dii.unisi.it (R. Giorgi).
http://dx.doi.org/10.1016/j.micpro.2016.07.002
0141-9331/© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
We are entering the Cyber-Physical age, in which both objects and people will become nodes of the same digital network for exchanging information. Therefore, the expectation is that “things” or systems will become somewhat as smart as people, having to permit a rapid and close interaction not only human-human and system-system, but also human-system and system-human. More scientifically, we expect that such Cyber-Physical Systems (CPS) will at least react in real time, provide enough computational power for the assigned tasks, consume the least possible energy for such tasks (energy efficiency), allow for easy programmability, scale through modularity, and make the best use of existing standards at minimal cost.
Fig. 1. The AXIOM Software Layers.
The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new hardware/software architectures for CPSs in which the above expectations are realized. The project started in February 2015 and will span 3 years. The project is coordinated by the University of Siena (UNISI), which also leads the evaluation part of the project. The Foundation for Research and Technology - Hellas (FORTH) develops the interconnection between boards. Barcelona Supercomputing Center (BSC) is responsible for the OmpSs (OpenMP+StarSs) programming model and software toolchain. Partner EVIDENCE takes the lead on the development of the runtime systems. Partner SECO designs and builds the prototype board. Partner HERTA Security provides a video-surveillance use case. Partner VIMAR provides a smart-building use case.
Fig. 1 shows the software layers used in this project. As can be seen, the project addresses all the levels of the system, from the application level, which includes two key application domains, to the hardware level. That includes developing a specific runtime software manager (OmpSs@FPGA), a fast interconnection link (Fast Link) and even the AXIOM board itself. As can also be seen in Fig. 1, the project aims to develop a board that can work well both alone and as part of a larger system (i.e. a group of boards interconnected by the AXIOM link). These modular capabilities are addressed from both the hardware side (the implementation of the AXIOM link) and the software side (the development of inter-node execution capabilities using the OmpSs programming model). From the hardware point of view, one of the aims of the project is to make the board accessible in terms of cost (as cheap as possible, even around one hundred euros) while making it powerful enough to deal with the envisioned use cases. This holistic development is what we call the AXIOM platform.
The speciﬁc objectives of the AXIOM project are: 
• Realizing a small board that is ﬂexible (suitable for a wide
range of applications), energy eﬃcient and modularly scalable
(AXIOM Board in Fig. 1 ). We will use an ARM- and FPGA-based
chip with custom high-speed interconnects to build the AXIOM
prototype board. 
• Easy programmability of multi-core, multi-board, FPGA nodes,
with the OmpSs programming model (OmpSs@Cluster/OmpSs
over DSM, and OmpSs@FPGA in Fig. 1), and improved thread
management and real-time support from the operating system.
The software will be Open-Source.
• Easy interfacing with the Cyber-Physical world, based on
Arduino shields [1,2], pluggable onto the board. These shields will
allow the developed board to be extended with sensors (e.g. a camera).
They will provide new functionalities to the developed board to
widen the scope of its applications.
• Contribute to standards, in the context of the Standardization
Group for Embedded Systems (SGET) and OpenMP. 
The rest of the paper is organized as follows. Section 2 explains the AXIOM software layers. Section 3 explains the AXIOM link development. Section 4 explains the applications evaluated and the expected scenarios. Section 5 explains the experimental setup, followed by Section 6, which presents the first results obtained by the project. Section 7 explains the related work. Finally, Section 8 summarizes the conclusions and the envisioned future work.
2. The AXIOM software
One of the problems when building a complex ecosystem like the one described in Fig. 1 is how to easily program applications that should take advantage at the same time of both on-chip resources (i.e. the FPGA and the multiple cores) and multiple board resources (through the fast link connecting multiple boards).
Several solutions have been proposed during the last decades to parallelize computations on multi-core systems. However, no unanimous consensus on the best solution has been achieved. On one hand, some solutions are based on message-passing mechanisms (e.g., MPI), which are usually considered too difficult to use for developers not accustomed to parallel programming. For example, parallelizing existing legacy serial codes, like face detection, audio processing or search algorithms, with MPI requires extensive code rewriting to add the communication primitives and synchronization needed. Usually this means rewriting the full application at once to take advantage of the cluster. Instead, models targeting SMPs are usually based on code annotations, which allow introducing fewer changes in the original code and working incrementally on the different parts of the application, which can be tested much earlier than when using message passing.
Another possibility that is going to be explored in this project is the use of a DSM system. Distributed shared memory (DSM) is a form of memory architecture where physically separate memories can be addressed as one logically shared address space. The main advantage of this memory organization is that it can be easily programmed, as the program can access all the available memory regardless of its real physical location; the DSM support (probably integrated with the OS) is responsible for managing the communication. On the other hand, this management, when not properly handled, can lead to unnecessary or inefficient communication patterns.
AXIOM will leverage OmpSs, a task dataflow programming model that includes heterogeneous execution support as well as data and task dependency management [3] and has significantly influenced the recently appeared OpenMP 4.0 specification.
2.1. The OmpSs programming model
In OmpSs, tasks are generated in the context of a team of threads that run in parallel. OmpSs provides an initial team of threads as specified by the user upon starting the application.
Tasks are defined as portions of code enclosed in the task directive, or as user-defined functions, also annotated as tasks, as follows:
#pragma omp task [clause-list]
{ structured-work |
function-declaration |
function-definition }
Fig. 2. General view of OmpSs@FPGA and OmpSs@Cluster execution context.
Fig. 3. General view of OmpSs over a DSM system. 
Fig. 4. OmpSs@FPGA ecosystem compilation ﬂow. 
 
A task is created when the code reaches the task construct, or a call is made to a function annotated as a task. The task construct allows specifying, among others, the clauses in, out and inout. Their syntax is:
in(data-reference-list)
out(data-reference-list)
inout(data-reference-list)
The information provided is used to derive dependencies
among tasks at runtime, and schedule/ﬁre a task. Tasks are ﬁred
when their inputs are ready and their outputs can be generated. 
Dependencies are expressed by means of data-reference-lists. A data-reference in such a list can contain either a single variable identifier or references to subobjects. References to subobjects include array element references (e.g., a[4]), array sections (a[3:6]), field references (a.b), and elaborated shaping expressions ([10][20] p). The latter means the rectangular area starting at address p, with a shape of 10 rows and 20 columns.
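As an illustration of the clauses described above, the following minimal sketch annotates two C functions as tasks. The function names (fill, accumulate, run_example) and the array size are hypothetical and chosen only for this example; the shaping expressions follow the form described in the text. With a compiler that does not understand OmpSs, the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>
#include <stddef.h>

#define N 8

/* Task that only produces data: it writes the n elements starting at v. */
#pragma omp task out([n]v)
void fill(int *v, size_t n, int value)
{
    assert(v != NULL);
    for (size_t i = 0; i < n; i++)
        v[i] = value;
}

/* Task that reads x and updates y in place (y[i] += x[i]). */
#pragma omp task in([n]x) inout([n]y)
void accumulate(const int *x, int *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] += x[i];
}

/* Creates three tasks and returns the sum of b once they have finished. */
int run_example(void)
{
    int a[N], b[N];
    int sum = 0;

    fill(a, N, 1);        /* task 1: produces a                        */
    fill(b, N, 2);        /* task 2: produces b, independent of task 1 */
    accumulate(a, b, N);  /* task 3: needs the outputs of tasks 1, 2   */

    /* Wait for all tasks before this thread reads b. */
#pragma omp taskwait
    for (size_t i = 0; i < N; i++)
        sum += b[i];
    return sum;
}
```

In this sketch the runtime could run the two fill tasks in parallel, while accumulate is held back until both of its data references have been produced.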
OmpSs is based on two main components: i) The Mercurium compiler gets C/C++ and FORTRAN code, annotated with the task directives presented above, and transforms the sequential code into parallel code with calls to the Nanos++ runtime system; and ii) The Nanos++ runtime system gets the information generated by the compiler about the parallel tasks to be run, manages the task dependences and schedules the tasks on the available resources when they are ready. Nanos++ supports the execution of tasks in remote nodes and heterogeneous accelerators.
At the lower level, the AXIOM project will investigate and im-
plement the OmpSs programming model on top of the following
intra- and inter-node technologies: 
• Intra-node: The most important target here is FPGA pro-
grammability support. 
- OmpSs@FPGA, for easily exploiting FPGA acceleration;
• Inter-node: In this case two different approaches can be ad-
dressed based on the performance requirement, although they
can be integrated in the same scenarios, to work with different
memory address spaces. 
- OmpSs@cluster, for eﬃcient parallel programming hiding
message-passing complexities; 
- OmpSs on a DSM-like paradigm, for easy parallelization of
legacy code. 
Fig. 2 shows the overall view of the OmpSs@FPGA and OmpSs@cluster execution context in a multi-board system. Each FPGA-based node will be addressed by the OmpSs@FPGA support, while OmpSs@cluster will help to transparently program the whole multi-node system.
Fig. 3 shows the overall view of a DSM system where OmpSs@FPGA would have the same intra-node influence and OmpSs@cluster will appear like a single intra-node OmpSs running over a transparent DSM system.
2.2. OmpSs@FPGA
The OmpSs@FPGA ecosystem consists of the infrastructure for compilation, instrumentation and execution, from source code written in C/C++ to ARM binary and FPGA bitstream for Zynq. The compilation infrastructure provides support to (1) generate ARM binary code from OmpSs code that can run in the ARM-based SMP of the Zynq SoC, (2) extract the kernel of the part of the application to be accelerated into the FPGA and (3) automatically generate a bitstream that includes the IP cores of the accelerator(s), the DMA engine IPs, and the necessary interconnection. In addition, the ARM binary can be instrumented to generate traces to be analyzed offline with the Paraver tool [4].
The runtime infrastructure should allow heterogeneous tasking on any combination of SMPs and accelerators, depending on the availability of the resources and the target devices.
Fig. 4 shows the high level compilation flow using our OmpSs@FPGA ecosystem. The OmpSs code is passed through the source-to-source compiler Mercurium [5], which includes a specialized FPGA compilation phase to process annotated FPGA tasks. For each of those tasks, it generates two C codes. One of them is a Vivado HLS (source to HDL Xilinx tool) annotated code for the bitstream generation branch (“accelerator codes” box in Fig. 4). The other is an intermediate host source code with OmpSs runtime (Nanos++) calls that is generated for the software generation branch (“Host C code + Nanos++ runtime call” box in Fig. 4). Both the hardware and the software generation branches are transparent to the programmer.
Fig. 5. OmpSs directives on matrix multiplication.
Fig. 6. Nanos++ distributed memory management organization.
Fig. 5 shows a matrix multiply example that has been annotated with OmpSs directives. This code shows a parallel tiled matrix multiply where each of the tiles is a task. Each of those tasks has two input dependences and an inout dependence that will be managed at runtime by Nanos++. Those tasks can be scheduled/fired to an SMP or FPGA, as annotated in the target device directive, depending on the resource availability. The copy_deps clause associated with the target directive hints the Nanos++ runtime to copy the data associated with the input and output dependences to/from the device when necessary.
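The code of Fig. 5 is not reproduced in this text, so the following is only a minimal sketch of what such an annotated tiled multiplication could look like. The tile size BS, the tile count NB, the function names and the exact clause arguments are assumptions made for illustration, not the paper's listing. A compiler without OmpSs support ignores the pragmas and executes the tiles sequentially.

```c
#include <assert.h>
#include <stddef.h>

#define BS 4   /* assumed tile edge: each tile holds BS x BS floats */
#define NB 2   /* assumed number of tiles per matrix dimension      */

/* One tile product c += a * b. The task has two inputs and one inout,
 * and may run on the FPGA or on an SMP core; copy_deps asks the runtime
 * to move the dependence data to/from the device when necessary. */
#pragma omp target device(fpga, smp) copy_deps
#pragma omp task in([BS * BS]a, [BS * BS]b) inout([BS * BS]c)
void matmul_tile(const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
}

/* Matrices are stored tile by tile; every tile is contiguous in memory. */
void matmul_tiled(float A[NB][NB][BS * BS],
                  float B[NB][NB][BS * BS],
                  float C[NB][NB][BS * BS])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                matmul_tile(A[i][k], B[k][j], C[i][j]); /* one task per tile product */

    /* Wait until every tile task has completed. */
#pragma omp taskwait
}
```

Tasks that update the same C tile are serialized through the inout dependence, while tasks that touch different C tiles are free to run concurrently on whichever device is available.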
2.3. OmpSs@cluster
OmpSs@cluster is the OmpSs flavor that provides support for a single address space over a cluster of SMP nodes with accelerators. In this environment, the Nanos++ runtime system supports a master-worker execution scheme. One of the nodes of the cluster acts as the master node, where the application starts. In the rest of the nodes where the application is executed, worker processes just wait for work to be provided by the master.
In this environment, the data copies generated by the in, out and inout task clauses are executed over the network connection across nodes, to bring data to the appropriate node where the tasks are to be executed.
Following the Nanos++ design, cluster threads are the components that allow the execution of tasks on worker nodes. These threads do not execute tasks themselves. They are in charge of sending work descriptors to their associated nodes and notifying when these have completed their execution. One cluster thread can take care of providing work to several worker nodes. In the current implementation, cluster threads are created only on the master node of the execution. Worker nodes cannot issue tasks for remote execution and thus they do not need to spawn cluster threads.
In Nanos++, the device-specific code has to provide specific methods to transfer data from the host address space to the device address space, and the other way around. The memory coherence model required by OmpSs is implemented by two generic subsystems, the data directory and the data cache, explained below.
Fig. 6 shows how the different Nanos++ subsystems are organized to manage the memory of the whole cluster. The master node is responsible for keeping the memory coherent with the OmpSs memory coherence model, and also for offering the OmpSs single address space view. The master node memory is what OmpSs considers the host memory or host address space, and it is the only address space exposed to the application. The memory of each worker node is treated as a private device memory and is managed by the master node.
The data cache component manages the operations needed at the master node to transfer data to and from worker memories. There is one data cache for each address space present on the system. Operations performed in a data cache include allocating memory chunks, freeing them and transferring data from their managed address spaces to the host address space and the other way around. Data caches also keep the mapping of host memory addresses to their private memory addresses. Memory transfer operations are implemented using network transfers. Allocation and free operations are handled locally at the master node.
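The bookkeeping role of a data cache can be sketched as follows. This is not the Nanos++ implementation: the structure names, the fixed number of slots and the simple bump allocation policy (freed device memory is not reused) are assumptions made only to illustrate how a mapping from host addresses to private addresses can be kept locally at the master node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16   /* assumed capacity of the mapping table */

typedef struct {
    uintptr_t host_addr;    /* address in the host (master) address space */
    uintptr_t device_addr;  /* address in the worker's private memory     */
    size_t    size;
    int       in_use;
} cache_entry;

typedef struct {
    cache_entry slot[CACHE_SLOTS];
    uintptr_t   next_free;  /* next unused address in the worker's memory */
} data_cache;

void cache_init(data_cache *c, uintptr_t device_base)
{
    assert(c != NULL && device_base != 0);
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->slot[i].in_use = 0;
    c->next_free = device_base;
}

/* Returns the private address mapped to host_addr, or 0 if there is none. */
uintptr_t cache_lookup(const data_cache *c, uintptr_t host_addr)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->slot[i].in_use && c->slot[i].host_addr == host_addr)
            return c->slot[i].device_addr;
    return 0;
}

/* Reserves a chunk for host_addr; purely local bookkeeping on the master.
 * Returns the private address, or 0 when the table is full. */
uintptr_t cache_allocate(data_cache *c, uintptr_t host_addr, size_t size)
{
    uintptr_t existing = cache_lookup(c, host_addr);
    if (existing != 0)
        return existing;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!c->slot[i].in_use) {
            c->slot[i].host_addr = host_addr;
            c->slot[i].device_addr = c->next_free;
            c->slot[i].size = size;
            c->slot[i].in_use = 1;
            c->next_free += size;
            return c->slot[i].device_addr;
        }
    }
    return 0;
}

/* Drops the mapping for host_addr, if any. */
void cache_free(data_cache *c, uintptr_t host_addr)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->slot[i].in_use && c->slot[i].host_addr == host_addr)
            c->slot[i].in_use = 0;
}
```

Only the actual data movement would need the network; the lookups, allocations and frees above touch nothing but the master's own table.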
A memory reference may have several copies of its contents on different address spaces of the system. To maintain the coherence of the memory, the master node uses the data directory. It contains the information of where the last produced values of a memory reference are located. With it, the system can determine which transfer operations must be performed to execute a task in any node of the system. Also, each task execution updates the information of the data directory to reflect the newly produced data.
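A minimal sketch of the directory logic described above is given below, assuming one entry per memory reference and a small fixed number of address spaces (node 0 being the master). The names and the policy of copying from the first node that holds a valid copy are hypothetical simplifications; in particular, the sketch ignores whether a copy between two workers has to be routed through the master.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4   /* assumed cluster size; node 0 is the master (host) */

/* Directory entry for one memory reference: which address spaces
 * currently hold the last produced value. */
typedef struct {
    bool valid[MAX_NODES];
} dir_entry;

/* Initially only the host address space holds the data. */
void dir_init(dir_entry *e)
{
    assert(e != NULL);
    for (int n = 0; n < MAX_NODES; n++)
        e->valid[n] = (n == 0);
}

/* Before a task that reads the reference runs on `node`: returns the node
 * to copy the data from, or -1 if no transfer is needed. */
int dir_source_for_read(const dir_entry *e, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    if (e->valid[node])
        return -1;                 /* up-to-date copy already present */
    for (int n = 0; n < MAX_NODES; n++)
        if (e->valid[n])
            return n;              /* copy from a node with the last value */
    return -1;                     /* no valid copy exists anywhere */
}

/* After the input transfer, `node` also holds the last value. */
void dir_after_read(dir_entry *e, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    e->valid[node] = true;
}

/* After a task on `node` has produced the reference (out or inout),
 * every other copy is stale. */
void dir_after_write(dir_entry *e, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    for (int n = 0; n < MAX_NODES; n++)
        e->valid[n] = (n == node);
}
```

For example, if a task on node 1 writes the reference and a later task on node 2 reads it, the directory reports that a transfer from node 1 is required first.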
The implementation of the network subsystem is currently based on the active messages provided by the GASNet communications library. In the context of AXIOM, we will adapt the networking layer to the communications library provided for the Zynq platform.
2.4. OmpSs on DSM-like systems
DSM is a well-known research topic, and it can be implemented either at software or at hardware level (with a full range of hybrid approaches).
We will work on the performance analysis of current DSM implementations. After that, the project will decide upon the design and development of a proper, reliable and efficient mechanism to implement a DSM-like paradigm integrated in the Linux OS. The mechanism will run on the reference platform. It will make it possible to leverage the simplicity and scalability of the OmpSs framework on top of the AXIOM platform. It will be released as Open-Source software, and it is expected to bring benefits to both the ICT and the embedded industries.
2.5. Operating system support
The operating system used in the project will be Linux. One of the advantages of using a SoC like the Zynq is that Linux can be run on the ARM cores of the platform off-the-shelf. This kind of system combines the ease of programming a standard processor like the ARM with the raw performance of the FPGA fabric, which will be used through the OmpSs programming model.
Fig. 7. The Network Interface controller structure. 
 
We will investigate the possibility of integrating features in the OS to balance the load across the nodes through the high-speed interconnection. Finding an efficient solution is an intended outcome of the project, since current solutions for load balancing in distributed systems may be expensive, too specific, or difficult to program (with paradigms such as MPI).
Particular attention will be given to scalability and latency issues, by implementing lock-free data structures. Another relevant aspect will be the need to properly manage events in real time.
The OS scheduler will be extended to enable it to distribute threads across the different nodes. The low-level thread scheduler (LLTS [6–10], discussed in Section 7) may be accelerated in hardware, by mapping its structure in the FPGA cards composing the evaluation platform. This will avoid bottlenecks from the scheduler, thus increasing the performance of parallel applications.
3. The AXIOM link 
The AXIOM platform will be built around an FPGA-based SoC, as
exempliﬁed by the Zynq platform by Xilinx. Zynq devices feature a
dual- or quad-core ARM Cortex A9 processor closely connected to
an FPGA fabric. The closeness of the connection (and hence the low
latency) and the ﬂexibility of the reconﬁgurable FPGA logic make
the combination very powerful in terms of customization. In ad-
dition, Zynq devices feature gigabit-rate transceivers that will be
used to provide ample communication bandwidth between AXIOM
nodes. 
In terms of connectivity, AXIOM, besides including classical connectivity (e.g., Internet), will also bring modularity to the next level, allowing the construction of more compute-intensive and higher-performance systems through a low-cost but scalable high-speed interconnect. This interconnect, subject of research and design during the project, will utilize relatively low-cost SATA connectors to interconnect multiple boards. Such connectivity will make it simple to build (or upgrade at a later moment) flexible and low-cost systems by re-using the same basic (small) module without the need for costly connectors and cables.
We will provide three bi-directional links per board, so that
the nodes can be connected in many different ways, ranging from
ring, to the well-established 2D-mesh/torus, and up to arbitrary
3-D topologies such as mesh/torus. The AXIOM interconnect will
have customizable parameters (such as packet size, formats, etc) if
needed by applications, further improving the eﬃciency and per-
formance. 
In AXIOM we will provide a powerful network interface (NI), implemented in the FPGA region, that will efficiently support the communication protocols needed by the applications. Besides implementing an MPI-like communication library, we will support a (distributed) Shared Memory model with support from the OmpSs programming model, the Operating System, and the Runtime. One such optimization is the efficient implementation of remote direct memory access (RDMA) and remote-write operations as basic communication primitives visible at the application level.
The AXIOM interconnection library will support two main packet types, (a) RDMA, and (b) short messages. RDMA packets will be used to (a) request large data from a remote node (RDMA requests), and (b) transmit large data (RDMA writes). Short messages will be used to exchange short data between nodes and will contain either raw data or acknowledgement packets (ACKs). To achieve a balanced and efficient utilization of the network bandwidth, we employ a packet priority transmission scheme; ACKs, RDMA writes / messages, and RDMA requests are classified with the highest, middle and lowest transmission priority respectively.
Fig. 7 illustrates the NI internal structure for inter-node communication. The “RDMA FIFOs” will be used to store descriptors for sending / receiving RDMA packets to / from remote nodes. RDMA descriptors contain the local and remote node id, the source and destination data address, and finally the payload size. The “Raw data FIFOs” will be used to store descriptors for exchanging either short data messages or ACKs. Such descriptors will contain the source and destination node id, and also encapsulate the payload data (raw data or an ACK).
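The descriptor contents and the three-level priority scheme described above can be sketched in C as follows. The field widths, the payload size and all identifiers are assumptions made for illustration; the text does not specify the actual hardware layout of the descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAW_PAYLOAD_BYTES 8   /* assumed size of a short-message payload */

typedef enum {
    PKT_ACK,
    PKT_RAW_MESSAGE,
    PKT_RDMA_WRITE,
    PKT_RDMA_REQUEST
} packet_kind;

/* Entry of the "RDMA FIFOs" (field widths are assumptions). */
typedef struct {
    uint8_t  local_node;
    uint8_t  remote_node;
    uint32_t src_addr;
    uint32_t dst_addr;
    uint32_t payload_size;
} rdma_descriptor;

/* Entry of the "Raw data FIFOs": node ids plus the encapsulated payload. */
typedef struct {
    uint8_t src_node;
    uint8_t dst_node;
    uint8_t is_ack;                       /* 1: ACK, 0: raw data */
    uint8_t payload[RAW_PAYLOAD_BYTES];
} raw_descriptor;

/* Transmission priority: 0 is highest, 2 is lowest. */
int tx_priority(packet_kind kind)
{
    switch (kind) {
    case PKT_ACK:          return 0;
    case PKT_RAW_MESSAGE:
    case PKT_RDMA_WRITE:   return 1;
    case PKT_RDMA_REQUEST: return 2;
    }
    assert(!"unknown packet kind");
    return 2;
}

/* Index of the pending packet to transmit next: highest priority first,
 * oldest first among equal priorities. Returns -1 if nothing is pending. */
int next_to_send(const packet_kind *pending, int n)
{
    int best = -1;
    assert(n == 0 || pending != NULL);
    for (int i = 0; i < n; i++)
        if (best < 0 || tx_priority(pending[i]) < tx_priority(pending[best]))
            best = i;
    return best;
}
```

With a queue holding an RDMA request, a raw message, an ACK and an RDMA write (in that order), this selection would send the ACK first, then the raw message, then the RDMA write, and the RDMA request last.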
The “NI control registers” are memory mapped registers that allow the local Zynq processing system (PS) to configure the NI PHY loopback mode or toggle local notifications when data are successfully transmitted. The “NI Status registers” are also memory mapped registers that allow the PS to monitor certain NI internal states, such as the DMA engine progress, queues status (empty, full, etc.), PHY channel and link states. In addition, when a FIFO state moves from empty to not empty the “IRQ” module raises an interrupt to inform the PS.
The hardware counters module (“HW cnt”) provides a set of counters to monitor the progress of RDMA requests and RDMA writes. Every new RDMA request / RDMA write that reads / writes a large set of data from / to a remote node is essentially served by multiple short RDMA responses (packets that fetch subsets of the requested data). Moreover, each RDMA request / RDMA write gets a unique id that is assigned to a HW counter. The latter is set to the number of RDMA responses required to transmit all data; its value is decremented each time an RDMA response is finished. The PS can access all HW counters for debugging purposes via software.
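The counter behavior described above can be modeled in software as in the sketch below. The number of counters, the response payload size passed by the caller and all function names are hypothetical; the sketch only mirrors the stated rule that a counter starts at the number of required RDMA responses and is decremented once per finished response.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_HW_COUNTERS 8   /* assumed number of counters in the module */

typedef struct {
    uint32_t remaining[NUM_HW_COUNTERS]; /* responses still outstanding */
    uint8_t  busy[NUM_HW_COUNTERS];      /* 1 while assigned to a transfer */
} hw_counters;

/* Number of RDMA responses needed to move total_bytes (ceiling division). */
uint32_t responses_needed(uint32_t total_bytes, uint32_t bytes_per_response)
{
    assert(total_bytes > 0 && bytes_per_response > 0);
    return total_bytes / bytes_per_response
         + (total_bytes % bytes_per_response != 0);
}

/* Assigns a free counter to a new RDMA request / write.
 * Returns its id, or -1 if every counter is in use. */
int counter_start(hw_counters *hc, uint32_t total_bytes,
                  uint32_t bytes_per_response)
{
    assert(hc != 0);
    for (int id = 0; id < NUM_HW_COUNTERS; id++) {
        if (!hc->busy[id]) {
            hc->busy[id] = 1;
            hc->remaining[id] = responses_needed(total_bytes,
                                                 bytes_per_response);
            return id;
        }
    }
    return -1;
}

/* Called when one RDMA response finishes.
 * Returns 1 when the whole transfer is complete, 0 otherwise. */
int counter_on_response(hw_counters *hc, int id)
{
    assert(id >= 0 && id < NUM_HW_COUNTERS && hc->busy[id]);
    hc->remaining[id]--;
    if (hc->remaining[id] == 0) {
        hc->busy[id] = 0;   /* counter can be reused by a new transfer */
        return 1;
    }
    return 0;
}
```

Under these assumptions, a 1000-byte RDMA write served by 256-byte responses would load its counter with 4 and report completion on the fourth response.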
The “DMA engine” is responsible for storing incoming payload / loading requested data to / from the required SDRAM address via the PS coherency port (ACP). The “Aurora PHY link” utilizes the Zynq MGT transceivers to serially send / receive data to / from remote nodes.
The “packetizer” assembles a complete packet that will be sent to a remote node; short messages and RDMA requests / RDMA writes are forwarded to the Aurora PHY link, while RDMA response headers are appended with the requested payload provided by the DMA engine. In contrast, the “unpacketizer” caches incoming packets. Simple messages / ACKs and RDMA response headers are eventually stored to the “Raw data FIFOs”. Trailing payload data from RDMA responses are forwarded to the DMA engine and stored to the SDRAM via the ACP, ensuring the PS data coherency.
Finally, the NI “internal controller” (NIC) orchestrates the overall module functionality.
4. Application domains and examples of use
AXIOM will be applied in two real life application domains: Video-surveillance and Smart-home. They will operate as benchmarks for assessing the potentialities and the limits of the proposed architecture. The two application domains have been chosen for the different kinds of challenges they pose to processing capabilities.
4.1. Video-surveillance
Intelligent multi-camera video surveillance is a multidisciplinary field related to computer vision, pattern recognition, signal processing, communication, embedded computing and image sensors. Smart video-surveillance has a wide variety of applications both in public and private environments, such as homeland security, crime prevention, facial marketing and traffic control, among others.
These applications are generally very computationally demanding, since they require monitoring very diverse indoor and outdoor scenes including airports, hotels or shopping malls, which usually involve highly varying environments. In many cases it is also necessary to analyze multiple camera video streams, particularly when object re-identification or tracking of individuals across cameras is required. For instance, consider a scenario where runners may be observed and recognized with different objectives: statistics, real-time detection of people that want to be video recorded during the race, TV reportage where the TV officer only has to say the name of a runner and the corresponding camera becomes operative, etc. Another crowded scenario may be the case of a large company with hundreds of employees that work in several different places/buildings: an employee A in any room requests a videoconference with a person (in any place, any building) and AXIOM, in real time, detects where this person is and requests permission to begin a room-to-room videoconference by telling that person: “A is requesting a videoconference”. Real-time recognition may also help to track emergency vehicles so that they can skip traffic jams, by analyzing the traffic camera images in real time.
The modular approach explored by AXIOM is particularly well-suited for tackling such challenging scenarios as it addresses the issues derived from their computational complexity, distributed nature, and need for synchronization among processes. Furthermore, the AXIOM platform makes it possible to execute compute-intensive tasks on ARM with FPGA processing nodes. This will enable companies such as Herta Security to deploy their real-time face recognition technology in crowded and changeable environments using multiple cameras simultaneously.
4.2. Smart-home
Smart home means buildings empowered by ICT in the context of the merging of Ubiquitous Computing and the Internet of Things: the generalization in instrumenting buildings with sensors, actuators and cyber-physical systems makes it possible to collect, filter and produce more and more information locally, to be further consolidated and managed globally according to business functions and services. A smart home is one that uses operational and IT technologies and processes to make it a better performing building: one that delivers lower operating costs, uses less energy, maximizes system and equipment lifetime value, is cyber-secured and produces measurable value for multiple stakeholders.
Major challenges in such environments concern cryptography,
self-testing and, first of all, sensor-network management. Sensor data
brings numerous computational challenges in the context of data
collection, storage, and mining. In particular, learning from data
produced by a sensor network poses several issues: sensors are
distributed; they produce a continuous flow of data, possibly at high
speeds; they act in dynamic, time-changing environments; the number of
sensors can be very large and dynamic. These issues require the design
of efficient solutions for processing data produced by sensor
networks.
AXIOM can help with preventive and interactive maintenance of
infrastructures, and with climate and temperature management. This
management can be remotely controlled, helping to improve energy
efficiency in homes, apartments and company office buildings. For
instance, AXIOM may detect patterns of behavior in a company office
building to adapt climate control and light switching to the working
habits of the employees.
The two application domains also pose common challenges, such as
board-to-board communication and easy programmability. Furthermore,
the two scenarios shown can easily converge, offering opportunities
for synergies and emerging services in the respective domains.
4.3. Examples of use
We are currently considering a wide range of potential uses both for
Video-surveillance and for Smart-home. They range from dynamic retail
demand forecasting in train/bus stations to smart marketing in
shopping malls for Video-surveillance, and from smart home comfort to
autonomous drones for infrastructure control for Smart-home. Below we
give a taste of the scenarios, with the goals expressed in terms of
the final users of the enabling technology. A discussion of another
scenario from our scenario exploration, related to vehicle detection,
can be found in another paper [11]. At the same time, these goals
should match the challenges posed to the AXIOM processing
capabilities:
• Dynamic retail demand forecasting. Due to the high fluctuation of
passengers departing and arriving at train stations, demand for
station retailers varies strongly over time. By forecasting such
demand through video analysis, better services can be provided
through more efficient staff utilization. The purpose of this
scenario is to provide retailers with a real-time forecast of
potential customers arriving at their outlets, to allow for better
task allocation and to increase business efficiency.
• Smart marketing in shopping mall. Consumer behavior in a shopping
mall can be very eclectic, yet awareness of patterns of behavior can
help both service providers and clients to meet their respective
goals. Demographic analysis is carried out over the captured facial
snapshots, helping to identify interesting facts such as the
demographic profiles of the customers, or how they are distributed
across gender and age segments. The visitors are tracked from one
camera to another, so as to discover the main paths they take through
the mall and how long they stay at different locations. The goal is
to collect statistical information about the visitors in order to
define marketing strategies both for service providers and for
clients.
• Smart home comfort. Comfort perception and needs can differ with
the time of the day or week and with the characteristics of the
people occupying the space at that moment. The smart home is required
to identify and manage the different situations, and to react to
people's indications in an easy and smooth way. Networked sensors and
actuators are distributed in each room, embedded in ordinary
appliances. The appliances perform their primary normal function, but
also collect different kinds of information, ranging from presence
detection, temperature, humidity, window and door opening, and air
quality to audio. The objective of the smart home comfort autopilot
is to minimize power consumption and to guarantee people's comfort
and well-being, without giving the impression of reducing people's
freedom and capacity of control.
• Autonomous rover/drone for infrastructure control. Preventive
maintenance is performed on equipment to keep it running smoothly and
efficiently and to help extend its life. Many types of equipment
should be put on a preventive maintenance program: HVAC systems,
pumps and air compressors, air conditioning, chillers and absorption
equipment, elevators, safety showers, back-flow preventers, building
exteriors, roofs, windows, fire doors and generators. Autonomous
rovers and drones furnished with thermal cameras and ambient sensors
can move inside and outside a building monitoring the energy flow,
providing data for multi-level energy flow models that can be used
for preventive maintenance. The goal is to keep the building
infrastructure efficient, manage operating costs, and minimize
potential downtime. It also ensures these components perform within
their originally designed operating parameters, allowing data center
managers the opportunity to replace components before they fail.
Table 1
Total number of additional lines of code compared to a baseline C
implementation.
Application   Pthread   Accel   OmpSs
Cholesky           26      71       3
Covariance         29      94       3
MxM 64x64          39      95       3
MxM 32x32          39      95       3
The software approach explored by AXIOM is particularly well-
suited for tackling such challenging scenarios, as it addresses the
issues derived from their computational complexity, distributed
nature, and need for synchronization among processes. Moreover,
we are considering some representative benchmarks to test drive
the design of the software stack that two partners already explored
in the ERA project [12,13].
Finally, it is worth mentioning that this project does not address
the problem of maintaining or securing “sensitive” data. In principle
AXIOM is not collecting sensitive information, as per the definition
of sensitive information provided by EU Directive 95/46/EC. However,
according to the approved Commission Proposals on the data protection
reform, biometric data has to be considered sensitive by default. This
Regulation shall apply from 25 May 2018, but the project has
considered from the beginning that the biometric data collected
deserve the highest protection, at the same level as data revealing
racial or ethnic origin, political opinions, or religious or
philosophical beliefs. Accordingly, procedures compliant with national
and EU legislation are followed to deal with data collection, storage,
protection, retention, destruction and confirmation.
Regarding the software developed in the presented scenarios, it will
rely on the Linux OS security layers already developed. As a fully
Linux-compliant architecture, the AXIOM architecture supports the
technical means to guarantee different privacy levels to protect
access to “sensitive” plain data. Such data will also be archived and
distributed following national and EU legislation.
5. Experimental setup 
In the first year of the AXIOM project we want to properly evaluate
the potential of the proposed hardware/software platform to achieve
the following goals:
• Easy programmability of multi-core, multi-board, FPGA nodes using
the OmpSs programming model.
• Reasonable performance and improved energy efficiency compared
against state-of-the-art systems.
5.1. Benchmarks description 
Three benchmarks have been used for the analysis of easy
programmability when using the OmpSs@FPGA infrastructure: (1) Cholesky
matrix decomposition, working on a dense matrix of 64 × 64
double-precision complex numbers; (2) Covariance, working on arrays of
32-bit integer complex numbers; and (3) Matrix multiplication, working
on a matrix of single precision floating point values (32 × 32 and
64 × 64 sizes). On the other hand, for performance results the same
matrix multiplication has been used with a larger 2048 × 2048 matrix
and different block sizes.
5.2. Hardware and software
To perform the FPGA experiments shown in this article we have used a
Zynq 706 board. The board includes a Zynq 7045 with 2 ARM cores
running at 800MHz and an FPGA that runs at 200MHz and features 350K
logic cells, 19.1Mb of block RAM and 900 DSP slices. The SoC was
released in 2012 and uses 28nm technology. Timing of the applications
has been obtained by instrumenting with gettimeofday the part of the
code that calls the kernel code several times, while the power
consumption was computed using the tools provided by Xilinx.
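The timing method described above can be sketched in C as follows. This is only an illustration under stated assumptions: the helper names (now_us, sample_region, time_region_us, average_elapsed_us) are hypothetical and do not come from the AXIOM sources, and sample_region merely stands in for the part of the code that calls the kernel several times. The averaging here is done inside one process, whereas the text reports the average over 3 application executions.

```c
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in microseconds, read with gettimeofday. */
double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1e6 + (double)tv.tv_usec;
}

/* Hypothetical stand-in for the instrumented region: the real code
 * would call the benchmark kernel several times here. */
void sample_region(void)
{
    volatile double sink = 0.0;
    for (int i = 0; i < 100000; i++)
        sink += (double)i;
}

/* Elapsed time of one execution of `region`, in microseconds. */
double time_region_us(void (*region)(void))
{
    double start = now_us();
    region();
    return now_us() - start;
}

/* Mean elapsed time over `runs` executions of `region`. */
double average_elapsed_us(void (*region)(void), int runs)
{
    double total = 0.0;
    for (int r = 0; r < runs; r++)
        total += time_region_us(region);
    return total / runs;
}
```

Calling average_elapsed_us(sample_region, 3) would return the mean time of three runs of the stand-in region; only the region between the two gettimeofday reads is measured, so setup code stays outside the timing.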
The OmpSs implementation is based on Mercurium 1.99.4 and Nanos++
0.8. For the hardware compilation branch we have used the Xilinx ISE
Design 14.7 and the Vivado HLS 2013.2 tools. The #pragma HLS pipeline
II=1 was used to parallelize the second loop of the matrix
multiplication. All OmpSs codes have been compiled with the
arm-xilinx-linux-gnueabi-g++ (Sourcery CodeBench Lite 2011.09-50)
4.6.1 and arm-xilinx-linux-gnueabi-gcc (Sourcery CodeBench Lite
2011.09-50) 4.6.1 compilers, with the -O3 optimization flag. The OmpSs
runtime used a preliminary AXIOM prototype of the NI interface.
Results show the average elapsed execution time of 3 application
executions.
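As an illustration of where such a pipeline directive might sit, the following C sketch places #pragma HLS pipeline II=1 in the body of the second loop of a block matrix multiplication. The function name mxm_block, the loop order and the accumulate-into-c convention are assumptions made for this sketch, not the authors' kernel. An ordinary C compiler ignores the HLS pragma, so the function can also be run in software to check its result.

```c
#define BS 64   /* block size; an assumption for this sketch */

/* c += a * b for one BS x BS block of single-precision values. */
void mxm_block(float a[BS][BS], float b[BS][BS], float c[BS][BS])
{
    for (int i = 0; i < BS; i++) {
        for (int j = 0; j < BS; j++) {
/* Pipelining the second loop asks the HLS tool to start a new (i, j)
 * iteration every cycle, which requires unrolling the inner k loop. */
#pragma HLS pipeline II=1
            float sum = c[i][j];
            for (int k = 0; k < BS; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}
```

Whether an initiation interval of 1 is actually reached depends on the memory ports and resources available to the tool, which is consistent with the resource limits discussed in the performance results.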
The machine used to obtain the GPP reference results was an i5-3470
with 4 cores running at 3.20GHz. The processor was selected as it was
released in Q2'12, close to the release date of the Zynq 7045, and
uses a 22nm technology. As with the ARM codes, timing was measured
with gettimeofday and power was measured by directly reading the
processor hardware registers. Codes were compiled with gcc version
5.2.1 using the -O3 optimization flag and MKL version 11.2.3.
6. Results
We have coded a set of benchmarks on the Zynq platform and carried
out an initial evaluation of the programmability cost in terms of
number of lines of code, as a measure of programming complexity.
6.1. Programmability analysis
In order to have a good programmability analysis we have implemented
four different versions of each benchmark code: sequential code,
pthread code, FPGA-accelerated code and OmpSs code. All versions of
the codes consider the full Matrix Multiply, the full Cholesky, and
the full Covariance as tasks.
To highlight the ease of programming of our proposal, Table 1 shows
the total number of additional lines of code for each of the different
versions of the applications, compared to the sequential version: a
pthread version only running tasks in one or two ARM cores (Pthread),
a sequential version using one or two hardware accelerators (Accel),
and the OmpSs version (OmpSs).
Fig. 8. Elapsed-time: 1/2 FPGA accelerators, up to 256 × 256.
Fig. 9. Elapsed-time: FPGA MxM versus SMP MxM (MKL).
Fig. 10. Energy consumption: FPGA MxM versus SMP MxM (MKL).
The Pthread and Accel versions require more additional lines than
the OmpSs version. The overhead is especially high in the sequential
versions using the hardware accelerators (71 to 95 additional lines,
against 26 to 39 for Pthread and 3 for OmpSs). For the Pthread version
this is due to the additional calls to the Pthreads library, in order
to create, manage and join the pthreads. For the Accel version, this
is because the application needs to call the low-level infrastructure
to set up the communications layer with the FPGA and perform the
actual data transfers back and forth to the FPGA hardware.
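To give a feeling for the kind of boilerplate counted in the Pthread column, the following C sketch runs a kernel on two independent blocks, one thread per block. It is a minimal illustration under assumptions: kernel is a hypothetical stand-in (it just doubles each element), and the names task_args, task_entry and run_two_tasks are invented for this sketch rather than taken from the benchmark codes.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 2   /* the Pthread version uses one or two ARM cores */

/* Arguments for one task; a pthread receives a single void pointer. */
typedef struct {
    float *data;
    size_t n;
} task_args;

/* Hypothetical stand-in kernel: doubles every element of its block. */
void kernel(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* Thread entry point: unpack the arguments and run the kernel. */
void *task_entry(void *p)
{
    task_args *args = (task_args *)p;
    kernel(args->data, args->n);
    return NULL;
}

/* Run the kernel on two independent blocks, one thread per block.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_two_tasks(float *block0, float *block1, size_t n)
{
    pthread_t tid[NUM_THREADS];
    task_args args[NUM_THREADS] = { { block0, n }, { block1, n } };
    int created = 0;
    int status = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        if (pthread_create(&tid[t], NULL, task_entry, &args[t]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    /* Join only the threads that were actually created. */
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    return status;
}
```

Everything except kernel itself (the argument struct, the entry wrapper, creation, error handling and joining) is management code that the sequential version does not need, which is the cost the text attributes to the Pthread version.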
On the other hand, in the case of the OmpSs version, the thread
management, the setup of the communications and data transfers to and
from the FPGA are all done internally by the Nanos++ runtime. This
way, the programmer does not need to write any line of code related to
low-level management, but only the directives triggering the
communications.
Indeed, the current compilation and runtime infrastructure of the
OmpSs programming model makes it possible to exploit the heterogeneous
characteristics of the Zynq All-Programmable SoC with the only effort
of two directive lines. Note however that Table 1 indicates that the
OmpSs version needs an additional third line. This line is a taskwait
before the program ends, as can be observed in Fig. 5. Actually, the
code shown in Fig. 5 is used to generate both the 32 × 32 and the
64 × 64 versions of the matrix multiplication, using all the available
resources (ARM cores and FPGA), simply by redefining the BS variable
as 32 or 64 elements.
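Fig. 5 is not reproduced in this section, so the following C sketch only suggests what a two-directive annotation plus a final taskwait could look like. The clause spellings follow common OmpSs conventions but are assumptions here, as are the function names mxm_block and mxm and the blocked storage layout. A plain C compiler without OpenMP enabled ignores the directives, and the code then runs sequentially with the same result.

```c
#ifndef BS
#define BS 64   /* block size; the text redefines it as 32 or 64 */
#endif

/* Directive 1: devices the task may run on (assumed clause spelling).
 * Directive 2: the task and its data dependences. */
#pragma omp target device(fpga,smp) copy_deps
#pragma omp task in([BS*BS]a, [BS*BS]b) inout([BS*BS]c)
void mxm_block(float *a, float *b, float *c)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++) {
            float sum = c[i * BS + j];
            for (int k = 0; k < BS; k++)
                sum += a[i * BS + k] * b[k * BS + j];
            c[i * BS + j] = sum;
        }
}

/* C += A * B, with each matrix stored as nb x nb contiguous blocks of
 * BS x BS elements (an assumed layout for this sketch). */
void mxm(float *A, float *B, float *C, int nb)
{
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int k = 0; k < nb; k++)
                mxm_block(&A[(i * nb + k) * BS * BS],
                          &B[(k * nb + j) * BS * BS],
                          &C[(i * nb + j) * BS * BS]);
/* Third line counted in Table 1: wait for all tasks before returning. */
#pragma omp taskwait
}
```

In this form each call to mxm_block would become a task, and the in/inout clauses would let a runtime order the tasks that update the same block of C while running independent blocks in parallel. Changing BS is the only edit needed to obtain a different block size.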
For the Pthreads and Accel versions however, different block sizes
need new scheduling schemes, adding more complexity to the
transformation of the code. Indeed, implementing a fourth version of
the code managing heterogeneous executions would require more
development time and more additional lines than the ones shown in
Table 1.
6.2. Performance results
In order to study the suitability of our approach to the High
Performance Computing (HPC) environment, it is necessary to
demonstrate that our system is not only easy to program but can also
achieve a reasonable performance when compared to other current
state-of-the-art approaches.
First, an evaluation of the best accelerator size for the selected
FPGA was performed. Fig. 8 shows the elapsed time for a 2048 × 2048
matrix multiplication using 1/2 accelerators of sizes (blocks)
64 × 64, 128 × 128 and 256 × 256. Results show that using 1/2 64 × 64
accelerators is the worst choice. This accelerator size is too small
for the problem since the data transfer to/from the FPGA outweighs the
computational benefits of using the FPGA. Indeed, as the communication
channel is shared, using two accelerators does not improve the
performance, which is bounded by the DMA. On the other hand, there is
a significant improvement when moving to 128 × 128 accelerators. Those
bigger accelerators compute blocks of four times the size of 64 × 64
accelerators and consequently the data movements are divided by four.
At the same time, 128 × 128 accelerators are 8 times more time
consuming than 64 × 64 accelerators, since doubling the block size
means eight times more multiplications, and so using two accelerators
can help to improve the performance. However, due to the limited FPGA
resources, the compiler is not able to make the two accelerators,
sharing resources, as fast as only one using all the resources. This
limit explains why the largest accelerator (blocks of 256 × 256) is
not able to be as fast as two 128 × 128 ones. Although the data
transfers are again divided by four, the accelerator is six times
slower than one 128 × 128 due to the limited resources, and this
results in a worse overall performance. One not so obvious, but
nevertheless important, result of Fig. 8 is that all the accelerators
were compiled and executed using the same source code (listed in
Fig. 5) changing only the block size (BS). Both the compiler and the
runtime take care of all the details.
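The per-block figures quoted above (four times the block size, eight times more multiplications) can be checked with a few lines of C. The helper names are hypothetical, and the task count assumes a straightforward blocked algorithm with one task per (i, j, k) block triple, which is an assumption of this sketch rather than something stated in the text.

```c
/* Elements in one bs x bs block. */
unsigned long block_elems(unsigned long bs)
{
    return bs * bs;
}

/* Multiply-accumulate operations in one bs x bs block product. */
unsigned long block_macs(unsigned long bs)
{
    return bs * bs * bs;
}

/* Block products needed for an n x n multiplication, assuming one
 * task per (i, j, k) triple of bs x bs blocks. */
unsigned long block_tasks(unsigned long n, unsigned long bs)
{
    unsigned long nb = n / bs;
    return nb * nb * nb;
}
```

Under these assumptions, block_elems(128) / block_elems(64) is 4 and block_macs(128) / block_macs(64) is 8, matching the ratios in the text, while block_tasks(2048, bs) drops from 32768 tasks for 64 × 64 blocks to 4096 for 128 × 128 and 512 for 256 × 256.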
Fig. 9 shows the time in microseconds that it takes to compute a
2048 × 2048 matrix multiplication, using the best block size, on two
different systems. Columns named i5 show the time used by the Core i5
machine described in Section 5.2 when using 1, 2 and 4 cores
respectively with the sgemm function of the MKL library. Column 706
shows the time used by the same computation when it is performed on a
Zynq 706 board using the code shown in Fig. 5 and the OmpSs
compilation and execution framework. As can be seen, the FPGA board is
competitive with the conventional SMP, with a result between 1 and 2
Core i5 cores in performance terms. Fig. 10 shows the energy
consumption of the same computation on the same machines. The FPGA
system clearly outperforms the conventional SMP in terms of energy
consumption, which shows that our approach is promising for future
computing systems.
What is more important from the results shown in Figs. 9 and 10 is
not that an FPGA of an older technology process can clearly outperform
a conventional processor in terms of energy, but the fact that writing
the code for the FPGA was actually simpler than writing the code for
the SMP. Indeed, as mentioned above, the FPGA code was directly the
one shown in Fig. 5, while the SMP code was changed to call the
parallel sgemm version of the MKL library instead of the original
matrix multiply function. So, OmpSs was not used for the Core i5
version. Arguably, the change was neither cumbersome nor extensive,
but the fact is that the naive original version of the MxM code,
although compiled with the -O3 optimization flag, performed much worse
(36× slowdown) than the MKL sgemm implementation, forcing us to change
the code to provide a fair evaluation. In our opinion this highlights
the potential of using the OmpSs programming model for heterogeneous
systems.
7. Related work 
The AXIOM project will exploit the OmpSs dataflow features in the
AXIOM heterogeneous architecture. OmpSs is the result of the
integration of StarSs [14] and OpenMP.
In this section we discuss some work that has been fundamental for
the development of this project and provided the necessary inspiration
and vision to develop some basic concepts related to the dataflow
execution model. Dataflow execution models have been studied for a
long time [15], as they provide a simple and elegant way to
efficiently move data from one computational thread to another
[16,17]. In the context of the TERAFLUX project [9,18] the dataflow
model was extended to multiple nodes executing seamlessly thanks to
the support of an appropriate memory model [7,10]. In this memory
model, a combination of consumer-producer patterns [8,19] and
transactional memory [20,21] permits a novel combination of dataflow
concepts and transactions in order to address consistency across
nodes, where each node is assumed to be cache-coherent, i.e., like a
classical multi-core. Dataflow models also allow the system to handle,
in a distributed way, faults that may compromise a node [22,23].
In order to integrate heterogeneous execution of the same
applications over processors and FPGA fabric, OmpSs@FPGA is a key
point in the project. Although to the best of our knowledge OmpSs@FPGA
[24,25] is the first successful attempt to implement hardware
accelerators from high-level directives in a totally transparent way,
other approaches have been used in the past. Some tools try to reduce
the FPGA programmability problem by offering the possibility of
generating HDL code from C or C-like languages, like ROCCC [26,27], or
by generating systems with an embedded soft processor connected to the
generated hardware accelerators, like LegUp [28] and the C2H tool
[29]. However, with the new SMP/FPGA SoCs, new strategies are required
in order to exploit those current heterogeneous and parallel
platforms. Unlike the other approaches, our ecosystem also covers
runtime support for parallel execution of heterogeneous tasks on those
SoCs.
The PGI [30] and HMPP [31] programming models are two other
approaches quite related to OmpSs. PGI uses compiler technology to
offload the execution of loops to the accelerators. HMPP also
annotates functions as tasks to be offloaded to the accelerators. We
think that OmpSs has higher potential in that it shifts part of the
intelligence that HMPP and PGI delegate to the compiler to the OmpSs
runtime system. Although these alternatives do support a fair amount
of asynchronous computations expressed as futures or continuations,
the level of lookahead they support is limited in practice.
To execute over several nodes, OmpSs@cluster [32] is one of the
alternatives explored in the project. As alternatives, Partitioned
Global Address Space (PGAS) programming models expose an abstracted
shared address space to the programmer, simplifying the programmer's
task, while data and thread locality awareness is kept to enhance
performance. Representative PGAS languages are UPC [33], and X10 [34]
and Chapel [35], which implement the Asynchronous PGAS model, offering
asynchronous parallelism. An alternative way to provide asynchronous
parallelism on clusters is a hybrid programming model that composes
SMPSs [36], which inspired OmpSs, with MPI. The main idea is to
encapsulate the communications in tasks so they are executed when the
data is ready. This technique achieves an asynchronous dataflow
execution of both communication and computation.
OpenCL [37] attempts to unify the programming models for
general-purpose multi-core architectures and the different types of
hardware accelerators (Cell B.E., GPUs, FPGAs, DSPs, etc.). The
participation of silicon vendors (e.g., Intel, IBM, NVIDIA, and AMD)
in the definition of this open standard ensures portability, low-level
access to the hardware, and supposedly high performance. We believe,
however, that OpenCL still exposes much of the low-level details
(i.e., explicit platform and context management, special kernel
intrinsic functions, explicit program, kernel and data transfer
management, etc.), making it cumbersome to use by non-experts.
Another alternative for multi-node programming is DSM. DSM is a
recently revived topic [38]. Some attempts at creating Software DSM
implementations for Linux have been carried out during the last
decades. Examples are Treadmarks (TMK), JIAJIA [39], Omni/SCASH
[40,41], Jump [42,43], Parade [44,45] and NanosDSM [46]. Some of these
projects only supported very specific hardware, and none of them has
been maintained during the last decade.
Regarding applications, state-of-the-art implementations of
video-surveillance or voice-identification scenarios currently rely on
machine learning techniques based on deep neural networks (DNNs). As
recent studies have pointed out, DNNs are particularly good for
addressing computer vision image classification and recognition
problems exhibiting highly non-linear properties. Applications ranging
from face recognition [47] and age/gender estimation [48] to
pedestrian detection [49] have experienced dramatic improvements in
terms of accuracy just by training DNNs with huge amounts of data. Due
to the architectural properties of these models and the advances in
HPC, it is now cost-effective to massively scale the infrastructure to
train such networks with millions of sample images that have been
previously manually tagged by humans on the widely-available social
networking services.
The proliferation of frameworks and libraries such as Caffe [50] and
cuDNN [51] has democratized the usage of parallelized DNN-based
solutions on GPU architectures. However, there is a lack of
ready-to-deploy implementations of DNN inference engines for embedded
platforms powered by FPGA accelerators. Since DNN evaluation is highly
parallel in nature, it is feasible to offload all the required SGEMM
matrix multiply operations to FPGAs, and also to execute forward
propagation in an efficient manner through the OmpSs programming
model.
Once DNN models are trained as a result of a process that usually
takes several days on a GPU cluster, it is then possible to evaluate
them on the AXIOM board. With this idea in mind, we aim to produce a
generic easy-to-use, low-power hardware/software stack to cheaply
deploy machine learning solutions based on DNNs interacting with the
Cyber-Physical world. This ecosystem powered by the AXIOM platform is
expected to solve a myriad of computer vision problems, and thus
dramatically improve productivity in a wide range of industries.
8. Conclusions and future work
In this paper, we have presented the software layers that we are
developing in the AXIOM H2020 European Project. The main objective of
the project is to bring to reality a novel small board which aims at
becoming a very powerful basic brick of future interconnected and
scalable embedded Cyber-physical systems, and specifically we focus on
the application domains of Video-surveillance, deep learning and
Smart-home. The module consists of both hardware and software that
will be designed and demonstrated in the project.
On one hand, the target board architecture will be a board based on
a SoC with several ARM cores and an FPGA, like the Xilinx Zynq, and
with the Arduino interface to be extensible. The AXIOM system will
comprise several such boards linked through custom communication
links, providing application memory coherence at software level. On
the other hand, we will research ways to ease the programmability of
the system, based on the OmpSs programming model and DSM-like
techniques to achieve a global system image for applications.
Currently, we are in the process of designing a high-speed
communications layer between boards. These communications will be
implemented using the transceivers available in the Zynq SoC. We have
also started looking at the application requirements to ensure that
our platform fits their needs.
The expected impacts obtained from the AXIOM project include a
platform interfacing with the physical world through compatibility
with Arduino shields. This platform is intended to become the hardware
and software platform for large scale production. In this sense we
want to develop an autonomous technology that is able to break the
Embedded Systems energy efficiency and programmability barriers. The
same set of technologies is expected to represent the base for future
European industrial exploitation in the HPC and Embedded Computing
markets. Finally, it is expected to provide the basis for new
European-level research at the forefront of the development of
extreme-performance system software and tools.
Our preliminary experiments have shown that the OmpSs programming
model increases the expressiveness of serial or pthreads programming,
thus allowing developers to focus on solving the issues related to the
algorithms, instead of dealing with the low-level details of the
communications among boards or data transfers between the cores and
the embedded FPGA. We also show that this ease of programming is
accompanied by competitive performance and lower energy consumption
when compared to standard processors.
The key features of the project presented in this paper are the
possibility to modularly enhance the capabilities of the board,
improve its interface with the physical world, and flexibly
reconfigure it for accelerating specific functions, while providing
energy efficiency and easy programmability.
Acknowledgment
We thankfully acknowledge the support of the European Union H2020
program through the AXIOM project (grant ICT-01-2014 GA 645496), the
Spanish Government, through the Severo Ochoa program (grant
SEV2015-0493), the Spanish Ministry of Science and Technology
(TIN2015-65316-P) and the Generalitat de Catalunya (MPEXPAR,
2014-SGR-1051 and 2014-SGR-1272).
References
[1] A. Goransson, D.C. Ruiz, Professional Android Open Accessory Programming
with Arduino, John Wiley & Sons, 2013.
[2] S. Monk, Programming Arduino Next Steps: Going Further with Sketches, 1st ed.,
McGraw-Hill Professional, USA, 2013.
[3] E. Ayguadé, R.M. Badía, D. Cabrera, A. Durán, M. González, F. Igual, D. Jiménez,
J. Labarta, X. Martorell, R. Mayo, J.M. Pérez, E.S. Quintana-Orti, A proposal to
extend the OpenMP tasking model for heterogeneous architectures, in: IWOMP,
5568, 2009, pp. 154–167.
[4] V. Pillet, J. Labarta, T. Cortes, S. Girona, PARAVER: a Tool to Visualize and
Analyze Parallel Code, Technical Report UPC-CEPBA-95-03, European Center for
Parallelism of Barcelona (CEPBA), Universitat Politècnica de Catalunya (UPC),
1995.
[5] R. Ferrer, S. Royuela, D. Caballero, A. Durán, X. Martorell, E. Ayguadé,
Mercurium: design decisions for a s2s compiler, Cetus Users and Compiler
Infrastructure Workshop in conjunction with PACT 2011, 2011.
[6] R. Giorgi, A. Scionti, A scalable thread scheduling co-processor based on
data-flow principles, ELSEVIER Future Gener. Comput. Syst. (0) (2015) 1–10,
doi: 10.1016/j.future.2014.12.014.
[7] R. Giorgi, TERAFLUX: exploiting dataflow parallelism in teradevices, in: ACM
Computing Frontiers, 2012, pp. 303–304, doi: 10.1145/2212908.2212959.
[8] N. Ho, A. Mondelli, A. Scionti, M. Solinas, A. Portero, R. Giorgi, Enhancing
an x86_64 multi-core architecture with data-flow execution support, in: ACM
Proc. of Computing Frontiers, 2015, pp. 1–2, doi: 10.1145/2742854.2742896.
[9] R. Giorgi, et al., TERAFLUX: harnessing dataflow in next generation teradevices,
Microprocess. Microsyst. 38 (8, Part B) (2014) 976–990.
[10] R. Giorgi, P. Faraboschi, An introduction to df-threads and their execution
model, in: IEEE MPP, 2014, pp. 60–65, doi: 10.1109/SBAC-PADW.2014.30.
[11] G. Burresi, R. Giorgi, A field experience for a vehicle recognition system using
magnetic sensors, in: IEEE MECO, 2015, pp. 1–6.
[12] N. Puzovic, S. McKee, R. Eres, A. Zaks, P. Gai, W. S., R. Giorgi, A multi-pronged
approach to benchmark characterization, in: IEEE CLUSTER, 2010, pp. 1–4,
doi: 10.1109/CLUSTERWKSP.2010.5613090.
[13] A. Scionti, S. Kavvadias, R. Giorgi, Dynamic power reduction in self-adaptive
embedded systems through benchmark analysis, in: IEEE MECO, 2014,
pp. 62–65.
[14] J. Planas, R. Badía, E. Ayguadé, J. Labarta, Hierarchical task-based programming
with StarSs, Int. J. High Perform. Comput. Appl. 23 (3) (2009) 284–299.
[15] F. Yazdanpanah, C. Álvarez-Martínez, D. Jiménez-González, Y. Etsion, Hybrid
dataflow/von-neumann architectures, IEEE Trans. Parallel Distrib. Syst. 25 (6)
(2014) 1489–1509, doi: 10.1109/TPDS.2013.125.
[16] L. Verdoscia, R. Vaccaro, R. Giorgi, A clockless computing system based on
the static dataflow paradigm, in: Proc. IEEE Int'l Workshop on Data-Flow
Execution Models for Extreme Scale Computing (DFM-2014), 2014, pp. 30–37,
doi: 10.1109/DFM.2014.10.
[17] L. Verdoscia, R. Vaccaro, R. Giorgi, A matrix multiplier case study for an
evaluation of a configurable dataflow-machine, in: ACM CF’15 - LP-EMS, 2015,
pp. 1–6, doi: 10.1145/2742854.2747287.
[18] M. Solinas, et al., The TERAFLUX project: Exploiting the dataflow paradigm in
next generation teradevices, in: DSD, 2013, pp. 272–279.
[19] N. Ho, A. Portero, M. Solinas, A. Scionti, A. Mondelli, P. Faraboschi, R. Giorgi,
Simulating a multi-core x86_64 architecture with hardware isa extension
supporting a data-flow execution model, in: IEEE Proceedings of the AIMS-2014,
Madrid, Spain, 2014, pp. 264–269, doi: 10.1109/AIMS.2014.41.
20] R. Giorgi , Accelerating haskell on a dataﬂow architecture: a case study includ-
ing transactional memory, in: CEA, 2015a, pp. 91–100 . 
[21] R. Giorgi , Transactional memory on a dataﬂow architecture for accelerating
haskell, WSEAS Trans. Comput. 14 (2015b) 794–805 . 
22] S. Weis , A. Garbade , J. Wolf , B. Fechner , A. Mendelson , R. Giorgi , T. Ungerer , A
fault detection and recovery architecture for a teradevice dataﬂow system, in:
IEEE DFM), 2011, pp. 38–44 . 
23] S. Weis , et al. , Architectural support for fault tolerance in a teradevice dataﬂow
system, Springer Int’l J. Parallel Program. (2014) 1–25 . 
[24] A. Filgueras, E. Gil, D. Jiménez-González, C. Álvarez, X. Martorell, J. Langer,
J. Noguera, K. Vissers, Ompss@zynq all-programmable soc ecosystem, in:
Proceedings of the 2014 ACM/SIGDA International Symposium on
Field-programmable Gate Arrays, in: FPGA ’14, ACM, New York, NY, USA, 2014,
pp. 137–146, doi: 10.1145/2554688.2554777.
[25] A. Filgueras, E. Gil, C. Álvarez, D. Jiménez-González, X. Martorell, J. Langer,
J. Noguera, Heterogeneous tasking on smp/fpga socs: The case of ompss and
the zynq, in: 2013 IFIP/IEEE 21st International Conference on Very Large Scale
Integration (VLSI-SoC), 2013, pp. 290–291, doi: 10.1109/VLSI-SoC.2013.6673293.
[26] J.R. Villarreal, A. Park, W.A. Najjar, R. Halstead, Designing modular hardware
accelerators in c with roccc 2.0, in: R. Sass, R. Tessier (Eds.), FCCM, IEEE
Computer Society, 2010, pp. 127–134.
[27] W.A. Najjar, J.R. Villarreal, Fpga code accelerators - the compiler perspective,
in: DAC, 2013, p. 141.
[28] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J.H. Anderson, S. Brown,
T. Czajkowski, Legup: High-level synthesis for fpga-based processor/accelerator
systems, in: Proceedings of the 19th ACM/SIGDA International Symposium on
Field Programmable Gate Arrays, in: FPGA ’11, ACM, New York, NY, USA, 2011,
pp. 33–36, doi: 10.1145/1950413.1950423.
[29] Altera, Corp., Nios II C2H Compiler User Guide, 2009. URL: www.altera.com
[30] PGI Accelerator Programming Model for Fortran & C, The Portland Group, 2010.
[31] R. Dolbeau, S. Bihan, F. Bodin, HMPP: a hybrid multi-core parallel programming
environment, First Workshop on General Purpose Processing on Graphics
Processing Units, 2007.
[32] J. Bueno, L. Martinell, A. Durán, M. Farreras, X. Martorell, R.M. Badía,
E. Ayguadé, J. Labarta, Productive cluster programming with ompss, in:
Proceedings of the 17th International Conference on Parallel Processing - Volume
Part I, Euro-Par’11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 555–566.
[33] UPC Consortium, UPC Language Specifications v1.2, Report Number:
LBNL-59208, 2005.
[34] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von
Praun, V. Sarkar, X10: an object-oriented approach to non-uniform cluster
computing, SIGPLAN Not. 40 (10) (2005) 519–538,
doi: 10.1145/1103845.1094852.
[35] B. Chamberlain, D. Callahan, H. Zima, Parallel programmability and the chapel
language, Int. J. High Perform. Comput. Appl. 21 (3) (2007) 291–312,
doi: 10.1177/1094342007078442.
[36] V. Marjanovic, J. Labarta, E. Ayguadé, M. Valero, Effective communication and
computation overlap with hybrid mpi/smpss, SIGPLAN Not. 45 (5) (2010)
337–338, doi: 10.1145/1837853.1693502.
[37] Khronos OpenCL Working Group, The OpenCL Specification, version 1.2, 2011.
URL https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
[38] S. Kaxiras, D. Klaftenegger, M. Norgren, A. Ros, K.F. Sagonas, Turning
centralized coherence and distributed critical-section execution on their head: A new
approach for scalable distributed shared memory, in: Proc. of HPDC, 2015,
pp. 3–14.
[39] Jiajia, http://www-users.cs.umn.edu/~tiane/paper/dist.htm.
[40] Omni/scash, http://www.pcs.cs.tsukuba.ac.jp/omni-compiler/doc/omniscash.html.
[41] M. Hess, G. Jost, M. Müller, R. Rühle, Experiences using OpenMP based on
Compiler Directed Software DSM on a PC Cluster, Workshop on OpenMP
Applications and Tools (WOMPAT’02), 2002.
[42] The jump software dsm system, http://www.snrg.cs.ku.hk/srg/html/jump.htm.
[43] C.L.W.B. Cheung, K. Hwang, Jump-dp: A software dsm system with low-latency
communication support, PDPTA, 2000.
[44] Parade, http://peace.snu.ac.kr/researc/parade/.
[45] Y. Kee, J. Kim, S. Ha, ParADE: an OpenMP Programming Environment for SMP
Cluster Systems, Supercomputing 2003 (SC’03), 2003.
[46] J.J. Costa, T. Cortes, X. Martorell, E. Ayguadé, J. Labarta, Paper running openmp
applications efficiently on an everything-shared sdsm, (JPDC) 6 (5) (2006)
647–658.
[47] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to
human-level performance in face verification, in: IEEE Conference on Computer
Vision and Pattern Recognition, IEEE, 2014, pp. 1701–1708.
[48] I. Huerta, C. Fernández, C. Segura, J. Hernando, A. Prati, A deep analysis on age
estimation, Pattern Recognit. Lett. 68 (2015) 239–249.
[49] A. Angelova, A. Krizhevsky, V. Vanhoucke, A. Ogale, D. Ferguson, Real-time
pedestrian detection with deep network cascades, in: Proceedings of BMVC
2015, 2015.
[50] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding,
in: Proceedings of the ACM International Conference on Multimedia, ACM,
2014, pp. 675–678.
[51] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
E. Shelhamer, cuDNN: efficient primitives for deep learning, arXiv preprint
arXiv:1410.0759, 2014. URL http://arxiv.org/abs/1410.0759
Carlos Álvarez received the M.S. and Ph.D. degrees in Computer Science from the Technical
University of Catalunya (UPC) in 1998 and 2007, respectively. He currently holds a position as
Tenured Assistant Professor in the Computer Architecture Department at UPC, BarcelonaTech, and
is an associated researcher at the Computer Sciences - Programming Models Department at BSC-CNS.
His research interests cover the areas of parallel architectures, runtime systems and
reconfigurable solutions for high-performance multiprocessor systems. He has co-authored more
than 40 publications in international journals and conferences. He is currently advising 1 PhD
student and has co-advised 2 PhD theses. He has been participating in the HiPEAC Network of
Excellence and in the TERAFLUX and AXIOM European projects.

Eduard Ayguadé received the Engineering degree in Telecommunications in 1986 and the Ph.D.
degree in Computer Science in 1989, both from the Universitat Politècnica de Catalunya (UPC),
Spain. Since 1987 Prof. Ayguadé has been lecturing at the Computer Science School (FIB) and
Telecommunications Engineering (ETSETB), both in Barcelona. Currently, and since 1997, he is
full professor of the Computer Architecture Department at UPC. Prof. Ayguadé has lectured a
number of (undergraduate and graduate) courses related to computer organization and
architecture, parallel programming models and their implementation. Prof. Ayguadé is also
involved in the Computer Architecture and Technology PhD Program at UPC, where he has
(co-)advised more than 20 PhD theses, in topics related to his research interests: multicore
architectures, parallel programming models and their architectural support, and compilers for
HPC architectures. In these research topics, Prof. Ayguadé has published more than 300 papers
and participated in several research projects in the framework of the European Union and
research collaborations with companies related to HPC technologies (IBM, Intel, Nvidia,
Microsoft and Samsung). Currently Prof. Ayguadé is associated director for research on the
Computer Sciences Department at the Barcelona Supercomputing Center (BSC-CNS), the National
Center for Supercomputing in Spain, located in Barcelona.

Jaume Bosch completed an engineer's degree in computer science at the Barcelona School of
Informatics (FIB) of Universitat Politècnica de Catalunya - BarcelonaTech (UPC) in 2015 and he
is studying a Master in High Performance Computing at the same School. Currently, he is working
at the Programming Models Group of Barcelona Supercomputing Center - Centro Nacional de
Supercomputación (BSC-CNS).

Javier Bueno Hedo holds a PhD degree in Computer Science from the Technical University of
Catalonia (UPC). He became involved in research in 2004, when he started as a part-time student
in the European Center of Parallelism of Barcelona (CEPBA) working with Software Distributed
Memory Systems. In 2006 he became a full-time junior researcher at the Barcelona Supercomputing
Center (BSC) and continued his work on distributed systems. From 2010 to 2015 he worked on his
thesis, which provided the OmpSs programming model with support for clusters of multi-cores and
clusters of GPUs. This work has also been applied to different research projects such as the
Mont-Blanc2 project. His current research aims to produce new programming models and tools to
ease the complexity of developing applications for modern HPC systems.

Artem Cherkashin completed an engineer's degree in computer science at the Barcelona School of
Informatics (FIB) of Universitat Politècnica de Catalunya - BarcelonaTech (UPC) in 2015.
Currently, he is studying a Master degree in High Performance Computing at the same school,
while working at the Programming Models Group of Barcelona Supercomputing Center - Centro
Nacional de Supercomputación (BSC-CNS).

Antonio Filgueras received a degree in computer science at Universitat Politècnica de Catalunya
- BarcelonaTech (UPC) in 2012. He is currently working at the Programming Models group of
Barcelona Supercomputing Center and participating in the AXIOM European project. His research
interests are focused on heterogeneous and reconfigurable solutions for high performance
computing and their programmability.

Daniel Jiménez-González received the M.S. and Ph.D. degrees in Computer Science from the
Technical University of Catalunya (UPC) in 1997 and 2004, respectively. He currently holds a
position as Tenured Assistant Professor in the Computer Architecture Department at UPC,
BarcelonaTech, and is an associated researcher at the Computer Sciences - Programming Models
Department at BSC-CNS. His research interests cover the areas of parallel architectures, runtime
systems, compilers and reconfigurable solutions for high-performance multiprocessor systems.
Dr. Jiménez-González has coauthored more than 40 publications in international journals and
conferences. He is currently co-advising 1 PhD student and has co-advised 2 PhD students. He has
been participating in the HiPEAC Network of Excellence and in the SARC, ACOTES, TERAFLUX, AXIOM
and PRACE European projects.

Xavier Martorell received the M.S. and Ph.D. degrees in Computer Science from the Technical
University of Catalunya (UPC) in 1991 and 1999, respectively. Since 1992 he has lectured on
operating systems, parallel runtime systems and OS administration. He has been an associate
professor in the Computer Architecture Department at UPC since 2001. His research interests
cover the areas of operating systems, runtime systems, compilers and applications for
high-performance multiprocessor systems. Dr. Martorell has participated in several long-term
research projects with other universities and industries, primarily in the framework of the
European Union ESPRIT, IST and FET programs. He spent one year working with the BG/L team in the
IBM Watson Research Center. He has coauthored more than 60 publications in international
journals and conferences. He has co-advised three Ph.D. theses and he is currently advising 3
PhD students. He is currently the Manager of the Parallel Programming Models team at the
Barcelona Supercomputing Center. He has been participating in the HiPEAC Network of Excellence
and in the SARC, ACOTES, Intone, POP, ENCORE, MontBlanc (I and II), DEEP/DEEP-ER and the AXIOM
European projects.

Nacho Navarro (1958–2016, in memoriam) is Associate Professor at the Universitat Politecnica de
Catalunya (UPC), Barcelona, Spain, and Senior Researcher at the Barcelona Supercomputing Center
(BSC), serving as manager of the Accelerators for High Performance Computing group. He holds a
Ph.D. degree in Computer Science from UPC. His current interests include: GPGPU computing,
multi-core computer architectures, hardware accelerators, dynamic reconfigurable logic support,
memory management and runtime optimizations. He is also doing research on massively parallel
accelerators like GPUs in collaboration with the University of Illinois (IMPACT Research Group).
Prof. Navarro is a member of IEEE, the IEEE Computer Society, the ACM and the HiPEAC NoE.

Miquel Vidal received the B.S. degree in Computer Science in 2015 from the Technical University
of Catalunya (UPC). Currently he is studying an M.S. in High-Performance Computing while working
at the Programming Models group at Barcelona Supercomputing Center (BSC-CNS) within the AXIOM
European project. His research interests are focused on parallel architectures, multiprocessor
systems, and heterogeneous and reconfigurable solutions for high-performance computing, as well
as their use on bioinformatics applications.

Dimitris Theodoropoulos obtained his Diploma (5-year degree) and M.Sc degree respectively from
the Electronic and Computer Engineering department at the Technical University of Crete, Greece.
In 2007, he joined the Computer Engineering department of the Delft University of Technology,
the Netherlands, where he received his PhD. In 2011, he joined the Computer Architecture and
VLSI Systems group at the Foundation for Research and Technology - Hellas (FORTH) in Greece,
where he is working as a post-doc researcher for national and international research projects.
His research interests are in the domains of Embedded Systems, Computer Architecture, and
Reconfigurable computing.

Dionisios Pnevmatikatos is a Professor and former Chair of the Electronic and Computing
Engineering Department, Technical University of Crete and a Researcher at the Computer
Architecture and VLSI Systems (CARV) Laboratory of the Institute of Computer Science, FORTH in
Greece. He received his B.Sc. degree in Computer Science from the Department of Computer
Science, University of Crete in 1989 and M.Sc. and Ph.D. degrees in Computer Science from the
Department of Computer Science, University of Wisconsin-Madison in 1991 and 1995 respectively.
His research interests are in the broader area of Computer Architecture, where he investigates
the Design and Implementation of High-Performance and Cost-Effective Systems, Reliable System
Design, and Reconfigurable Computing.

Davide Catani is R&D manager for ARM-based platforms at SECO. He graduated in electronic
engineering at University of Florence in 2006 with a graduation thesis developed at Cesvit
Microelettronica focused on the implementation of a USB macrocell on FPGA. He joined SECO in
2006 and has been developing ARM-based systems since 2010, mainly focusing on industrial
applications. Davide contributed to hardware development of the systems used to build Tibidabo
and Pedraforca ARM-based clusters at BSC and to CARMA and Kayla platforms aimed at developing
CUDA based applications on ARM-based systems.

David Oro received the B.S. and M.S. degrees in Computer Science from Universitat Politècnica
de Catalunya (UPC) in 2006, and an M.S. degree in Computer Architecture in 2011, also from UPC.
He started his professional career in 2005 working as a consultant in performance monitoring
solutions. In 2009, he joined the Barcelona Digital Technology Centre where he held a research
position on online banking cybercrime mitigation for CaixaBank. Currently, he works for Herta
Security leading the GPU parallelization of several products. He has published several papers in
international peer-reviewed conferences and holds two patents. His research interests include
computer architecture, GPU computing and malware analysis.

Carles Fernández received his B.S. in Telecommunication Eng. and M.S. in Language and Speech
from the Technical University of Catalonia (UPC) in 2005. He received an M.S. in Computer Vision
and AI from the Autonomous University of Barcelona (UAB) in 2008, where he obtained his Ph.D.
cum laude in 2010, receiving the 2010 Extraordinary Ph.D. Award. He has published more than 40
scientific articles in international journals and conferences. Currently he leads the research
team at Herta Security. His research interests include biometrics, computer vision, and machine
learning, particularly unconstrained facial analysis in image and video.

Carlos Segura received the B.S. and M.S. degrees in Telecommunication Engineering at the
Universitat Politècnica de Catalunya (UPC) in 2003, the M.S. degree at the Technical University
of Berlin (TU-Berlin) in 2003 and the Ph.D. cum laude degree in Computer Science from the UPC in
2011. He worked as a research fellow at the TU-Berlin in 2003 and in UPC from 2005 to 2011.
Later he joined the company Herta Security under the Torres Quevedo program as the Director of
Innovation until 2015. Currently he is working in Telefónica I+D as a speech scientist. He has
participated in three national research projects and three EU research projects, and has
published more than twenty scientific papers in peer-reviewed international journals and
international conferences. His research interests include speaker localization and tracking,
multimedia signal processing, computer vision and machine learning.

Javier Rodríguez Saeta received the B.S., M.S. and Ph.D. degrees in Telecommunication
Engineering from the Technical University of Catalonia, UPC, Barcelona (Spain), in 2000 and
2005, respectively. He has also received the B.A. degree in Business Administration from the
Open University of Catalonia (UOC), and the MBA from ESADE Business School. In 2000 he worked for
Robert Bosch, GmbH, in Hildesheim (Germany). In 2001, he joined Biometric Technologies, S.L., in
Barcelona (Spain), where he was the R&D Manager. He founded Herta Security in 2009 and became
the CEO of the company. He has published more than 20 papers in different magazines and
workshops, and he holds three patents. His main research interests include all issues related to
innovation, security and biometric systems and applications.

Javier Hernando received the M.S. and Ph.D. degrees in telecommunication engineering from the
Technical University of Catalonia (UPC), Barcelona, Spain, in 1988 and 1993, respectively. Since
1988, he has been with the Department of Signal Theory and Communications, UPC, where he is a
Professor and a member of the Research Center for Language and Speech (TALP). He was a Visiting
Researcher at the Panasonic Speech Technology Laboratory, Santa Barbara, CA, during the academic
year 2002–2003. His research interests include robust speech analysis, speech recognition,
speaker verification and localization, oral dialogue, and multimodal interfaces. He is the
author or coauthor of about two hundred publications in book chapters, review articles, and
conference papers on these topics. He has led the UPC team in several European, Spanish and
Catalan projects. Prof. Hernando received the 1993 Extraordinary Ph.D. Award of UPC.

Claudio Scordino received the Master Degree in Computer Engineering from the University of Pisa
in 2003. In 2007 he received the PhD from the same university. His main research interests
include real-time scheduling, operating systems and programming models. He has collaborated with
the Linux kernel community since 2008, having several patches integrated in the official Linux
kernel.

Paolo Gai, CEO, graduated (cum laude) in Computer Engineering at University of Pisa in 2000 with
a graduation thesis developed at the ReTiS Laboratory of the Scuola Superiore Sant’Anna on the
development of the modular real-time kernel SHaRK. He obtained the PhD from Scuola Superiore
Sant’Anna in 2004. In 2000 he founded the ERIKA Enterprise project, an open-source RTOS which
recently reached the OSEK/VDX certification, and which is currently used by various industries
and universities. Since 2002 he is CEO and founder of Evidence Srl, an SME working on operating
systems and code generation for Linux- and ERIKA-based industrial products in the automotive and
white goods market. Since 2011 he is President and founder of SSG Srl, providing hardware
turnkey solutions for the white goods market. His research interests include development of hard
real-time architectures for embedded control systems, multi-processor systems, object-oriented
programming, real-time operating systems, scheduling algorithms and multimedia applications.

Pierluigi Passera, R&D Director of Vimar SPA, wiring devices, Home&Building Automation
(present). EMEAS industrial Deployment within Schneider Electric (2012-2010). R&D and Production
Director in various Schneider Electrics units (2010-2001). Gewiss SPA laboratory Manager
(2000-1996). ABB SACE basic research department.

Alberto A. Pomella, Electronics & Software R&D Manager of Vimar S.p.A., Standalone and Home and
Building Automation products (present). R&D Director at CRS (2001–2003), Home automation
Products; UX and embedded PC development group at SELCA S.p.A. (1992–2001); Project Validation
Group for consumer PC at ASEM (1991–1992). Degree in Electronic Engineering from Politecnico of
Torino in 1990, with specialization in software development and industrial automation.

Nicola Bettin earned his B.S degree in Electronic Engineering at University of Padua and in 2011
he obtained his M.S degree in Electronic Engineering at University of Bologna. In 2012 he joined
the Technology Transfer Team T3LAB, in Bologna, and co-founded the FPGA Group. He did research
in the design of a standard HW/SW architecture for machine vision and developed commercial
solutions for processing multimedia data streams in embedded systems. His main interests were
FPGA solutions and heterogeneous multi-core system-on-chip solutions. He joined the electronic
R&D dept. at Vimar Group in 2015 and his research activity is mainly focused on human
interaction with smart home systems.

Antonio Rizzo, Full Professor of Interaction Design, Università di Siena and Co-founder UDOO
(Present). Director ’Academy of Digital Arts and Science’ - ArsNova (2000 - 2009). Chair of the
European Association of Cognitive Ergonomics (2000 - 2006). Member of WG30 NATO Human Factors
and Human Reliability Group (1999 - 2002). Member of the Programme Incitatif de Recherche sur
l'Education et la Formation (PIREF) of the French Government (2002 - 2003). Head of the Human
Factor Group of the Italian National Railways (1996–1999). Liaison for Apple Inc. for the Apple
Design Project (1996 - 1997).

Roberto Giorgi is an Associate Professor at the Department of Information Engineering, University
of Siena, Italy. He was Research Associate at the University of Alabama in Huntsville, USA. He
received his PhD in Computer Engineering and his Master in Electronics Engineering, Summa cum
Laude, both from University of Pisa, Italy. He is the coordinator of the European Project AXIOM.
He coordinated the TERAFLUX project in the area of Future and Emerging Technologies for
Teradevice Computing. He is participating in the European projects HiPEAC (High Performance
Embedded-system Architecture and Compiler), ERA (Embedded Reconfigurable Architectures). He
contributed to SARC (Scalable ARChitectures), ChARM (performance evaluation of ARM-processor
based embedded systems). His current interests include Computer Architecture themes such as
Embedded Systems, Multiprocessors, Memory System Performance, Workload Characterization.
