Abstract-People and objects will soon share the same digital network for information exchange, in what is known as the age of cyber-physical systems. The general expectation is that people and systems will interact in real time. This puts pressure on system design to support increasing demands for computational power while keeping a low power envelope. Additionally, modular scaling and easy programmability are important to ensure that these systems become widespread. This whole set of expectations imposes scientific and technological challenges that need to be properly addressed.
I. INTRODUCTION
We are entering the Cyber-Physical age, in which both objects and people will become nodes of the same digital network for exchanging information. The expectation is therefore that "things" or systems will become somewhat as smart as people, permitting rapid and close interaction not only human-human and system-system, but also human-system and system-human. More precisely, we expect that such Cyber-Physical Systems (CPSs) will at least react in real time, provide enough computational power for the assigned tasks, consume the least possible energy for those tasks (energy efficiency), allow for easy programmability, scale through modularity, and make the best use of existing standards at minimal cost.
The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new hardware/software architectures for CPSs in which the above expectations can be realized. The project started in February 2015 and will span three years. The project is coordinated by the University of Siena (UNISI), which also leads the evaluation part of the project. The Foundation for Research and Technology-Hellas (FORTH) develops the interconnection between boards. Barcelona Supercomputing Center (BSC) is responsible for the OmpSs programming model and software toolchain. Partner EVIDENCE takes the lead on the development of the runtime systems. Partner SECO designs and builds the prototype board. Partner HERTA Security provides a video-surveillance use case. Partner VIMAR provides a smart-building use case.
The specific objectives of the AXIOM project are:
• Realizing a small board that is flexible, energy efficient and modularly scalable. We will use an ARM- and FPGA-based chip with custom high-speed interconnects to build the AXIOM prototype board.
• Easy programmability of multi-core, multi-board, FPGA nodes with the OmpSs programming model, together with improved thread management and real-time support from the operating system. The software will be Open-Source.
• Easy interfacing with the Cyber-Physical world, based on Arduino shields [1] that can be plugged onto the board.
• Contributing to standards, in the context of the Standardization Group for Embedded Systems (SGET) and OpenMP.
The expected impacts of the AXIOM project are:
• Platform interfacing with the physical world. The AXIOM project intends to create a platform that enables the designed module to interact with the physical world through compatibility with Arduino shields [2].
• Production platform. The AXIOM design is intended to become the hardware and software platform for large-scale production.
• Development of autonomous technology. This technology is expected to break the energy efficiency and programmability barriers of Embedded Systems. The same set of technologies is expected to form the basis for future European industrial exploitation in the HPC and Embedded Computing markets.
• A basis for new European-level research at the forefront of the development of extreme-performance system software and tools.
II. APPLICATION DOMAINS AND EXAMPLES OF USE
AXIOM will be applied in two real-life application domains: Video-surveillance and Smart-home. They will serve as benchmarks for assessing the potential and the limits of the proposed architecture. The two application domains have been chosen for the different kinds of challenges they pose to processing capabilities.
A. Video-surveillance
Intelligent multi-camera video-surveillance is a multidisciplinary field related to computer vision, pattern recognition, signal processing, communications, embedded computing and image sensors. Smart video-surveillance has a wide variety of applications both in public and private environments, such as homeland security, crime prevention, facial marketing and traffic control, among others.
These applications are generally very computationally demanding, since they require monitoring very diverse indoor and outdoor scenes, including airports, hotels or shopping malls, which usually involve highly varying environments. In many cases it is also necessary to analyse multiple camera video streams, particularly when object re-identification or tracking of individuals across cameras is required. Figure 1 shows a scenario where people may be observed and recognized with different objectives: statistics, traffic control, accident prediction and detection, etc. Another crowded scenario is a large company with hundreds of employees who work in different places or buildings: an employee A in any room requests a videoconference with a colleague (in any place, in any building), and AXIOM detects in real time where this person is and asks for his or her permission to begin a room-to-room videoconference by telling that person that A is requesting a videoconference. Real-time recognition may also help emergency vehicles avoid traffic jams, by analysing the traffic camera images in real time.
B. Smart-home
Smart home refers to buildings empowered by ICT in the context of the merging of Ubiquitous Computing and the Internet of Things: the widespread instrumentation of buildings with sensors, actuators and cyber-physical systems makes it possible to collect, filter and produce more and more information locally, to be further consolidated and managed globally according to business functions and services. A smart home is one that uses operational and IT technologies and processes to make it a better performing building: one that delivers lower operating costs, uses less energy, maximizes system and equipment lifetime value, is cyber-secured and produces measurable value for multiple stakeholders.
Major challenges in such environments concern cryptography, self-testing and, above all, sensor-network management. Sensor data brings numerous computational challenges in the context of data collection, storage, and mining. In particular, learning from data produced by a sensor network poses several issues: sensors are distributed; they produce a continuous flow of data, possibly at high speeds; they act in dynamic, time-changing environments; and the number of sensors can be very large and dynamic. These issues require the design of efficient solutions for processing data produced by sensor networks. Figure 2 shows different home and office scenarios where AXIOM can help with preventive and interactive maintenance of infrastructures, and with climate and temperature management. This management can be remotely controlled, helping to improve energy efficiency in homes, apartments and company office buildings. For instance, AXIOM may detect patterns of behaviour in a company office building in order to adapt climate control and light switching to the working habits of the employees.
The two application domains also pose common challenges, such as board-to-board communication and easy programmability. Furthermore, the two scenarios shown can easily converge, offering opportunities for synergies and emerging services in the respective domains.
C. Examples of Use
We are currently considering a wide range of potential uses for both Video-surveillance and Smart-home. They range from Dynamic retail demand forecasting to Smart marketing in shopping malls, for Video-surveillance, and from Smart home comfort to Autonomous drones for infrastructure control, for Smart-home. Below is a sample of these scenarios, with the goals expressed in terms of the final users of the enabling technology. A further scenario from our exploration, related to vehicle detection, is discussed in another paper [3]. At the same time, these goals should match the challenges posed to the AXIOM processing capabilities:
• Dynamic retail demand forecasting. Due to the high fluctuation of passengers departing and arriving at train stations, demand for station retailers varies strongly over time. By forecasting such demand through video analysis, better services can be provided through more efficient staff utilisation. The purpose of this scenario is to provide retailers with a real-time forecast of potential customers arriving at their outlets, to allow for better task allocation and to increase business efficiency.
• Smart marketing in shopping mall. Consumer behaviour in a shopping mall can be very eclectic, yet awareness of patterns of behaviour can help both service providers and clients to meet their respective goals. Demographic analysis is carried out over the captured facial snapshots, helping to identify interesting facts such as the demographic profiles of the customers, or how they are distributed across gender and age segments. The visitors are tracked from one camera to another, so as to discover the main paths they take through the mall and how long they stay at different locations. The goal is to collect statistical information about the visitors in order to define marketing strategies both for service providers and for clients.
• Smart home comfort. Comfort perception and needs can differ with the time of the day or week and with the characteristics of the people actually occupying a space at a given moment. The smart home is required to identify and manage the different situations, and to react to people's indications in an easy and smooth way. Networked sensors and actuators are distributed in each room, embedded in ordinary appliances. The appliances perform their normal primary function, but also collect different kinds of information, such as presence detection, temperature, humidity, window and door opening, air quality and audio. The objective of the smart home comfort autopilot is to minimize power consumption and to guarantee people's comfort and well-being, without giving the impression of reducing people's freedom and capacity for control.
• Autonomous rover/drone for infrastructure control. Preventive maintenance is performed on equipment to keep it running smoothly and efficiently and to help extend its life. Many types of equipment should be put on a preventive maintenance program: HVAC systems, pumps and air compressors, air conditioning, chillers and absorption equipment, elevators, safety showers, back-flow preventers, building exteriors, roofs, windows, fire doors and generators. Autonomous rovers and drones equipped with thermal cameras and ambient sensors can move inside and outside a building to monitor the energy flow, providing data for multi-level energy flow models that can be used for preventive maintenance. The goal is to keep the building infrastructure efficient, manage operating costs, and minimize potential downtime. Preventive maintenance also ensures that these components perform within their originally designed operating parameters, giving data center managers the opportunity to replace components before they fail.
The modular approach explored by AXIOM is particularly well-suited for tackling such challenging scenarios, as it addresses the issues derived from their computational complexity, distributed nature, and need for synchronization among processes. Moreover, to drive the design of the software stack, we are considering some representative benchmarks that two partners already explored in the ERA project [4], [5].
III. THE AXIOM PLATFORM
The AXIOM platform will be built around FPGA-based SoCs, as exemplified by the Zynq platform by Xilinx. Zynq devices feature a dual-core ARM Cortex A9 processor running at 600-800 MHz, closely connected to an FPGA fabric. The closeness of the connection (and hence the low latency) and the flexibility of the reconfigurable FPGA logic make the combination very powerful in terms of customisation. In addition, Zynq devices feature gigabit-rate transceivers that will be used to provide ample communication bandwidth between AXIOM nodes.
In terms of connectivity, AXIOM, besides including classical connectivity (e.g., Internet), will also bring modularity to the next level, allowing the construction of more compute-intensive and high-performance systems through a low-cost but scalable high-speed interconnect. This interconnect, a subject of research and design during the project, will utilise relatively low-cost SATA connectors to interconnect multiple boards. Such connectivity will make it possible to build (or upgrade at a later moment) flexible and low-cost systems simply by re-using the same basic (small) module, without the need for costly connectors and cables.
We will provide three bi-directional links per board, so that the nodes can be connected in many different ways, ranging from a ring, to the well-established 2D-mesh/torus, and up to arbitrary 3-D topologies such as mesh/torus. We envision the AXIOM interconnect having customisable parameters (such as packet size, formats, etc.) if needed by applications, further improving efficiency and performance.
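As a rough illustration of the simplest topology mentioned above, the following C sketch computes the two ring neighbours of a board and the minimal hop count between two boards on a bidirectional ring. The function names (ring_next, ring_prev, ring_hops) and the numbering of boards from 0 to n-1 are assumptions made for this example only; they are not part of the AXIOM interconnect design.

```c
#include <assert.h>

/* Hypothetical helpers: boards are numbered 0..n-1 around a ring. */

/* Neighbour reached through the "forward" link. */
int ring_next(int node, int n)
{
    assert(n > 0 && node >= 0 && node < n);
    return (node + 1) % n;
}

/* Neighbour reached through the "backward" link. */
int ring_prev(int node, int n)
{
    assert(n > 0 && node >= 0 && node < n);
    return (node + n - 1) % n;
}

/* Minimal number of hops between two boards on a bidirectional ring. */
int ring_hops(int src, int dst, int n)
{
    assert(n > 0 && src >= 0 && src < n && dst >= 0 && dst < n);
    int forward = (dst - src + n) % n;  /* hops when travelling forward  */
    int backward = n - forward;         /* hops when travelling backward */
    return forward < backward ? forward : backward;
}
```

In this sketch a ring occupies two of the three links of each board, so one link per board remains available for other connections.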
In AXIOM we will provide a powerful network interface (N/I), implemented in the FPGA region, that will efficiently support the communication protocols needed by the applications. Besides implementing an MPI-like communication library, we will support a (distributed) Shared Memory model with support from the OmpSs programming model, the Operating System, and the Runtime. One optimisation we will pursue is the efficient implementation of RDMA and remote-write operations as basic communication primitives visible at the application level.
IV. THE OMPSS PROGRAMMING MODEL
Several solutions have been proposed over the last decades to parallelize computations on multi-core systems. However, no consensus on the best solution has been reached. On the one hand, some solutions are based on message-passing mechanisms (e.g., MPI), which are too difficult to use for developers not accustomed to parallel programming. For example, existing legacy code cannot be easily parallelized using this approach. On the other hand, easier programming models have been proposed, but they usually rely on a shared memory subsystem (e.g., SMP systems).
AXIOM will leverage OmpSs, a task dataflow programming model that includes heterogeneous execution support as well as data and task dependency management [6] and has significantly influenced the recently appeared OpenMP 4.0 specification. In OmpSs, tasks are generated in the context of a team of threads that run in parallel. OmpSs provides an initial team of threads as specified by the user upon starting the application.
Tasks are defined as portions of code enclosed in the task directive, or as user-defined functions, also annotated as tasks, as follows:
#pragma omp task [clauses]
code-block or function declaration
A task is created when the code reaches the task construct, or when a call is made to a function annotated as a task. The task construct allows the programmer to specify, among others, the clauses in, out and inout. Their syntax is:
in(data-reference-list)
out(data-reference-list)
inout(data-reference-list)
The information provided is used to derive dependencies among tasks at runtime, and schedule/fire a task. Tasks are fired when their inputs are ready and their outputs can be generated.
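The following C sketch illustrates how functions annotated as tasks with in, out and inout clauses might look. The function names (vec_add, vec_scale, run_example) and the data are hypothetical, and the array-section notation a[0;n] (start 0, length n) should be checked against the OmpSs specification in use. A plain C compiler ignores the pragmas and runs the code sequentially, which gives the same result.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* vector length; an assumption for this sketch */

/* Function annotated as a task: reads a and b, writes c. */
#pragma omp task in(a[0;n], b[0;n]) out(c[0;n])
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Function annotated as a task: updates c in place. */
#pragma omp task inout(c[0;n])
void vec_scale(float *c, float k, size_t n)
{
    assert(c);
    for (size_t i = 0; i < n; ++i)
        c[i] *= k;
}

/* Creates two tasks; the second depends on the first through c. */
float run_example(void)
{
    float a[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[N] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[N];
    float sum = 0.0f;

    vec_add(a, b, c, N);     /* task 1: produces c            */
    vec_scale(c, 2.0f, N);   /* task 2: consumes and updates c */
#pragma omp taskwait         /* wait for both tasks before reading c */

    for (size_t i = 0; i < N; ++i)
        sum += c[i];
    return sum;
}
```

Here the runtime would derive that vec_scale must wait for vec_add because the out clause of the first task and the inout clause of the second name the same data.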
Dependencies are expressed by means of data-reference-lists. A data-reference in such a list can contain either a single variable identifier or references to subobjects. References to subobjects include array element references (e.g., a[4]
OmpSs is based on two main components: i) the Mercurium compiler, which takes C/C++ and FORTRAN code annotated with the task directives presented above and transforms the sequential code into parallel code with calls to the Nanos++ runtime system; and ii) the Nanos++ runtime system, which takes the information generated by the compiler about the parallel tasks to be run, manages the task dependences, and schedules the tasks on the available resources when they are ready. Nanos++ supports the execution of tasks on remote nodes and on heterogeneous accelerators.
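To make the idea of runtime dependence management more concrete, the sketch below derives dependences between tasks from their input and output references and decides whether a task may be fired. It is a simplified illustration and not the Nanos++ implementation: the task_desc structure, the function names and the comparison of references by plain address are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REFS 4

/* Hypothetical task descriptor: data references are plain addresses. */
typedef struct {
    const void *in[MAX_REFS];   /* data read (in and inout clauses)     */
    int n_in;
    const void *out[MAX_REFS];  /* data written (out and inout clauses) */
    int n_out;
} task_desc;

static bool contains(const void *const *set, int n, const void *ref)
{
    for (int i = 0; i < n; ++i)
        if (set[i] == ref)
            return true;
    return false;
}

/* True if 'later' must wait for 'earlier' (earlier comes first in program order). */
bool depends_on(const task_desc *later, const task_desc *earlier)
{
    for (int i = 0; i < earlier->n_out; ++i) {
        /* read-after-write or write-after-write */
        if (contains(later->in, later->n_in, earlier->out[i]) ||
            contains(later->out, later->n_out, earlier->out[i]))
            return true;
    }
    for (int i = 0; i < earlier->n_in; ++i) {
        /* write-after-read */
        if (contains(later->out, later->n_out, earlier->in[i]))
            return true;
    }
    return false;
}

/* A task can be fired once every earlier task it depends on has finished. */
bool is_ready(const task_desc *tasks, const bool *done, int idx)
{
    for (int i = 0; i < idx; ++i)
        if (!done[i] && depends_on(&tasks[idx], &tasks[i]))
            return false;
    return true;
}

/* Usage example: T0 writes x, T1 reads x and writes y, T2 only touches z.
 * Returns a bit mask of the unfinished tasks that may be fired (bit i = task i). */
int example_ready_mask(bool t0_done)
{
    static float x, y, z;
    task_desc tasks[3] = {
        { .n_in = 0, .out = { &x }, .n_out = 1 },                /* out(x)        */
        { .in = { &x }, .n_in = 1, .out = { &y }, .n_out = 1 },  /* in(x) out(y)  */
        { .in = { &z }, .n_in = 1, .out = { &z }, .n_out = 1 },  /* inout(z)      */
    };
    bool done[3] = { t0_done, false, false };
    int mask = 0;

    for (int i = 0; i < 3; ++i)
        if (!done[i] && is_ready(tasks, done, i))
            mask |= 1 << i;
    return mask;
}
```

Before T0 finishes, only T0 and the independent T2 can be fired; once T0 is done, T1 becomes ready as well.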
At the lower level, the AXIOM project will investigate and implement the OmpSs programming model on top of the following intra- and inter-node technologies:
• Intra-node: The most important target here is FPGA programmability support.
- OmpSs@FPGA, for easy exploitation of FPGA acceleration;
• Inter-node: In this case two different approaches can be followed, depending on the performance requirements, although they can be integrated in the same scenarios to work with different memory address spaces.
- OmpSs@cluster, for efficient parallel programming hiding message-passing complexities;
- OmpSs on a DSM-like paradigm, for easy parallelization of legacy code.
Figure 3 shows the overall view of the OmpSs@FPGA and OmpSs@cluster execution context in a multi-board system. Each FPGA-based node will be addressed by the OmpSs@FPGA support, while OmpSs@cluster will help to transparently program the whole multi-node system. Figure 4 shows the overall view of a DSM system, where OmpSs@FPGA would have the same intra-node role and OmpSs@cluster would appear as a single intra-node OmpSs running over a transparent DSM system.
A. OmpSs@FPGA
The OmpSs@FPGA ecosystem consists of the infrastructure for compilation, instrumentation and execution, from source code written in C/C++ to ARM binary and FPGA bitstream for Zynq. The compilation infrastructure provides support to (1) generate ARM binary code from OmpSs code that can run on the ARM-based SMP of the Zynq All-Programmable SoC, (2) extract the kernel of the part of the application to be accelerated in the FPGA, and (3) automatically generate a bitstream that includes the IP cores of the accelerator(s), the DMA engine IPs, and the necessary interconnection. In addition, the ARM binary can be instrumented to generate traces to be analyzed offline with the Paraver tool [7].
The runtime infrastructure should allow heterogeneous tasking on any combination of SMPs and accelerators, depending on the availability of the resources and the target devices.
Figure 5 shows the high-level compilation flow using our OmpSs@FPGA ecosystem. The OmpSs code is passed through the source-to-source compiler Mercurium [8], which includes a specialized FPGA compilation phase to process annotated FPGA tasks. For each of those tasks, it generates two C codes. One of them is a Vivado HLS annotated code (Vivado HLS is the Xilinx source-to-HDL tool) for the bitstream generation branch ("accelerator codes" box in Figure 5). The other is an intermediate host source code with OmpSs runtime (Nanos++) calls, generated for the software generation branch ("Host C code + Nanos++ runtime call" box in Figure 5). Both the hardware and the software generation branches are transparent to the programmer.
Figure 6 shows a matrix multiply example that has been annotated with OmpSs directives. This code shows a parallel tiled matrix multiply where each of the tiles is a task. Each of those tasks has two input dependences and an inout dependence that will be managed at runtime by Nanos++. Those tasks can be scheduled/fired on an SMP or an FPGA, as annotated in the target device directive, depending on resource availability. The copy_deps clause associated with the target directive tells the Nanos++ runtime to copy the data associated with the input and output dependences to/from the device when necessary.
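The structure described for Figure 6 can be sketched as follows. This is not the code of Figure 6 itself but a minimal reconstruction under stated assumptions: the names tile_multiply, tiled_matmul and example_element, the tile count NB, the tiled storage layout and the exact array-section notation in the clauses are all hypothetical. A plain C compiler ignores the pragmas and executes the tile products sequentially.

```c
#include <assert.h>

#define BS 2  /* tile size; an assumption for this sketch (the text uses 32 and 64) */
#define NB 2  /* tiles per matrix dimension; also an assumption */

/* One tile product c += a * b, annotated as a task for the SMP cores or the FPGA. */
#pragma omp target device(fpga, smp) copy_deps
#pragma omp task in(a[0;BS], b[0;BS]) inout(c[0;BS])
void tile_multiply(float a[BS][BS], float b[BS][BS], float c[BS][BS])
{
    assert(a && b && c);
    for (int i = 0; i < BS; ++i)
        for (int j = 0; j < BS; ++j)
            for (int k = 0; k < BS; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

/* Matrices are stored as NB x NB grids of BS x BS tiles; each call creates a task. */
void tiled_matmul(float A[NB][NB][BS][BS], float B[NB][NB][BS][BS],
                  float C[NB][NB][BS][BS])
{
    for (int i = 0; i < NB; ++i)
        for (int j = 0; j < NB; ++j)
            for (int k = 0; k < NB; ++k)
                tile_multiply(A[i][k], B[k][j], C[i][j]);
#pragma omp taskwait  /* wait for all tile tasks before returning */
}

/* Fills every element of a tiled matrix with the value v. */
static void fill(float M[NB][NB][BS][BS], float v)
{
    for (int i = 0; i < NB; ++i)
        for (int j = 0; j < NB; ++j)
            for (int r = 0; r < BS; ++r)
                for (int c = 0; c < BS; ++c)
                    M[i][j][r][c] = v;
}

/* Usage example: multiplies an all-ones matrix by an all-twos matrix
 * and returns one element of the result. */
float example_element(void)
{
    static float A[NB][NB][BS][BS], B[NB][NB][BS][BS], C[NB][NB][BS][BS];

    fill(A, 1.0f);
    fill(B, 2.0f);
    fill(C, 0.0f);
    tiled_matmul(A, B, C);
    return C[1][0][1][1];
}
```

The successive tile products that accumulate into the same tile of C are serialized through the inout dependence, while products for different tiles of C are independent and may run on different resources.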
Fig. 6 (fragment): ]) void matrix_multiply(int BS, float a[BS][BS], float b[BS][BS], float c[BS][BS]) { for (int ia
Fig. 7: Nanos++ distributed memory management organization (master node; slave 1 to slave N memories)
B. OmpSs@cluster
OmpSs@cluster is the OmpSs flavor that provides support for a single address space over a cluster of SMP nodes with accelerators. In this environment, the Nanos++ runtime system supports a master-worker execution scheme. One of the nodes of the cluster acts as the master node, where the application starts. On the remaining nodes where the application is executed, worker processes just wait for work to be provided by the master.
In this environment, the data copies generated by the in, out and inout task clauses are executed over the network connection across nodes, to bring data to the appropriate node where the tasks are to be executed.
Following the Nanos++ design, cluster threads are the components that allow the execution of tasks on worker nodes. These threads do not execute tasks themselves. They are in charge of sending work descriptors to their associated nodes and notifying when these have completed their execution. One cluster thread can take care of providing work to several worker nodes. In the current implementation, cluster threads are created only on the master node of the execution. Worker (slave) nodes cannot issue tasks for remote execution and thus do not need to spawn cluster threads.
In Nanos++, the device-specific code has to provide specific methods to transfer data from the host address space to the device address space, and the other way around. The memory coherence model required by OmpSs is implemented by two generic subsystems, the data directory and the data cache, explained below. Figure 7 shows how the different Nanos++ subsystems are organized to manage the memory of the whole cluster. The master node is responsible for keeping the memory coherent with the OmpSs memory coherence model, and also for offering the OmpSs single address space view. The master node memory is what OmpSs considers the host memory or host address space, and it is the only address space exposed to the application. The memory of each worker node is treated as a private device memory and is managed by the master node.
The data cache component manages the operations needed at the master node to transfer data to and from worker memories. There is one data cache for each address space present on the system. Operations performed in a data cache include allocating memory chunks, freeing them and transferring data from their managed address spaces to the host address space and the other way around. Data caches also keep the mapping of host memory addresses to their private memory addresses. Memory transfer operations are implemented using network transfers. Allocation and free operations are handled locally at the master node.
A memory reference may have several copies of its contents in different address spaces of the system. To maintain the coherence of the memory, the master node uses the data directory. It records where the last produced values of a memory reference are located. With it, the system can determine which transfer operations must be performed to execute a task on any node of the system. Also, each task execution updates the information in the data directory to reflect the newly produced data.
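The role of the data directory can be illustrated with the following C sketch. It is not the Nanos++ data structure: the dir_entry type, the function names and the bit-mask encoding are hypothetical, and the sketch assumes that every transfer goes through the host (master) address space, following the description of the data cache above.

```c
#include <assert.h>

#define MAX_SPACES 8  /* hypothetical limit on the number of address spaces */
#define HOST 0        /* address space 0 is the master (host) memory        */

/* Hypothetical directory entry for one memory reference. */
typedef struct {
    unsigned valid_mask;  /* bit s set: address space s holds the latest value */
} dir_entry;

/* Initially the latest value lives in host memory only. */
void dir_init(dir_entry *e)
{
    e->valid_mask = 1u << HOST;
}

/* Makes the latest value available in 'space' before a task reads it there.
 * Returns the number of network transfers required (0, 1 or 2). */
int dir_prepare_read(dir_entry *e, int space)
{
    assert(space >= 0 && space < MAX_SPACES);
    unsigned bit = 1u << space;
    unsigned host_bit = 1u << HOST;
    int transfers = 0;

    if (e->valid_mask & bit)
        return 0;                    /* copy already up to date */
    if (!(e->valid_mask & host_bit)) {
        transfers++;                 /* current owner -> host */
        e->valid_mask |= host_bit;
    }
    if (space != HOST) {
        transfers++;                 /* host -> requesting space */
        e->valid_mask |= bit;
    }
    return transfers;
}

/* Records that a task running in 'space' produced a new value. */
void dir_record_write(dir_entry *e, int space)
{
    assert(space >= 0 && space < MAX_SPACES);
    e->valid_mask = 1u << space;     /* every other copy is now stale */
}

/* Usage example: returns the total number of transfers for a short task sequence. */
int example_total_transfers(void)
{
    dir_entry e;
    int total = 0;

    dir_init(&e);
    total += dir_prepare_read(&e, 1);  /* host -> worker 1           : 1 */
    dir_record_write(&e, 1);           /* worker 1 produces new data     */
    total += dir_prepare_read(&e, 2);  /* worker 1 -> host -> worker 2: 2 */
    total += dir_prepare_read(&e, 2);  /* already valid on worker 2   : 0 */
    total += dir_prepare_read(&e, HOST); /* host copy refreshed above : 0 */
    return total;
}
```

Keeping one set of valid locations per memory reference is enough here to decide which transfers a task needs and to invalidate stale copies after a write.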
The implementation of the network subsystem is currently based on the active messages provided by the GASNet communications library. In the context of AXIOM, we will adapt the networking layer to the communications library provided for the Zynq platform.
C. OmpSs on DSM-like systems
DSM is a well-known research topic that has recently been revived [9]; it can be implemented either at the software or at the hardware level (with a full range of hybrid approaches). Several attempts at creating Software DSM implementations for Linux have been carried out over the last decades. Examples are Treadmarks (TMK), JIAJIA [10], Omni/SCASH [11], [12], Jump [13], [14], Parade [15], [16], and NanosDSM [17]. Some of these projects only supported very specific hardware, and none of them has been maintained during the last decade. We will work on the design and development of a proper, reliable and efficient mechanism to implement a DSM-like paradigm integrated in the Linux OS. The mechanism will run on the reference platform. It will make it possible to leverage the simplicity and scalability of the OmpSs framework on top of the AXIOM platform. It will be released as Open-Source software, and it is expected to bring benefits to both the ICT and the embedded industries.
V. OPERATING SYSTEM SUPPORT
The operating system will be Linux. We will investigate the possibility of integrating features in the OS to balance the work across the basic modules through the high-speed interconnection. Current solutions for work balancing across distributed systems are expensive, too specific, or too difficult to program (with paradigms such as MPI): finding an efficient solution is an intended outcome of the project. Particular attention will be given to scalability and latency issues, by implementing lock-free data structures. Another relevant aspect will be the need to properly manage events in real time.
The OS scheduler will be extended to enable it to distribute threads across the different computing elements. The low-level thread scheduler (LLTS [18], [19], [20], [21], [22], [23], [24]) will be accelerated in hardware, by mapping its structure onto the FPGA cards composing the evaluation platform. This will avoid bottlenecks in the scheduler, thus increasing the performance of parallel applications.
VI. PRELIMINARY EXPERIMENTS
We have carried out preliminary experiments, coding a set of benchmarks on the Zynq platform and making an initial evaluation of the programmability cost in terms of number of lines of code, as a measure of programming complexity.
A. Benchmarks description
We have used three benchmarks for a preliminary analysis of the ease of programming when using the OmpSs@FPGA infrastructure:
• Cholesky matrix decomposition, working on a dense matrix of 64x64 double-precision complex numbers.
• Covariance, working on arrays of 32-bit integer complex numbers.
• Matrix multiplication, working on a matrix of single-precision floating point values. We have used two different problem sizes for this benchmark: 32x32 and 64x64 matrix sizes.
We have implemented four different versions of each code: sequential code, pthread code, FPGA-accelerated code and OmpSs code. All versions of the codes consider the full Matrix Multiply, the full Cholesky, and the full Covariance as tasks.
B. Programmability analysis
We want to highlight the programmability benefits of our proposal. To this end, Table I shows the total number of additional lines of code for each of the different versions of the applications, compared to the sequential version: a pthread version running tasks only on one or two ARM cores (Pthread), a sequential version using one or two hardware accelerators (Accel), and the OmpSs version (OmpSs).
The Pthread and Accel versions require more additional lines than the OmpSs version. The number is especially high for the sequential versions using the hardware accelerators. For the Pthread version this is due to the additional calls to the Pthreads library needed to create, manage and join the pthreads. For the Accel version, this is because the application needs to call the low-level infrastructure to set up the communications layer with the FPGA and perform the actual data transfers back and forth to the FPGA hardware.
On the other hand, in the case of the OmpSs version, the thread management, the setup of the communications and the data transfers to and from the FPGA are all done internally by the Nanos++ runtime. This way, the programmer does not need to write any line of code related to low-level management, but only the directives triggering the communications.
Table I: Additional lines of code relative to a baseline C implementation, for the pthread version using the SMP cores only (Pthread), the sequential code using the hardware accelerators through the DMA lib (Accel), and the OmpSs version, for the four benchmark configurations.
Indeed, the current compilation and runtime infrastructure of the OmpSs programming model makes it possible to exploit the heterogeneous characteristics of the Zynq All-Programmable SoC with the sole effort of two directive lines. Note however that Table I indicates that the OmpSs version needs an additional third line. This line is a taskwait before the program ends, as can be observed in Figure 6. In fact, the code shown in Figure 6 is used to generate both the 32x32 and the 64x64 versions of the matrix multiplication, using all the available resources (ARM cores and FPGA), simply by redefining the BS variable as 32 or 64 elements.
For the Pthread and Accel versions, however, different block sizes need new scheduling schemes, adding more complexity to the transformation of the code. Indeed, implementing a fourth version of the code managing heterogeneous executions would require more development time and more additional lines than those shown in Table I.
VII. RELATED WORK
The AXIOM project will exploit the OmpSs dataflow features in the AXIOM heterogeneous architecture. OmpSs is the result of the integration of StarSs [25] and OpenMP.
In this Section we discuss some work that has been fundamental for the development of this project and that provided the necessary inspiration and vision to develop some basic concepts related to the dataflow execution model. Dataflow execution models have been studied for a long time [26], as they provide a simple and elegant way to efficiently move data from one computational thread to another [27], [28]. In the context of the TERAFLUX project [22], [23], such a dataflow model has been extended to multiple nodes executing seamlessly, thanks to the support of an appropriate memory model [19], [24]. In this memory model, a combination of consumer-producer patterns [20], [21] and transactional memory [29], [30] permits a novel combination of dataflow concepts and transactions in order to address consistency across nodes, where each node is assumed to be cache-coherent, i.e., like a classical multi-core. Dataflow models also allow the system to handle, in a distributed way, faults that may compromise a node [31], [32].
VIII. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented the software layers that we are developing in the AXIOM H2020 European Project. The main objective of the project is to bring to reality a novel small board which aims at becoming a very powerful basic brick of future interconnected and scalable embedded Cyber-Physical Systems; specifically, we focus on the application domains of Video-surveillance and Smart-home. The module consists of both hardware and software that will be designed and demonstrated in the project.
On the one hand, the target architecture will be a board based on a SoC with several ARM cores and an FPGA, like the Xilinx Zynq, with an Arduino interface for extensibility. The AXIOM system will comprise several such boards linked through custom communication links, providing application memory coherence at the software level. On the other hand, we will research ways to ease the programmability of the system, based on the OmpSs programming model and DSM-like techniques, to achieve a global system image for applications.
Our preliminary experiments have shown that the OmpSs programming model increases expressiveness with respect to serial or pthreads programming, thus allowing developers to focus on solving the issues related to the algorithms, instead of dealing with the low-level details of the communications among boards or the data transfers between the cores and the embedded FPGA.
The key features of the project presented in this paper are the possibility to modularly enhance the capabilities of the board, improve its interface with the physical world, and flexibly reconfigure it for accelerating specific functions, while providing energy efficiency and easy programmability.
Currently, we are in the process of designing a high-speed communications layer between boards. This communication will be implemented using the transceivers available in the Zynq SoC. We have also started looking at the application requirements to ensure that our platform fits their needs.
