Abstract: NanoStreams explores the design, implementation, and system software stack of micro-servers aimed at processing data in-situ and in real time. These micro-servers can serve the emerging Edge computing ecosystem, namely the provisioning of advanced computational, storage, and networking capability near data sources to achieve both low latency event processing and high throughput analytical processing, before considering off-loading some of this processing to high-capacity datacentres. NanoStreams explores a scale-out micro-server architecture that can achieve equivalent QoS to that of conventional rack-mounted servers for high-capacity datacentres, but with dramatically reduced form factors and power consumption. To this end, NanoStreams introduces novel solutions in programmable and configurable hardware accelerators, as well as the system software stack used to access, share, and program those accelerators. Our NanoStreams micro-server prototype has demonstrated 5.5× higher energy-efficiency than a standard Xeon server. Simulations of the micro-server's memory system extended to leverage hybrid DDR/NVM main memory indicated 5× higher energy-efficiency than a conventional DDR-based system.
I. INTRODUCTION
Instant access to data and real-time data analytics have the potential to catalyse knowledge acquisition, discovery, and responsiveness. Unfortunately, accessing and processing data with low latency and high bandwidth pushes the computational, storage, and networking resources available in high-capacity datacentres to their extremes, due to massive demand [1]. An important change in the computing ecosystem that can help alleviate the pressure of data volume and velocity from high-capacity datacentres is to process data at or near their sources, the so-called 'Edges' of the Internet.
NanoStreams is a European Seventh Framework Programme research project that explores the design, implementation and software stack of lean micro-servers that can ingest and process data in-situ at the edges of the network, with real-time guarantees. The project's vision is to achieve energy-efficient processing of concurrent data streams at or near the data sources, thus reducing the latency and energy footprint of real-time data analytics. The project explores a scale-out micro-server architecture to achieve QoS equivalent to that of more conventional rack-mounted servers for high-capacity datacentres, but with dramatically reduced form factors and power consumption. A distinguishing aspect of NanoStreams is that it builds real-silicon prototypes for field deployment. Indicatively, NanoStreams has already demonstrated a successful deployment of a micro-server at the Belfast Royal Victoria Hospital, in collaboration with the NHS and the local Health Trust, to monitor and analyse ICU respiratory data in real-time.
By placing more computational, storage and networking power near data sources, NanoStreams aims to achieve the low latency targets of modern analytics, off-load the brunt of data processing from warehouse scale datacentres, and enable emerging applications such as real-time video analytics, smart cities, grids and buildings, and 5G mobile data analytics. To achieve the aspirations of the project, NanoStreams adopts a hardware-software co-design approach that is common in the embedded systems domain, combined with an HPC system software stack [2] . NanoStreams innovates in multiple areas of the emerging micro-server ecosystem, by proposing:
• A new Analytics-on-Chip (AoC) architecture that enhances the computational capacity of micro-servers with configurable accelerators; the accelerators are based on the Nanocore (Section II), a configurable tiny core for FPGAs that is directly programmable in C.
• A novel bare-metal Ethernet networking infrastructure, the nanowire (Section III), which achieves low latency access and sharing of accelerators without affecting the host processor architecture [3].
• A scalable streaming programming model (Section IV-A) embedded within domain-specific sequential programming environments, such as database environments and graph processing tools [2].
• Non-volatile memory technology for the micro-servers, accessible directly as 'main memory' (Section IV-B) [4].
• Workload-specific optimisation using the concept of iso-quality of service (iso-QoS), applied to three use cases from the healthcare, capital markets, and business analytics sectors (Section V) [5], [6].
• A range of new methods to fairly compare the efficiency of server architectures (Section VI) and scale these architectures on demand to meet workload QoS requirements [5], [6].
NanoStreams advances the state of the art in micro-servers in several ways, by: (a) adding application-specific but programmable hardware accelerators to micro-servers, as opposed to existing solutions that use elaborate hardware design flows and target a single algorithm [7]; (b) providing general purpose low latency networking to access accelerators in the datacentre, as opposed to custom fabrics [8]; (c) effectively integrating streaming and accelerator-aware programming models into domain-specific software stacks, moving one step ahead of ongoing efforts to unify heterogeneous programming models [9]; (d) significantly improving the energy-efficiency of micro-servers via on-demand and QoS-aware scale-out and acceleration [5], [6].
The NanoStreams micro-server prototype has demonstrated 5.5× higher energy-efficiency than a standard Xeon server (Section VI). Simulations of the NanoStreams hybrid DDR/NVM memory system indicate 5× higher energy-efficiency than a conventional DDR-based system.
II. NANOCORES: PROGRAMMABLE ENERGY-EFFICIENT ACCELERATION
The NanoStreams Analytics-on-Chip (AoC) architecture targets low latency stream processing of compute intensive tasks. It is an amalgam of low-power RISC processors from the embedded systems domain and Nanocores, a new class of programmable compute units. The AoC processor is a heterogeneous SoC that reduces latency in processing of streaming operators issued to the micro-server with the latency-optimised RISC cores, while improving analytical processing throughput on compute and data intensive tasks with the Nanocores. The AoC architecture is currently implemented on the Xilinx Zynq-7000 family, which offers SoC integration of a dual-core general purpose ARM Cortex A9 based processing system (PS) and 28 nm Field Programmable Gate Array (FPGA) programmable logic (PL). This level of integration reduces the communication latency between RISC cores and Nanocores and enables efficient on-chip parallelisation of analytical tasks.
Besides using the FPGA as the means to do architectural exploration for the AoC processor, our choice is also motivated by the excellent power efficiency [10], [11] of FPGAs compared to other accelerators such as GPUs. The challenge is to design an architecture that is generic yet fast and able to accelerate a wide class of analytics, while at the same time promoting programmability and hardware specialisation through configurability, and reducing energy consumption.
A. Nanocores
Nanocores are a new class of programmable and configurable processors. The single core is designed to allow easy integration with a number of other Nanocores, which may not necessarily have the same feature set, to form a multicore platform. The multicore platform improves analytical processing throughput by exposing the parallel computational capabilities of the underlying hardware to the software domain and is sufficiently flexible to allow the acceleration of a wide range of applications.
Our Nanocore prototype supports 32-bit and 64-bit fixed point arithmetic. This configuration has been selected to demonstrate a key benefit over existing 32-bit only FPGA soft-core microprocessors (e.g. Xilinx Microblaze and Altera Nios II) and to utilise the fixed point DSP capability of the current generation of FPGAs. Besides word size, the Nanocore instruction set is also customisable at build time. The default instruction set is Turing complete and additional application-specific instructions can be added as required to provide improved performance. The key aspect of the approach is to optimise the underlying FPGA hardware to allow the creation of a light core which operates substantially faster than existing FPGA-based cores.
1) Memory Hierarchy: The block diagram of a single Nanocore is given in Fig. 1. In order to minimise latency, the input and output memories are configured as first-in-first-out buffers (FIFOs). There are 16 registers (R0-R15) within each processor. Each core has an area of read/write addressable memory for storing intermediate calculation results and for use as a stack (the scratch memory). This ensures that each core is capable of relatively complex behaviour and enables the core to support operations where data has to be spilled out of the internal registers.
2) Instruction Word: The initial implementation of Nanocore operates on a 32-bit instruction word; whilst the Zynq block RAM allows 36-bit instruction words, four bits are kept free for future use. The instruction set has been designed to allow the core to execute four operations within a single clock cycle: input read, output write, always jump and either a constant load or another instruction. This multi-operation execution in a single instruction reduces the number of instructions required for tight loops, resulting in increased throughput of stream processing.
3) Hardware Resources and Power Consumption: Table I summarises the resource usage for one Nanocore. The goal of the project is to make the core as lightweight as possible with the minimum instruction set required for the application domain. The per-core power estimates quoted were observed over a long period of time for various programs. These numbers are important for estimating the scale-out performance of Nanocores on larger devices. Fig. 2 graphs the achievable maximum frequency of Nanocores against the number of cores for the 32-bit and 64-bit variants. These results have been obtained using the Xilinx Vivado 2014.3.1 suite with implementation defaults and with a clock constraint of 312.5 MHz. Most of these results exceed this frequency, because the implementation stops optimising once the constraint is met. This means that the drop in frequency for the 32-bit Nanocore in the dual-core design can be ignored.
4) Multicore Architecture: Our prototype supports Single Instruction Multiple Data (SIMD) processing, with each Nanocore operating on a different burst of data and returning a burst of results. Individual Nanocores need not have visibility of the global, system data flow to operate. Instead, we include additional elements in our hardware architecture to control the input and output data streams and implement run-time configurability. We call them scatter and gather modules (Fig. 3) and they also control multiplexing and demultiplexing of data from or to different sources. This enables the NanoStreams goal of operating seamlessly on streaming data and achieving high throughput. The existing architecture can be configured at build time to support a variable number of Nanocores with different word sizes and different instruction sets. The Nanocore design will also allow run-time configuration of inter-core data routing, providing the capability for the AoC architecture to dynamically adjust to workloads.
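The scatter and gather behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the core count, the burst length, the squaring kernel and all function names are hypothetical, and the cores are simulated sequentially on a CPU, whereas the hardware runs them in parallel on the FPGA fabric.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 4 /* hypothetical number of Nanocores */
#define BURST_LEN 8 /* hypothetical burst size in words */

/* The single program that every simulated core runs (SIMD): square a word. */
static int64_t kernel(int64_t x) { return x * x; }

/* One simulated core: drain its input FIFO and fill its output FIFO. */
static void core_run(const int64_t *in, int64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = kernel(in[i]);
}

/* Scatter consecutive bursts of the stream to the cores, run every core on
 * its own burst, then gather the result bursts back in stream order. */
void scatter_gather(const int64_t *stream, int64_t *result, size_t len) {
    int64_t in_fifo[NUM_CORES][BURST_LEN];
    int64_t out_fifo[NUM_CORES][BURST_LEN];
    size_t count[NUM_CORES];
    size_t off = 0;

    assert(stream != NULL && result != NULL);
    while (off < len) {
        size_t pos = off; /* where this round's results start */

        /* Scatter: core c receives the next burst (possibly short or empty). */
        for (int c = 0; c < NUM_CORES; c++) {
            size_t n = 0;
            while (n < BURST_LEN && off < len) in_fifo[c][n++] = stream[off++];
            count[c] = n;
        }
        /* Compute: no core needs to know about the global data flow. */
        for (int c = 0; c < NUM_CORES; c++)
            core_run(in_fifo[c], out_fifo[c], count[c]);
        /* Gather: concatenate the result bursts in core order. */
        for (int c = 0; c < NUM_CORES; c++)
            for (size_t i = 0; i < count[c]; i++) result[pos++] = out_fifo[c][i];
    }
}
```

Calling `scatter_gather(in, out, n)` on an array of `n` words returns the squared words in the original order, regardless of how the bursts were distributed over the simulated cores.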
5) Programmability: The general purpose ARM core works on low latency transactional processing tasks and offloads analytical tasks to the accelerators. The exact distribution of functionality can vary for different applications and can be programmed accordingly using the NanoStreams high level programming model. The ARM core also acts as master for controlling the data-flow through the Nanocore fabric, as well as having the ability to program and reprogram the Nanocores. A library has been written to support communication and control of Nanocores via the ARM processor. This library provides functions such as initialisation, programming and control of transfer of data to and from the multicore architecture.
A beta version of a C99-compliant compiler for Nanocore has been developed by ACE using the CoSy compiler development system. Along with compiling C to the default set of 26 instructions supported by Nanocore, the compiler provides support for the Nanocore special instructions for high speed input read and output write.
Finally, to support faster time-to-market, we develop a testing and development platform, hardware-in-loop (HIL). HIL leverages ARM-Nanocore control and the communication library for its functionality. It runs on the ARM core to stimulate a real Nanocore running on the FPGA. This testing application interface allows the programmer to access the Nanocore's data and control interfaces through wrapper functions. These functions give a high level of access by abstracting the hardware interface around the Nanocore and allow for significantly faster hardware verification in real-time.
III. NANOWIRE: BARE METAL ETHERNET
Nanowire is our communication protocol among hosts and accelerators. Besides efficiency, Nanowire primarily aims to enable shared accelerators, e.g., at the node level or rack level for cost reasons, and also to decouple accelerators from the host/server technology cycle, as accelerators evolve at a different pace compared to servers.
Ethernet interconnects have dominated the datacentre. Ethernet performance has been constantly scaling with technology, and it is widely used by servers. For these reasons, we base Nanowire on raw Ethernet. Fig. 4 provides an overview of the overall architecture and design but also identifies the datapaths established between the host and accelerator nodes during an end-to-end communication transaction. Next, we discuss the main aspects of Nanowire in more detail.
A. Nanowire Protocol
The main functions of the Nanowire communication substrate are: (1) a simple and convenient API to virtualise and manage accelerators; (2) reliable, high-throughput and low-latency transfers; and (3) minimal host-CPU overheads while supporting high concurrency. Nanowire (Fig. 4) is composed of two abstractions: the Host-Accelerator Transport (HAT) layer that handles networking aspects and the Task Issue Protocol (TIP), a task queue layer that issues task requests from the hosts and receives task results from the accelerators.
HAT provides the network tier of NanoStreams and offers a common abstraction of the network level services and I/O primitives to both the host and accelerator nodes. HAT code runs on both sides of the interconnect and although the host side makes use of the host OS services, on the accelerator side, HAT runs on more diverse platforms. In our prototype, the accelerator side of HAT is implemented as custom firmware directly on top of the A9 core in the Processing System (PS) of the FPGA card.
HAT allows multiple hosts to share the same accelerator, supports variable size packets (up to the Ethernet MTU size), and also supports reliable transmission. Packets can be switched via Ethernet switches; however, HAT provides the ability to further customise the Ethernet header for efficiency when switching is not required.
Finally, HAT provides lightweight connection-less channels as the lowest-level communication. A channel consists of a point-to-point unidirectional queue of packet slots used for communication between a source host node and a destination accelerator node. Resources per channel, e.g., slot size, are chosen at creation time. Channels aim at providing a low-overhead and low-latency communication path, while allowing the system to tune the allocation of resources for each channel.
On the host side, HAT implements channels based on user-kernel shared memory to eliminate expensive system calls and achieve low latency, similar to the approach taken in [12]. This approach has been examined in the past when designing low-latency networks [13], [14], [15], [16] but ultimately proved inefficient due to the cost of selective spinning for "fat" cores. However, the trend towards thin and tightly packed micro-servers makes this approach attractive.
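A channel of this kind can be sketched as a single-producer, single-consumer ring of fixed-size packet slots. The code below is an illustrative model only, not the HAT implementation: the type and function names are hypothetical, the ring lives in ordinary heap memory, and the 1500-byte bound is simply the standard Ethernet payload size, assumed here as the upper limit for a slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical channel: a unidirectional queue of fixed-size packet slots. */
typedef struct {
    size_t slot_size;   /* payload bytes per slot, chosen at creation time */
    size_t num_slots;   /* queue depth, chosen at creation time */
    atomic_size_t head; /* count of packets consumed (receiver side) */
    atomic_size_t tail; /* count of packets produced (sender side) */
    uint16_t *lengths;  /* number of valid bytes in each slot */
    uint8_t *slots;     /* num_slots * slot_size bytes of packet storage */
} channel_t;

channel_t *channel_create(size_t num_slots, size_t slot_size) {
    assert(num_slots > 0 && slot_size > 0 && slot_size <= 1500);
    channel_t *ch = malloc(sizeof *ch);
    if (ch == NULL) return NULL;
    ch->slot_size = slot_size;
    ch->num_slots = num_slots;
    atomic_init(&ch->head, 0);
    atomic_init(&ch->tail, 0);
    ch->lengths = calloc(num_slots, sizeof *ch->lengths);
    ch->slots = malloc(num_slots * slot_size);
    if (ch->lengths == NULL || ch->slots == NULL) {
        free(ch->lengths);
        free(ch->slots);
        free(ch);
        return NULL;
    }
    return ch;
}

void channel_destroy(channel_t *ch) {
    if (ch == NULL) return;
    free(ch->lengths);
    free(ch->slots);
    free(ch);
}

/* Sender side: returns 0 on success, -1 if the packet is too large or the
 * queue is full. Never blocks, so the caller decides whether to spin. */
int channel_send(channel_t *ch, const void *pkt, size_t len) {
    if (len > ch->slot_size) return -1;
    size_t tail = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&ch->head, memory_order_acquire);
    if (tail - head == ch->num_slots) return -1; /* all slots in use */
    size_t idx = tail % ch->num_slots;
    memcpy(ch->slots + idx * ch->slot_size, pkt, len);
    ch->lengths[idx] = (uint16_t)len;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&ch->tail, tail + 1, memory_order_release);
    return 0;
}

/* Receiver side: returns the packet length, or -1 if the queue is empty or
 * the caller's buffer is too small. */
long channel_recv(channel_t *ch, void *buf, size_t cap) {
    size_t head = atomic_load_explicit(&ch->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&ch->tail, memory_order_acquire);
    if (head == tail) return -1;
    size_t idx = head % ch->num_slots;
    size_t len = ch->lengths[idx];
    if (len > cap) return -1;
    memcpy(buf, ch->slots + idx * ch->slot_size, len);
    /* Release the slot back to the sender. */
    atomic_store_explicit(&ch->head, head + 1, memory_order_release);
    return (long)len;
}
```

Fixing the slot size and depth at creation keeps the send and receive paths free of allocation, which is one plausible way to obtain the low-overhead path the text describes.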
TIP provides the runtime system with the ability to transparently issue tasks to the accelerators without any knowledge of the underlying network infrastructure or the accelerators themselves. TIP implements a simple client-server protocol to decouple analytics kernel invocation from execution: it utilises HAT channels to enqueue kernel service requests to remote accelerator nodes and retrieve completions/replies. TIP uses a task descriptor to identify the service/kernel to be executed on the remote accelerator node and the corresponding completion notification and result. It supports both blocking and non-blocking interfaces. The protocol is designed to perform end-to-end flow-control and reliable transmission via message retransmission whenever an error is detected. For error detection, we use the checksum mechanism available in Ethernet NICs.
B. Preliminary performance analysis
Fig. 5 illustrates preliminary results for the current prototype of Nanowire. Round trip latency is about 33 μs for the full path, including the network links and NICs. About 11.5 μs is the cost of the protocol on the accelerator (ARM core), 15.6 μs is the host overhead (8.9 μs in the issue path and 6.7 μs in the receive path), and about 6 μs is spent on the network interfaces and the wire.
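The round-trip latency breakdown reported for the Nanowire prototype (Fig. 5) can be tallied with a few lines of C. The figures are the ones quoted in the text; the struct and function names are hypothetical, and the wire and NIC figure is only given as "about 6 μs", so the total is approximate.

```c
#include <assert.h>

/* Components of one Nanowire round trip, in microseconds. */
typedef struct {
    double host_issue_us;   /* host overhead in the issue path */
    double host_receive_us; /* host overhead in the receive path */
    double accel_us;        /* protocol cost on the accelerator ARM core */
    double wire_nic_us;     /* network interfaces and the wire (approximate) */
} latency_budget_t;

/* The preliminary figures quoted in the text. */
latency_budget_t nanowire_reported_budget(void) {
    latency_budget_t b = { 8.9, 6.7, 11.5, 6.0 };
    return b;
}

double host_overhead_us(const latency_budget_t *b) {
    assert(b != 0);
    return b->host_issue_us + b->host_receive_us;
}

double round_trip_us(const latency_budget_t *b) {
    assert(b != 0);
    return host_overhead_us(b) + b->accel_us + b->wire_nic_us;
}

/* Share of one component in the total, as a percentage. */
double share_percent(double part_us, double total_us) {
    assert(total_us > 0.0);
    return 100.0 * part_us / total_us;
}
```

With these inputs the host overhead sums to 15.6 μs and the round trip to about 33.1 μs, consistent with the reported 33 μs; the host side accounts for a little under half of the total, which makes it a natural first target for optimisation.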
We are currently working on adaptive interrupt and context switching reduction techniques [17] via interrupt coalescing or polling (NAPI) in the Ethernet driver and the NIC itself. Such techniques are particularly effective when there are concurrent tasks in the system. Additionally, we would like to evaluate in more detail the impact of the shared structure between user and kernel space on system performance and to examine alternatives, as well as to use a custom path in the Linux kernel avoiding the overhead of passing through the netdev interface.
C. Summary
Nanowire provides an efficient, transparent, and flexible transport between hosts and accelerators. It implements reliable, low-latency, high-throughput, and low-overhead communication channels between the host runtime and shared, application specific cores. We envision that Nanowire will be used to connect accelerators to hosts, both at the board and the rack levels.
IV. NANOSTREAMS STREAMING PROGRAMMING MODEL
We focus our discussion on two key aspects of the NanoStreams programming model and runtime environment: faithful C language extensions for supporting hybrid analytical-transactional applications on streaming data; and memory management in heterogeneous, non-volatile memories, which we consider as a viable and sustainable pathway to extend the memory capacity of future micro-servers.
A. Programming Model
NanoStreams proposes simple dataflow extensions to C to support a streaming parallel programming model, where tasks and the dataflow between them are explicitly identified via code annotations. We apply minimalistic and faithful extensions of the C language for explicit parallelisation and seamless scaling, considering that the programming model is deployed to support domain-specific programming environments such as databases and graph processing tools, as opposed to general purpose parallel programming. This approach also fills a vital gap in the landscape of existing parallel high-level languages, namely the lack of a very basic C language extension giving the programmer full control over parallelism.
Stream parallel programming deconstructs a program into multiple kernels linked together via their input and output streams into a graph representing an algorithm. Kernels at each level of the graph may execute independently. Data driven applications with fairly regular computation tend to be a good fit for this programming model. Underpinning its programming model, the project provides a common runtime environment, the nanoruntime, suitable for elastic scaling of core provisioning and control of load balancing and data access locality between threads.
The key to exploiting the existing C language lies in having fully referentially transparent code omitting use of pointers, global variables and variable aliasing through function calls. Full use of the type system addresses issues of data locality, bandwidth, and memory hierarchy. Furthermore, by abandoning the monolithic single memory space, we can employ more appropriate and efficient memory management schemes.
We have applied our prototype implementation to our financial and healthcare use cases (Section V). We have implemented a library that enables the programmer to produce a directed acyclic graph expressed in C, to be compiled with gcc using pthreads for producing a multi-threading binary.
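The stream-parallel decomposition described in this section can be illustrated with plain C and pthreads. The sketch below is not the NanoStreams library or its annotation syntax, neither of which is shown here; all names are hypothetical. It wires three kernels (a source, a squaring stage and a summing sink) into a linear graph connected by bounded streams, with one thread per kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define STREAM_CAP 4 /* hypothetical stream depth */

/* A bounded stream connecting two kernels. */
typedef struct {
    long buf[STREAM_CAP];
    size_t head, count;
    int closed; /* set by the producer when no more data will arrive */
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
} stream_t;

static void stream_init(stream_t *s) {
    s->head = s->count = 0;
    s->closed = 0;
    pthread_mutex_init(&s->mu, NULL);
    pthread_cond_init(&s->not_empty, NULL);
    pthread_cond_init(&s->not_full, NULL);
}

static void stream_push(stream_t *s, long v) {
    pthread_mutex_lock(&s->mu);
    while (s->count == STREAM_CAP) pthread_cond_wait(&s->not_full, &s->mu);
    s->buf[(s->head + s->count) % STREAM_CAP] = v;
    s->count++;
    pthread_cond_signal(&s->not_empty);
    pthread_mutex_unlock(&s->mu);
}

static void stream_close(stream_t *s) {
    pthread_mutex_lock(&s->mu);
    s->closed = 1;
    pthread_cond_broadcast(&s->not_empty);
    pthread_mutex_unlock(&s->mu);
}

/* Returns 1 and stores a value, or 0 once the stream is closed and drained. */
static int stream_pop(stream_t *s, long *v) {
    pthread_mutex_lock(&s->mu);
    while (s->count == 0 && !s->closed) pthread_cond_wait(&s->not_empty, &s->mu);
    int ok = s->count > 0;
    if (ok) {
        *v = s->buf[s->head];
        s->head = (s->head + 1) % STREAM_CAP;
        s->count--;
        pthread_cond_signal(&s->not_full);
    }
    pthread_mutex_unlock(&s->mu);
    return ok;
}

/* Each kernel sees only its own input and output streams. */
typedef struct { stream_t *in, *out; long n, result; } kernel_arg_t;

static void *source_kernel(void *p) {
    kernel_arg_t *a = p;
    for (long i = 1; i <= a->n; i++) stream_push(a->out, i);
    stream_close(a->out);
    return NULL;
}

static void *square_kernel(void *p) {
    kernel_arg_t *a = p;
    long v;
    while (stream_pop(a->in, &v)) stream_push(a->out, v * v);
    stream_close(a->out);
    return NULL;
}

static void *sum_kernel(void *p) {
    kernel_arg_t *a = p;
    long v;
    a->result = 0;
    while (stream_pop(a->in, &v)) a->result += v;
    return NULL;
}

/* Build the graph source -> square -> sum and run one thread per kernel. */
long run_graph(long n) {
    stream_t s1, s2;
    pthread_t t[3];
    assert(n >= 0);
    stream_init(&s1);
    stream_init(&s2);
    kernel_arg_t src = { NULL, &s1, n, 0 };
    kernel_arg_t sq = { &s1, &s2, 0, 0 };
    kernel_arg_t sink = { &s2, NULL, 0, 0 };
    pthread_create(&t[0], NULL, source_kernel, &src);
    pthread_create(&t[1], NULL, square_kernel, &sq);
    pthread_create(&t[2], NULL, sum_kernel, &sink);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return sink.result;
}
```

Because the kernels share no state other than their streams, they can run independently, which is the property the programming model relies on. The example would typically be compiled with `gcc -pthread`.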
B. Memory Management
Extending the memory capacity of micro-servers is challenging both because of grim DDR scaling projections and because micro-servers are fundamentally power and area limited. We are exploring the use of non-volatile memory technologies as a pathway to extend the memory capacity of micro-servers with virtually no additional static power budget and controlled performance cost. The NanoStreams programming model exposes hybrid main memory composed of conventional and Non-Volatile RAM (NVM) directly to the system software for data placement and memory management. We are designing a system interface where the DRAM and NVM chips are assigned distinct physical address regions. This complies with how the BIOS reports DIMMs and their physical address ranges to the operating system. The operating system can then choose to allocate virtual memory pages on either type of memory, using either hints from the user level or an automated kernel-level policy. Integrating this hybrid design into the memory allocation pool of the operating system enables DRAM to act as a software-controlled cache while NVM is used as energy-efficient secondary storage. Compared to a hardware-controlled caching design, NanoStreams software-controlled hybrid memory provides more flexibility and more room for optimisation. It empowers the operating system, enhanced by user-level hints, to implement system-level allocation policies that reduce energy consumption.
User-level management of hybrid main memory is possible through extending memory allocation functions (mmap, malloc) in C to choose NVM or DRAM as the allocation target. We intend to extend other memory interfaces too, such as numactl. Our modified memory allocation functions implement a default allocation on NVM. We furthermore extend the linker file format to provide two versions of each data segment. For instance, our extended ELF format includes a .bss_hotmem segment which holds frequently accessed, zero-initialised data destined for DRAM, while .bss holds cold data to be stored in NVM without performance penalty. Annotating data in the source code with attributes, such as __attribute__((section("bss_hotmem"))), specifies global variable placement. We also provide an object migration function between DRAM and NVM, where the programmer may allocate a new copy of an object on the opposite memory type.
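The user-level interface described above might look as follows. This is a sketch under stated assumptions rather than the NanoStreams API: the `hm_*` functions, the `mem_kind_t` type and the `HOTMEM` macro are hypothetical, both memory types are simulated with ordinary `malloc`, and freeing the original object on migration is a choice made here for simplicity. Only the `section` attribute itself is standard GCC/Clang syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { MEM_NVM = 0, MEM_DRAM = 1 } mem_kind_t;

/* On ELF targets, place a global in a separately named section so that a
 * loader could map it to DRAM. Expands to nothing elsewhere. */
#if defined(__ELF__) && defined(__GNUC__)
#define HOTMEM __attribute__((section("bss_hotmem")))
#else
#define HOTMEM
#endif

/* Frequently accessed, zero-initialised data annotated as hot. */
static long hot_counters[16] HOTMEM;

long touch_hot(size_t i) { return ++hot_counters[i % 16]; }

/* Per-object header recording the requested memory type and size.
 * The union keeps the payload suitably aligned for any object. */
typedef union {
    struct { mem_kind_t kind; size_t size; } h;
    max_align_t align;
} hm_header_t;

/* Allocate on the requested memory type (simulated with malloc). */
void *hm_malloc(size_t size, mem_kind_t kind) {
    hm_header_t *hdr = malloc(sizeof *hdr + size);
    if (hdr == NULL) return NULL;
    hdr->h.kind = kind;
    hdr->h.size = size;
    return hdr + 1;
}

/* Default placement is NVM, as stated in the text. */
void *hm_malloc_default(size_t size) { return hm_malloc(size, MEM_NVM); }

mem_kind_t hm_kind(const void *p) {
    assert(p != NULL);
    return ((const hm_header_t *)p - 1)->h.kind;
}

void hm_free(void *p) {
    if (p != NULL) free((hm_header_t *)p - 1);
}

/* Migrate an object by allocating a copy on the opposite memory type.
 * Returns the new pointer, or NULL (leaving the original intact) on failure. */
void *hm_migrate(void *p) {
    assert(p != NULL);
    hm_header_t *old = (hm_header_t *)p - 1;
    mem_kind_t target = (old->h.kind == MEM_DRAM) ? MEM_NVM : MEM_DRAM;
    void *q = hm_malloc(old->h.size, target);
    if (q == NULL) return NULL;
    memcpy(q, p, old->h.size);
    hm_free(p);
    return q;
}
```

A caller would allocate cold objects with `hm_malloc_default` and promote one to DRAM with `hm_migrate` once profiling shows it is hot, updating any pointers to the old copy.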
From the OS perspective, allocating memory on a hybrid memory system is similar to allocating memory in a non-uniform memory architecture (NUMA). Every NUMA region is split into a DRAM region and an NVM region. As such, the OS and system libraries utilise the same memory allocation algorithm for either type of memory. Moreover, virtual memory management is the same for DRAM and NVM, unmodified in comparison to a NUMA system. Using the system support described here, we have developed an LLVM compiler framework to instrument programs in order to profile the access patterns to all allocated objects, and selectively place objects in NVM or DDR. A limitation is that operating system implementations of NUMA-aware page allocation and migration may contradict the programmer's choice. For example, Linux will not keep track of the NUMA preference of swapped-out pages and may swap them into a NUMA partition that is not in the memory type and region requested by the programmer. Moreover, there may be conflicting constraints when mapping pages into multiple virtual address spaces. These issues are the subject of ongoing study.
We compare our hybrid memory allocation both to DDR-only and NVM-only designs. DDR-only designs incur a high energy overhead for storage compared to NVM-only ones but they have a relatively lower energy cost when accessing data objects. Our hybrid memory allocation and policies have resulted in up to a 5× reduction in energy for workloads emerging from column-oriented, key-value data stores [4], [18].
V. USE CASES
This section outlines three commercial use cases of the NanoStreams co-designed micro-server and software stack.
A. Reconfigurable compute in a volatile market facing infrastructure
FPGAs have become a byword for compute acceleration in financial services. However, their adoption is far from straightforward and many FPGA-based products have failed to make a lasting impact. One of the most significant constraints to GPU and FPGA adoption in Financial Services, and in particular, in the performance critical area of front office trading systems, is the extended software development lifecycle [19] .
The trading domain is a fast moving area where market behaviour can change overnight, where new regulations can be imposed with short notice and thus, where agility and adaptability are important contributors to both the financial and reputational risk exposure of the firm.
Bacon et al. [20] set out the problems faced when exposing FPGAs to conventional development techniques. In particular, they highlight the observable trend towards tighter integration of heterogeneous compute capability: first GPU-on-die and proposed FPGA-on-die packages from Intel, and then the proliferation of ARM cores integrated alongside gates in commodity FPGA boards. Such close integration in hardware makes it clear that a universal way to use these resources is essential, to avoid requiring coordination among different developers with different skill sets.
NanoStreams embodies a number of key technological innovations that seek to address these challenges. We use the model of option pricing, a problem commonly solved by highly parallel models, to demonstrate that the combination of microserver, networking and parallelisation tools of NanoStreams can be harnessed seamlessly. Using the nanocore abstraction, we write code in common high-level languages that is executable both on CPUs and on the nanocore FPGA, allowing reduced hardware and energy footprints without increasing maintenance cost. NanoStreams further demonstrates the allocation of tasks to compute resources by the runtime controller, allowing a choice of execution platform to be made, with the Nanowire interconnect connecting the distributed resources seamlessly.
B. Real-time Graph Analytics at Scale
Graph analytics is fast becoming a key Big Data workload. Graph algorithms have revolutionised the way we interact with the internet, powering for instance search services, e.g. PageRank. Recently, more complicated algorithms that aim to give a far deeper analysis of the underlying graph structures and properties have been emerging. The main driving reason is the use of complex knowledge graphs as objects to represent knowledge and allow deep queries.
In all of these cases, the basic kernel relies on sparse matrix-vector or sparse matrix-matrix multiplication. Although these kernels are not new in scientific computing, the characteristics of the new workloads make the sparse matrices exhibit very different sparsity patterns from those in engineering and scientific applications. These lead to extremely irregular memory access patterns that do not typically follow a clear application-driven structure. This drives the computational intensity to even lower levels than in traditional applications. Therefore, acceleration of particular kernels on customised, low energy platforms is a very appealing path.
A general roadmap that we follow in NanoStreams is to target the computation of a few characteristic features of the sparse matrices and the calibration of prediction models for energy and time [21]. The lessons learned and calibrated models serve as input for the co-design approach that NanoStreams follows to customise the architecture and instruction set of the Nanocores, as well as scale resources on demand.
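For reference, the sparse matrix-vector kernel discussed here can be written in a few lines of C over the widely used compressed sparse row (CSR) layout. The sketch is generic and is not the NanoStreams kernel; the type and function names are hypothetical, and the `max_row_nnz` helper is just one example of a simple sparsity feature, chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compressed sparse row (CSR) matrix. */
typedef struct {
    size_t rows;
    const size_t *row_ptr; /* rows + 1 entries; row r spans [row_ptr[r], row_ptr[r+1]) */
    const size_t *col_idx; /* column index of each stored nonzero */
    const double *val;     /* value of each stored nonzero */
} csr_t;

/* y = A * x. The indirect read x[col_idx[k]] is the source of the irregular
 * memory accesses: its address depends on the sparsity pattern of the graph. */
void spmv(const csr_t *a, const double *x, double *y) {
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t r = 0; r < a->rows; r++) {
        double sum = 0.0;
        for (size_t k = a->row_ptr[r]; k < a->row_ptr[r + 1]; k++)
            sum += a->val[k] * x[a->col_idx[k]];
        y[r] = sum;
    }
}

/* An illustrative sparsity feature: the largest number of nonzeros in a row. */
size_t max_row_nnz(const csr_t *a) {
    assert(a != NULL);
    size_t best = 0;
    for (size_t r = 0; r < a->rows; r++) {
        size_t n = a->row_ptr[r + 1] - a->row_ptr[r];
        if (n > best) best = n;
    }
    return best;
}
```

Each stored nonzero contributes one multiply-add but needs a value, a column index and a gathered vector element, which is why the computational intensity of this kernel is low.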
C. Real-time Computation in Diagnostic Medicine
Today a patient in ICU is surrounded by a large number of monitoring devices recording the time variance of physiological parameters, such as blood pressure or oxygen saturation. The concept of correlating such readings and even feeding these to a predictive mathematical model, is an active research topic in clinical medicine. Earlier research [22] showed the viability of physiological monitoring to detect sleep apnea in neo-natal ICU. NanoStreams extends the concept of utilising real time data to address the challenges of improving the standard of care by using a rapidly responsive automated surveillance system in an adult ICU.
Mechanical ventilation is a common and essential therapeutic intervention performed in critical care units throughout the world, for respiratory and neuromuscular diseases, sepsis, shock, for airway protection, or for temporary support after surgery. Epidemiological studies have shown that up to 2.8% of patients admitted to hospitals undergo mechanical ventilation. Mechanical ventilation can only be performed in critical care units which are a limited and expensive resource. However, mechanical ventilation can worsen the injury in previously damaged lungs [23] and can initiate injury in normal lungs.
The NanoStreams medical use case encapsulates several computational challenges because it seeks to provide a responsive system that is scalable to incorporate multiple physiological parameters and multiple patients. An in-memory database provides a key mediation stage between the patient sensor readings and the subsequent multistage analytical processing. Performance criteria suggest that an in-memory database implementation is a key component of the architecture. Furthermore, in NanoStreams we seek to test the hypothesis of whether microservers provide a suitable platform with which
VI. PUTTING IT ALL TOGETHER
Our fully working NanoStreams prototype integrates an ARM-based microserver, the FPGA nanocores accelerator and the nanowire communication layer. In its current instance, the microserver is a Boston Viridis 2U rack box [24], hosting a cluster of ARM nodes. Each ARM node consists of a quad-core ARM A9 System-on-Chip (SoC) with a shared 4MB L2 cache, 4GB of DDR3 RAM, a 250GB SATA3 disk and a Gigabit Ethernet interface. We are in the process of migrating the microserver host to the Applied Micro XGene platform.
The nanocores prototype is implemented on a Zedboard development kit [25] featuring the Xilinx Zynq XC7Z020 device and an integrated Gigabit Ethernet interface. The Zynq device includes a dual-core ARM processor for managing the FPGA fabric and facilitating software development. The Viridis microserver and the Zedboard accelerator connect directly via their Gigabit Ethernet interfaces. For measuring energy consumption, each machine is equipped with a Wattsup Pro meter [26] taking power measurements at the PSU level, with a sampling frequency of 1 Hz. Additionally, we make use of on-board IPMI power monitoring sensors with a sampling frequency of 4 Hz. A digital multimeter attached to an electrical interface on the Zedboard measures the whole board's power consumption with a sampling frequency of up to 25 Hz.
Each ARM node in the microserver runs Ubuntu Linux 12.04 LTS, which provides a stable environment to develop the application use cases. The server of the nanowire communication layer runs as a bare metal daemon on the ARM management cores of the Zedboard. The nanowire client is implemented as an offloading library that transparently handles communication with the server. An application running on the Viridis ARM microserver is compiled against the nanowire client library to access the offloading API.
A. Demo and Initial Results
We demonstrate our integrated, fully working prototype using the financial use case of option pricing. This use case emulates a realistic application scenario by replaying a real-time trading feed from the New York Stock Exchange over a multicast UDP channel. The Viridis microserver, extended with the nanocore accelerator, and a competing Intel server tap into this channel to retrieve stock updates and price the same set of options. Fig. 6 shows an overview of the configuration. Moreover, each machine exports real-time power and performance measurements to an external visualisation tool (Fig. 7) accessible on the web through HTTP. In the context of option pricing, QoS is defined as the ratio of the number of options priced before the next stock update to the total number of options to be priced.
We use a single socket of an Intel Sandy Bridge server as a baseline for comparative analysis. On the Intel server, we measure power at the processor socket level, including the attached DRAM, by using IPMI and RAPL sensors. For the Viridis microserver, we measure power at the node level, including the ARM SoC and DRAM. For the accelerator, we measure the whole board's power consumption. The accumulated power of the Viridis node and the accelerator board constitutes the total power of the NanoStreams solution.
Table II summarises the results and suggests that the NanoStreams approach of co-designing microservers can greatly improve energy efficiency. At a fraction, namely 1/7, of the power consumption of the Intel server, NanoStreams achieves an average QoS of 49% of options priced on time, compared with 77% for the Intel server. In terms of time per option (T_opt) and Joules per option (J_opt), NanoStreams is roughly 20% slower than Intel when pricing an option, but it reduces energy per option by a factor of 5.5. Note that the energy reduction is smaller than the power reduction because the slightly higher T_opt prolongs the computation.
Nevertheless, in the financial use case, the QoS and performance gap can be readily narrowed by scaling out NanoStreams with more nodes, while still remaining more energy efficient than Intel. We are further investigating possibilities for improving energy efficiency across the hardware and software stack. These include deploying multiple accelerators, improving the architecture of Nanocores, upgrading ARM hosts to 64-bit for higher energy efficiency than the existing A9 hosts, and further reducing the offloading communication cost of nanowire.
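The relationship between the reported metrics can be checked with a few lines of arithmetic. The sketch below assumes that Joules per option is simply average power multiplied by time per option, and that QoS is the fraction of options priced before the next stock update, as defined above. The function names are illustrative, and the inputs are the rounded ratios quoted in the text rather than the raw values of Table II. With those rounded inputs (1/7 of the power, 1.2 times the time per option) the energy reduction comes out at about 5.8, which is in line with the reported factor of 5.5 once rounding of the quoted ratios is allowed for.

```python
def qos(priced_in_time: int, total_options: int) -> float:
    """Fraction of options priced before the next stock update."""
    if total_options <= 0:
        raise ValueError("total_options must be positive")
    return priced_in_time / total_options


def joules_per_option(avg_power_watts: float, seconds_per_option: float) -> float:
    """Energy per option, assuming energy = average power x time."""
    return avg_power_watts * seconds_per_option


def energy_reduction(power_ratio: float, time_ratio: float) -> float:
    """Baseline energy per option divided by candidate energy per option.

    power_ratio: candidate power / baseline power (e.g. 1/7)
    time_ratio:  candidate time per option / baseline time per option (e.g. 1.2)
    """
    if power_ratio <= 0 or time_ratio <= 0:
        raise ValueError("ratios must be positive")
    return 1.0 / (power_ratio * time_ratio)


if __name__ == "__main__":
    # Rounded ratios quoted in the text; not the raw Table II measurements.
    print(round(energy_reduction(1 / 7, 1.2), 2))
    # Hypothetical counts that reproduce the quoted QoS percentages.
    print(qos(49, 100), qos(77, 100))
```

Because the power ratio is divided by the slowdown, any increase in time per option directly erodes the energy advantage, which is why the energy reduction is smaller than the sevenfold power reduction.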
VII. CONCLUSION
NanoStreams has been successful in bringing together best practices from embedded systems design and high-performance computing. We have achieved higher energy-efficiency for analytical tasks on data streams than state-of-the-art servers on real silicon prototypes, while making tangible progress in the development and adoption of new hardware technologies for the European microserver roadmap. As an industry-focused project (five out of seven project partners are companies, including three SMEs), NanoStreams has successfully engaged stakeholders who wish to explore new technologies for in-situ analytics without investing in massive warehouse-scale datacentres to meet their needs.
Reflecting on the project, we have also identified areas where taking different directions could have achieved better adoption potential. We have used two compiler infrastructures, one based on ACE's CoSy framework for easy compiler generation and C programming of Nanocores, and another based on LLVM for memory access profiling. We believe that unifying everything under a single compiler infrastructure, preferably LLVM due to its prevalence, will significantly broaden the user base. We also believe that there is significant commercial opportunity for isolated components of NanoStreams, including the Nanocores and Nanowire as core datacentre components, and parts of the language technology as a concurrency extension to the C standard.
Taking a workload-specific approach to sizing and optimising system scale might limit the scope of our results and not cover more general solutions for Edge or high-capacity datacentres. These limitations, combined with an intention to broaden the micro-server market potential in Europe, have led us to form a new Horizon 2020 project, UniServer, which since February 2016 has been exploring general-purpose micro-servers and system software that jointly cope with and exploit intrinsic architectural variation to improve efficiency across a range of Edge computing and IoT workloads.
