Heterogeneous Servers based on Programmable Cores and Dataflow Engines by Wu, Yun et al.
Wu, Y., Gillan, C., Minhas, U., Barbhuiya, S., Novakovic, A., Tovletoglou, K., ... Nikolopoulos, D. (2017).
Heterogeneous Servers based on Programmable Cores and Dataflow Engines. In Workshop Energy
efficient Servers for Cloud and Edge Computing 2017
Published in:
Workshop Energy efficient Servers for Cloud and Edge Computing 2017
Document Version:
Peer reviewed version
Publisher rights
Copyright The Author 2017
Heterogeneous Servers based on Programmable Cores
and Dataflow Engines
Yun Wu, Charles Gillan, Umar Minhas, Sakil Barbhuiya,
Alexsandar Novakovic, Kostas Tovletoglou, George Tzenakis,
Hans Vandierendonck, Georgios Karakonstantis, Dimitrios Nikolopoulos
Queen’s University Belfast
yun.wu, c.gillan, u.minhas, sbarbhuiya03@qub.ac.uk
a.novakovic, ktovletoglou01, gtzenakis01@qub.ac.uk
h.vandierendonck, g.karakonstantis, d.nikolopoulos@qub.ac.uk
ABSTRACT
The continuous growth of the Internet, and of the data volumes
associated with it, puts severe pressure on the computational,
storage and networking resources available in today’s data centers.
Stringent power budgets and the difficulty of expanding current
Internet communication infrastructures necessitate the design of
new server architectures, programming models and run-time systems.
In this paper, we present promising options for the design of
energy-efficient servers based on programmable accelerators, which
can help push the limits of computation in future data centers.
We show that one of the proposed micro-server prototypes, based on
custom tiny cores, can achieve 40% better energy efficiency than a
standard Xeon server, while dataflow engines can speed up execution
by up to 374x for various workloads.
Keywords
Big Data; IoT; Data Center; Heterogeneous Computing; Ac-
celerators; Dataflow Engines
1. INTRODUCTION
The rapid growth over recent years in the number of de-
vices being connected to the Internet is driving a rapid up-
scaling in the amount of the data that are being both gen-
erated and communicated to data centers [5]. Presently, we
rely on massive data centers that contain tens of thousands
of servers equipped with multiple cores and huge amounts of
memory for processing and storing those data. However, as
the Internet-of-Things (IoT) unfolds, the number of smart
connected devices will increase dramatically and it will be-
come doubtful if we can keep servicing the associated data
volumes by merely scaling up the current infrastructure.
In fact, it is difficult to imagine how to transfer these data
over the existing public networks and satisfy the low response
latencies of many emerging applications without further substantial
investment to expand those networks, which might take years to
complete, if it is technologically feasible at all. This challenge
has urged designers to rethink the Cloud computing model, leading
to the introduction of a new paradigm, Edge computing, in which
data are pre-processed at the edge of the Cloud, close to the users.
Such a paradigm can help relieve data centers of the pressure of
high data volumes and tight response-latency requirements. However,
to enable it, new servers are needed that can easily be deployed
and maintained in edge environments and that are much more energy
efficient than classical ones.
In fact, new, more energy-efficient server architectures are needed
not only to enable the Edge computing paradigm but also to continue
supporting Cloud computing, because of the recently identified
utilization wall, which hinders the up-scaling of server processing
capabilities. In particular, studies have shown that this
utilization wall limits the number of transistors within a chip
that can be powered up at full speed, a phenomenon called Dark
Silicon [6]. The scale of the challenge is so dramatic that some in
the industry have started to warn [12] that by 2020 the packing of
billions of transistors under very tight power budgets may permit
only nine percent of the on-chip transistors to be active at any
point in time. Such claims have elevated power to a prime design
parameter, making it apparent that improving performance by using
more resources or operating at a higher frequency requires new
server architectures that provide the necessary levels of energy
efficiency.
To address this challenge, new heterogeneous server architectures
based on energy-efficient programmable accelerators on Field
Programmable Gate Arrays (FPGAs) [10, 11] and/or embedded
processors [7] have been introduced. Such heterogeneous
architectures have the potential to achieve energy-efficient
computation at latency, throughput and density levels that improve
on those of general-purpose processors, but there is still a need
to quantify the achieved trade-offs while rethinking the
hardware/system-software interfaces so that the heterogeneous
resources can be utilized efficiently.
In this paper, we discuss our research efforts, challenges
and results on the design and software integration of hetero-
geneous server architectures, based on reconfigurable cus-
tom processors, referred to as nanocores as well as on pro-
grammable dataflow engines.
In particular, in Section 2 we discuss a new micro-server
architecture based on reconfigurable tiny cores optimized for
FPGAs and on a novel bare-metal Ethernet networking in-
frastructure, which achieves low latency access and sharing
of accelerators without affecting the host processor architec-
ture.
In Section 3, we discuss another option for the implemen-
tation of the accelerators based on data flow engines and an-
alyze the results and the open challenges in enabling their
utilization by traditional data center programming frame-
works.
Finally, conclusions are drawn in Section 4.
2. SERVERS BASED ON RECONFIGURABLE CORES
2.1 System Architecture
One strategy in our work is to explore a scale-out micro-
server architecture that can serve the emerging edge com-
puting ecosystem, namely the provisioning of advanced com-
putational, storage, and networking capability near data
sources to achieve both low latency event processing and
high throughput analytical processing. The research targets
the challenges of processing streams of data in real-time by
using ARM based microservers and developing an Analytics-
on-Chip (AoC) architecture based on Xilinx Zynq-7000 All
Programmable SoC family which functions as an accelerator
attached to these.
The AoC uses an amalgam of low-power RISC processors
for the embedded systems domains and Nanocores, a new
class of programmable compute units. The AoC processor
is a heterogeneous SoC that reduces latency in processing
of streaming operators issued to the micro-server with the
latency-optimised RISC cores, while improving analytical
processing throughput on compute and data intensive tasks
with the Nanocores.
Complementing the AoC architecture is a low latency
communication protocol which we call NanoWire. This en-
ables communication between CPU hosts, either microserver
or server class systems, and the AoC engines. Recognising
that Ethernet is now a ubiquitous interconnect technology,
NanoWire is enabled directly on the Ethernet layer. This
means that it benefits from the scalability of Ethernet tech-
nology and also that one accelerator, enabled for NanoWire
communications, can be easily shared between many hosts
in the datacentre.
In the next two subsections we describe the Nanocore and
NanoWire respectively. We then describe two of the exemplar
applications which are executed on the system.
2.1.1 Nanocores
Nanocores are a new class of programmable and configurable
processors. In previous publications [8, 9] we explained the
initial implementation of the AoC architecture on a Zedboard.
Recently we have moved this onto an Avnet Mini-ITX platform, the
specification for which is shown in table 1. This further level of
integration makes use of a larger FPGA device, fitting more
Nanocores on the SoC (32 Nanocores compared to 8 on the Zedboard)
and enabling more efficient on-chip parallelisation of a range of
analytical tasks. As with the initial prototype, this AoC instance
supports 32-bit and 64-bit fixed point arithmetic. The
rearchitected fabric in this new version supports multiple
Nanocore groups (4 groups) and allows chains of 8 Nanocores to be
accessed independently. The key aspect of the Nanocore is to
optimise the underlying FPGA hardware to allow the creation of a
light core which operates substantially faster than existing
FPGA-based cores. Figure 1 shows the architecture of the Nanocore.
Figure 1: Block diagram of the Nanocore
Attribute             Value
Zynq device           Z-7100
Interconnect          1 Gb Ethernet
RAM                   1 GB PL, 1 GB shared DDR3
Processor Core        Dual-core ARM Cortex A9
Max frequency         800 MHz
Programmable Logic    Kintex-7 FPGA
LUTs                  227,400
Flip-flops            554,800
BRAM                  755
DSP48s                2,020
Table 1: Parameters defining the Avnet Mini-ITX board on
which the latest Nanocore prototype is implemented
It is widely known that the software development lifecycle for
accelerators, and in particular for FPGAs, is longer than for
implementation on either CPUs or GPUs. This fact militates against
the use of FPGAs in industries which are sensitive to cost
pressures and/or development times, a notable example being the
high frequency trading field in finance. Agility and adaptability
are important contributors to both the financial and reputational
risk exposure of financial firms. The fact that Nanocores are
significantly more easily and quickly programmed than an FPGA
programmed directly in VHDL or Verilog serves to remove this
disadvantage. Figure 2 indicates that the AoC can be operated in a
similar way to GPUs and other accelerators. The compiler on the
right hand side of the figure is implemented by defining the
Nanocore instruction set as a new backend to LLVM.
Figure 2: Illustration of the reprogrammability of the
Nanocore.
2.1.2 NanoWire
NanoWire has two layers, accounting on the one hand for the
process of transporting packets between the host and accelerators
and on the other for the operation of issuing requests to tasks
running on the accelerator. We call these the Host-Accelerator
Transport (HAT) and the Task Issue Protocol (TIP) respectively.
Together they create a communications substrate which:
1. has a simple and convenient API to virtualise and man-
age accelerators,
2. creates reliable, high-throughput and low latency trans-
fers,
3. uses minimal CPU cycles on the host.
The layers of the NanoWire stack are shown in figure 3.
Figure 3: Layers of the NanoWire protocol
The HAT is the network tier of the overall microserver (called
NanoStreams) system architecture and offers a common abstraction
of the network level services and I/O primitives to both the host
and accelerator nodes. It encapsulates each NanoWire packet
directly within one Ethernet frame, a process which requires the
use of a custom EtherType field within the standard Ethernet
header. HAT allows multiple hosts to share the same accelerator,
supports variable size packets (up to the Ethernet MTU size), and
supports reliable transmission. Packets can be switched via
Ethernet switches; however, HAT also provides the ability to
further customise the Ethernet header for efficiency when
switching is not required.
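The encapsulation step described above can be illustrated with a short Python sketch. It is not the NanoWire implementation: the EtherType value 0x88B5 (one of the IEEE 802 local experimental EtherTypes) is an assumption, since the paper does not state which value is used, and the function names `encapsulate` and `decapsulate` are hypothetical. Minimum-frame padding and the reliability mechanism are left out.

```python
import struct

# Assumption: 0x88B5 is an IEEE 802 "local experimental" EtherType; the paper
# does not say which custom EtherType NanoWire actually uses.
NANOWIRE_ETHERTYPE = 0x88B5
ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, EtherType
ETH_MTU = 1500                         # standard Ethernet payload limit (bytes)


def encapsulate(dst_mac: bytes, src_mac: bytes, payload: bytes) -> bytes:
    """Place one packet directly inside one Ethernet frame (no IP, no UDP)."""
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    if len(payload) > ETH_MTU:
        raise ValueError("packet exceeds the Ethernet MTU")
    return ETH_HEADER.pack(dst_mac, src_mac, NANOWIRE_ETHERTYPE) + payload


def decapsulate(frame: bytes):
    """Return (src_mac, payload), or None if the frame carries another protocol."""
    if len(frame) < ETH_HEADER.size:
        return None
    _dst_mac, src_mac, ethertype = ETH_HEADER.unpack_from(frame)
    if ethertype != NANOWIRE_ETHERTYPE:
        return None
    return src_mac, frame[ETH_HEADER.size:]
```

Because the receiver filters on the EtherType alone, frames of this kind can coexist with ordinary IP traffic on the same switch, which is consistent with the sharing of one accelerator between many hosts.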
Finally, HAT provides lightweight connection-less chan-
nels as the lowest-level communication. A channel consists
of a point-to-point unidirectional queue of packet slots used
for communication between a source host node and a des-
tination accelerator node. Resources per channel (e.g. el-
ement size) are chosen at creation time. Channels aim at
providing a low overhead and low-latency communication
path, while allowing the system to tune resources and re-
source placement for each channel.
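A channel of this kind can be modelled as a bounded queue whose slot count and slot size are fixed at creation. The Python sketch below is only a host-side illustration under that assumption; the class name `Channel` and its methods are hypothetical and do not reproduce the NanoWire API.

```python
from collections import deque


class Channel:
    """Unidirectional queue of fixed-size packet slots (illustrative model)."""

    def __init__(self, num_slots: int, slot_size: int):
        # Resources are chosen once, when the channel is created.
        self.num_slots = num_slots
        self.slot_size = slot_size
        self._slots = deque()

    def send(self, packet: bytes) -> bool:
        """Source side: enqueue a packet; return False when every slot is in use."""
        if len(packet) > self.slot_size:
            raise ValueError("packet larger than the slot size")
        if len(self._slots) >= self.num_slots:
            return False  # caller decides whether to retry or drop
        self._slots.append(packet)
        return True

    def receive(self):
        """Destination side: dequeue the oldest packet, or None if empty."""
        return self._slots.popleft() if self._slots else None
```

Fixing the slot size up front means that no allocation is needed on the data path, which is one plausible way to keep per-packet overhead low.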
NanoWire avoids the use of sockets and therefore elimi-
nates the use of the kernel IP stack. Instead NanoWire offers
TIP as a task queue layer that issues task requests from the
hosts and receives task results from the accelerators. TIP
therefore provides the network tier of NanoStreams and of-
fers a common abstraction of the network level services and
I/O primitives to both the host and accelerator nodes. TIP
code runs on both sides of the interconnect and although
the host side makes use of the host OS services, on the ac-
celerator side TIP will run on more diverse platforms. In
our prototype the accelerator side of HAT is implemented
as custom firmware directly on top of the ARM processors
of the FPGA card.
2.2 Case Studies and Results
We have applied the proposed architecture and software
stack in various case studies. In the next paragraphs we
discuss two commercial use cases covering financial markets
and respiratory physiology in intensive care units.
2.2.1 Option pricing in the financial markets
Figure 4 shows the architecture of the proposed microserver
(i.e. NanoStreams) system for pricing European stock options.
Stock exchanges broadcast real-time price updates as a market data
feed, with each exchange generally having its own format. The
linehandler software written by each client consumes the data
feed, converting it to a private format. In general the industry
is dominated by UDP multicast (UDP-MC/IP) and this is what we have
used to relay the feed to the microservers in our lab setup.
As explained above, the AoC accelerators are connected to the
Ethernet in our lab, and NanoWire allows for low latency
communications between the microservers and the AoCs. The number
of AoCs in the system (as well as the number of Nanocores in each
AoC) is scalable, and each AoC can be dynamically shared as needed
between the microservers.
Figure 4: Architecture of the NanoStreams system computing
financial option prices. (Diagram labels: NYSE session, session
replay, gateway node, Ethernet switch, Odroid board, FPGA Nanocore
accelerators, UDP-MC/IP protocol, NanoWire protocol.)
Using the first prototype of the AoC engine we have evaluated the
performance of the option pricing case study and compared it to an
implementation on a single socket of an Intel Sandy Bridge server.
The results are shown in table 2. Here, results are presented from
pre-power-supply (pre-PSU) measurements, which define the total
power budget.
Platforms                          Power (W)   Topt (s)   Jopt (J)
ARM + NanoCore (1st prototype)     5.1         0.201      1.025
ARM + NanoCore (2nd prototype)     13.9        0.051      0.71
Intel                              108.75      0.016      1.74
Table 2: Performance parameters for the financial option
pricing use case
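The table compares two metrics: the time per option in seconds,
Topt, and the energy per option in Joules, Jopt. Measured by Jopt,
the first NanoStreams prototype needs about 40% less energy per
option than the Intel platform (1.025 J against 1.74 J), even
though each option takes longer to price.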
Using the latest implementation on the Avnet Mini-ITX platform
(2nd prototype), enhanced performance is obtained. Performance
increased roughly fourfold, while the pre-PSU power consumption
increased by a factor of about three relative to the first
prototype AoC.
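The relationship between the columns of table 2 can be checked with a few lines of Python: energy per option is the product of power and time per option. The figures are copied from the table; the dictionary layout and function name are only illustrative.

```python
# (power in W, time per option in s), copied from Table 2
platforms = {
    "ARM + NanoCore (1st prototype)": (5.1, 0.201),
    "ARM + NanoCore (2nd prototype)": (13.9, 0.051),
    "Intel": (108.75, 0.016),
}


def joules_per_option(power_w, seconds_per_option):
    """Energy per option = power drawn x time taken to price one option."""
    return power_w * seconds_per_option


energy = {name: joules_per_option(p, t) for name, (p, t) in platforms.items()}
intel = energy["Intel"]
for name, joules in energy.items():
    print(f"{name}: {joules:.3f} J per option, {joules / intel:.2f}x the Intel figure")
```

The products reproduce the Jopt column to the precision given in the table, and the ratios show that the first and second prototypes use roughly 0.59 and 0.41 times the energy per option of the Intel platform respectively.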
2.2.2 Intensive care medicine
An intensive care unit (ICU) is a data rich, network enabled
environment in which multiple physiological parameters for each
patient are recorded into databases and monitored. In our work we
employ real-time analysis of streams of respiratory data from
mechanical ventilators. Trend analysis and thresholding are
applied to tidal volumes and airway pressure readings over varying
time scales, and alerts are issued via SMS to clinicians when
pre-defined trends are observed. Determining such trends is a
matter for clinical research, so our tool currently provides a
mechanism to find the optimal intervention path.
Figure 5: Graphical presentation of the Pearson correlation
coefficients for database performance.
The performance of the database at the heart of the application
has been analysed for ingress of patient data. We measured similar
metrics to those used in the financial case above. Figure 5
presents the Pearson correlation coefficients in a novel graphical
form. Each metric is a node in a graph and the proximity of the
metrics to each other represents the overall magnitude of their
correlations, so clusters are immediately evident in the figure.
For example, the cluster at the top left of the figure shows that
Average RAM per insert (AVG.RAM) correlates strongly with Average
Inserts per second (AVG.INS.S) and Joules per Insert (J.INS). Each
path in figure 5 represents a correlation between the two metrics
that it joins. A lighter path represents a positive correlation,
and a darker path represents a negative correlation. The width and
transparency of the line represent the strength of the correlation
(wider and less transparent means stronger correlation).
Turning to data retrieval, computation of trends and reporting,
table 3 shows the breakdown in terms of percentage CPU time for
each operation. Computing the metrics dominates, which makes this
step well suited to off-loading to the Nanocore. This mathematical
operation, which is similar to a moving average filter, exploits
the following characteristics of the Nanocore.
Functionality        Percentage CPU time
Data retrieval       19
Compute metrics      68
Report formatting    13
Table 3: Partition of CPU effort for thresholding of ICU
parameters
Nanocores are lightweight cores capable of processing data
streamed in order. The stream of multiple patients’ vital
parameters is scattered so that each core processes one patient.
This offers parallelism and easy tracking of data against
patients’ IDs. Each core keeps past values in memory and computes
the new running average after receiving the current value. The new
average is compared against a threshold and transmitted to the
microserver. The processing time increases with a larger time
scale, but for the tested time scales of 5 to 15 minutes each core
can process about 4 to 4.5 million readings per second. This
number scales almost linearly with the number of cores.
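The per-patient scheme described above can be sketched in Python as follows. This is a software model only, not Nanocore code: it uses floating point where the hardware uses fixed point, the names `RunningAverage` and `process_stream` are hypothetical, and the choice to raise an alert when the average exceeds the threshold is an assumption about the direction of the comparison.

```python
from collections import defaultdict, deque


class RunningAverage:
    """Windowed running average, as kept by one core for one patient."""

    def __init__(self, window: int):
        self.values = deque(maxlen=window)  # past values held in memory
        self.total = 0.0

    def update(self, reading: float) -> float:
        if len(self.values) == self.values.maxlen:
            self.total -= self.values[0]    # oldest value is about to be evicted
        self.values.append(reading)
        self.total += reading
        return self.total / len(self.values)


def process_stream(readings, window, threshold):
    """Scatter (patient_id, value) readings so each patient has its own state.

    Yields (patient_id, average) whenever the new average exceeds the threshold.
    """
    cores = defaultdict(lambda: RunningAverage(window))
    for patient_id, value in readings:
        average = cores[patient_id].update(value)
        if average > threshold:
            yield patient_id, average
```

Keeping a running total makes each update constant-time regardless of the window length; the paper reports that processing time on the Nanocore grows with the time scale, so the hardware kernel presumably differs from this model in that respect.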
3. ACCELERATORS BASED ON RECONFIGURABLE DATAFLOW ENGINES
In this section, we discuss a server architecture based on
dataflow engines and present some initial comparative re-
sults against a classical multi-core server.
3.1 System Architecture
3.1.1 Dataflow Engine
Programmable accelerators based on dataflow engines are another
promising solution, offering a way of performing computation that
is completely different from computing with conventional CPUs
[13, 3]. Instead of control-flow-based computation,
dataflow-based computation is used: the powerful multi-tasking CPU
cores are replaced with a vast number of single-tasking dataflow
cores on FPGAs, which are customized and optimized to fit the
features of each algorithm. By focusing on optimizing the movement
of data in an application, orders-of-magnitude benefits in
performance, space and power consumption are gained through
massively parallel processing.
A key challenge that prevents the wide utilization of such
dataflow engines in servers lies in the difficulty of integrating
the programming models that such accelerators need, which may
differ significantly from multi-core programming frameworks. The
main difficulties lie in the semantic gap between accelerators and
general-purpose processors (e.g., differences in the memory
model), as well as in the lack of resource management support
(e.g., for allocation and scheduling of tasks on accelerators).
Currently the high-performance computing community has adopted
hybrid programming models to enable the use of accelerators;
however, such solutions are highly undesirable for virtualized
accelerators in the cloud, where the underlying resources must be
transparent to cloud users. To address the limitations of hybrid
programming models, we propose instead to adopt a library of
implementations of algorithms on accelerators and to build a
suitable infrastructure so that it can be used from high-level
programming frameworks.
3.1.2 Spark
Spark is an open-source general framework for distributed
computing that offers high performance for both batch and
interactive processing, described as a fast and general engine for
large-scale data processing [2]. It exposes APIs for Java, Python,
and Scala and consists of the Spark core and several related
projects, which offer API support for database queries, scalable
fault tolerance, machine learning and graph-parallel computation.
A Spark application can run locally or distributed across a
cluster, and it is commonly executed interactively during the
data-exploration phase, where its ad-hoc analysis capability is
useful.
As a programming platform with unified data management, it greatly
speeds up operation and maintenance as an ’all-in-one’ solution
[1]. The interactive console offers interactive data analysis and
hooks up seamlessly to a connected cluster. However, to bring the
advantages of Spark to a dataflow engine (DFE) server, an
interfacing ’glue’ layer that adapts seamlessly to the features of
dataflow processing is still missing. Because the degree of
parallelism of a DFE cluster differs from that of traditional CPU
servers, an efficient data processing path is required that
combines the merits of both the high-level language and the DFE.
3.1.3 Infrastructure
The challenges addressed in this work are to (i) hide the
accelerator from the programmer by presenting it as a library
function, embeddable in query processing, data processing or
aggregation tasks; (ii) extend the runtime systems of high-level
analytics languages to handle scheduling, communication, and
synchronization with programmable accelerators efficiently; and
(iii) improve the performance robustness of analytics written in
high-level languages against performance artefacts of
virtualization.
Figure 6 shows the overall system architecture of our approach.
Spark serves as the high-level programming framework for big data
and DFEs serve as the accelerating units in the data center; the
two are glued together with the Java Native Interface (JNI) and
SLiC/MaxelerOS (a run-time that allows the DFE to be accessed
through a C-based API). The DFE configurations are generated
through the OpenSPL [4] programming model, which produces a
library repository for different run-time configurations as files
with the suffix ’.max’. The Resilient Distributed Dataset (RDD) is
passed transparently from Spark to the DFE through our middleware
and transmitted over InfiniBand between the host and the
accelerators.
3.2 Initial Results
The Linear Regression application from the Maxeler Application
Gallery is used as the benchmark for evaluating our
infrastructure; the results are shown in Figure 7. Using the math
library in Spark, we compare and break down the task execution
between the host CPU and the DFE accelerators.
In our experiments, we have characterized the overhead of
off-loading tasks from Spark onto Maxeler DataFlow Engines (DFEs)
and evaluated the performance of linear correlation when executing
on the DFE in comparison to its execution in Spark on 8 CPUs. Our
results revealed a speedup of up to 374x for data set sizes
ranging from 100 to 10M pairs of floats, as shown in Figure 7a.
Moreover, we have measured that moving data between the Java
Virtual Machine and the DFE adds about 3% overhead over execution
of the DFE kernel, as shown in Figure 7b, which makes this
overhead practically negligible. In contrast, allocating and
configuring the DFE takes about 3.2 seconds, which is a
significant overhead.
Figure 6: The Big Data Infrastructure on DFE (Diagram labels:
Applications, Data Centre, Middleware, JNI, SLiC, MaxOS, Resilient
Distributed Dataset (RDD), InfiniBand, Dataflow, .max)
Figure 7: Linear Regression. (a) Performance Speedup: speedup
against data size (×10^6), with plotted values 46, 74.67, 254.56,
272.25, 319.06, 328.17 and 341.88. (b) Run-time Overhead: overhead
(%) against data size (×10^6), with plotted values 20, 16.67,
4.79, 2.56, 3.13, 3.23 and 4.19.
As the results in Figure 7 illustrate, larger streaming data batch
sizes gain significantly more performance than smaller ones. At
the same time, the overhead is kept low once the streaming data
batch size reaches a certain level, 10^6 floats in this case. Our
initial results indicate that building big data applications on
DFEs through our infrastructure is very promising. Accelerators
based on DFEs can make a step change in the energy efficiency and
density of data centers. Seamless integration of accelerators in
data center programming frameworks is, however, an open question,
which will be addressed by future enhancements of our
infrastructure.
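To see why the fixed cost of allocating and configuring the DFE matters, the following Python sketch applies a simple amortisation model. Only the 3.2 s figure comes from the measurements above; the model itself, the function name `effective_speedup` and the CPU times used in the example calls are illustrative assumptions rather than measured results.

```python
SETUP_SECONDS = 3.2  # DFE allocation and configuration time reported above


def effective_speedup(cpu_seconds, kernel_speedup, jobs_per_setup=1):
    """Speedup seen by the caller when setup is paid once per `jobs_per_setup` jobs.

    cpu_seconds    -- hypothetical time of one job on the CPU baseline
    kernel_speedup -- speedup of the DFE kernel alone, excluding setup
    """
    dfe_seconds = cpu_seconds / kernel_speedup + SETUP_SECONDS / jobs_per_setup
    return cpu_seconds / dfe_seconds


# Hypothetical 1 s job with a 300x kernel: setup dominates, so off-loading loses.
print(effective_speedup(1.0, 300.0))
# Same job, with one configuration reused across 1000 jobs.
print(effective_speedup(1.0, 300.0, jobs_per_setup=1000))
```

Under this model a large kernel speedup is only realised when the configuration is reused across many jobs or when each job is long, which is consistent with the observation that larger batch sizes perform better.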
4. CONCLUSIONS
This paper has presented two different approaches to the challenge
of providing energy-efficient servers targeting computing at the
edge as well as in cloud environments. Nanocore and NanoWire bring
together best practices from embedded systems design and high
performance computing to achieve higher energy efficiency for
analytical tasks on data streams than state of the art servers. A
real-silicon prototype, based on the Xilinx Zynq platform and
ARM-Linux, has been shown to be competitive, for workloads drawn
from different sectors, when compared to contemporary HPC servers,
sustaining transactional throughput and improving system energy
efficiency and programmability.
The Dataflow approach, using Maxeler technology, provides an
alternative solution which is also promising. While Nanocore and
NanoWire focus on integer operations, the DFE enables floating
point computation and provides speed-ups of over 300x compared to
traditional servers.
Our work, funded by the European Commission, is contributing to
the wider effort in Europe to create a server ecosystem. It
exploits intrinsic architectural variation to improve efficiency
across a range of Edge computing and IoT workloads, areas which
will dominate the application space for the foreseeable future.
5. ACKNOWLEDGMENTS
This research is supported in part by the European Com-
munity under the NanoStreams (FP7, contract 610528), VINE-
YARD (Horizon2020, grant 687628) and ASAP projects (FP7,
grant 608224).
6. REFERENCES
[1] Apache Spark: An All-In-One Tool For The
Data-Driven Enterprise.
http://syntelli.com/blog/apache-spark-an-all-in-one-tool-for-the-data-driven-enterprise/.
Accessed: 2016-11-10.
[2] Apache Spark™ is a fast and general engine for
large-scale data processing. http://spark.apache.org/.
Accessed: 2016-11-10.
[3] Dataflow Computing.
https://www.maxeler.com/technology/dataflow-computing/.
Accessed: 2016-11-10.
[4] The Open Spatial Programming Language: OpenSPL.
http://www.openspl.org/. Accessed: 2016-11-10.
[5] Cisco. Cisco Visual Networking Index: Global Mobile
Data Traffic Forecast Update, 2015–2020 White
Paper. 2016.
[6] H. Esmaeilzadeh, E. Blem, R. St. Amant,
K. Sankaralingam, and D. Burger. Dark Silicon and
the End of Multicore Scaling. SIGARCH Comput.
Archit. News, 39(3):365–376, June 2011.
[7] G. Georgakoudis, C. Gillan, A. Hassan, U. Minhas,
G. Tzenakis, I. Spence, H. Vandierendonck, R. Woods,
D. Nikolopoulos, M. Shyamsundar, P. Barber,
M. Russell, A. Bilas, S. Kaloutsakis, H. Giefers,
P. Staar, C. Bekas, N. Horlock, R. Faloon, and
C. Pattison. NanoStreams: Codesigned Microservers
for Edge Analytics in Real Time. In Proceedings: 2016
International Conference on Embedded Computer
Systems: Architectures, Modeling and Simulation
(SAMOS XVI), 5 2016.
[8] G. Georgakoudis, C. J. Gillan, A. Sayed, I. Spence,
R. Faloon, and D. S. Nikolopoulos. ISO-Quality of
Service: Fairly Ranking Servers for Real-Time Data
Analytics. In Parallel Processing Letters, 2015.
[9] G. Georgakoudis, C. J. Gillan, A. Sayed, I. Spence,
R. Faloon, and D. S. Nikolopoulos. Methods and
Metrics for Fair Server Assessment Under Real-Time
Financial Workloads. In Concurrency and
Computation: Practice and Experience, 2015.
[10] Z. Lin and P. Chow. ZCluster: A Zynq-based Hadoop
Cluster. In 2013 International Conference on
Field-Programmable Technology (FPT), pages
450–453, Dec 2013.
[11] P. Moorthy and N. Kapre. Zedwulf:
Power-Performance Tradeoffs of a 32-Node Zynq SoC
Cluster. In 2015 IEEE 23rd Annual International
Symposium on Field-Programmable Custom
Computing Machines (FCCM), pages 68–75, May
2015.
[12] M. Muller. New mbed IoT Device Platform. ARM
TechCon, 2014.
[13] O. Pell and V. Averbukh. Maximum Performance
Computing with Dataflow Engines. Computing in
Science & Engineering, 14(4):98–103, July 2012.
