Energy consumption and total cost of ownership are daunting challenges for datacenters, because they scale disproportionately with performance. Datacenters running financial analytics may incur extremely high operational costs in order to meet the performance and latency requirements of their hosted applications. Recently, ARM-based microservers have emerged as a viable alternative to high-end servers, promising scalable performance via scale-out approaches and low energy consumption.
INTRODUCTION
Datacenters carry an immensely high total cost of ownership (TCO) and carbon footprint. This can be traced to the fact that datacenters are notoriously wasteful, often using as little as 10% of the supplied power for actual data storage and processing, while exhibiting less than 20% node utilization [1, 2]. Market analysts Frost and Sullivan report that in 2013 the aggregate power consumption of datacenters worldwide was approximately 30 GW [3], of which the USA was the largest consumer (10.56 GW) followed by the UK (3.1 GW) and Germany (2.85 GW). Furthermore, the North American and European requirements are currently growing at a rate of 6% annually.
Datacenters are paramount for enabling low latency algorithmic trading, arbitrage detection and portfolio hedging in the financial markets. A key enabler of low latency is that these datacenters are co-located with trading venues in order to minimize network delays. However, this design approach removes the freedom to site the datacenter in geographical regions with more favorable energy prices, something that is done for large-scale data processing in other market sectors. Given that high volume market feeds such as OPRA [4] can sustain data rates in excess of 600 Mb/s, saturating multiple UDP multicast channels and requiring dozens of CPUs dedicated to processing the data feed, there is a clear need to seek lower power processing options.
Microservers represent one low-power option that can reduce the energy consumption of datacenters by replacing server-class processors with embedded processors, such as the processors that power smartphones [5]. Back-of-the-envelope estimates suggest that microserver processors such as the Samsung Exynos and Calxeda Highbank ECX-1000 [6] consume 5 to 30 times less power than high-end server-class counterparts, such as Intel's Sandy Bridge, and are 5 to 15 times more energy-efficient (in integer performance per Watt) [7, 8]. However, these microserver processors are also limited in floating point performance and SIMD/SIMT acceleration capabilities [5].
In this paper we present a study, using a common code base, which compares the energy cost associated with computation of financial option prices, using different mathematical methods, on an ARM based microserver and on a state-of-the-art Intel Xeon server. The main contributions of this paper are:
• The analysis of financial option pricing computational kernels, namely Monte Carlo (MC) and Binomial Tree (BT), for energy consumption and performance on an ARM-based microserver platform and a standard x86 server. For this analysis we use actual market data for options contracts and stock price datafeeds to emulate a production-level environment. We instrument software and use platform specific, in-band power sensors, as well as an out-of-band, wall socket power meter to obtain power measurements.
• We provide a fair comparison for energy consumption using a common code base on all platforms. Moreover, we define a platform independent Joules/option metric to quantify energy efficiency. Experiments using standalone kernel execution show that microservers consume as little as 60% of the energy per option priced compared to an x86 server, via a scale-out approach.
• We show that microservers are a viable alternative to x86 servers, offering lower energy consumption, when the computational throughput matches the deadline requirements for option pricing from the data feed. Also, we perform session-wide experiments using actual market data to define Quality of Service (QoS) as the ratio of successful over total requested pricings. Experimental results show that QoS improves linearly by scaling out the ARM microserver nodes. Moreover, compared to the Intel server, a scaled-out ARM microserver reduces energy consumption by up to 30% while meeting the same performance requirements.
The paper begins by briefly defining financial option contracts and explaining the computational methods that we have used to obtain contract prices from spot prices (Section 2). We then present details of the server and microserver platforms on which we have performed our experiments, along with our measurement methodology and metrics (Section 3). We proceed by presenting results from standalone kernel and session-wide experiments using an actual market feed (Section 4). We discuss related work (Section 5) and provide some further details on how this work fits into the wider context of a project that explores microserver technologies for real-time data analytics (Section 6). We conclude by summarizing the findings of the present work (Section 7).
COMPUTING OPTION PRICES
In finance, the term Option means a derivative product which is a contract giving the holder the right to either buy (Call option) or to sell (Put option) one or more underlying assets, such as a fixed number of shares in a company, for a defined price and either on or before a contract end date. An Options contract, unlike a Futures contract, does not impose an obligation on the holder to exercise their right. There are several types of Options distinguished by the terms in their contract. We use the Monte Carlo and Binomial Tree option pricing models which can price both the so-called exotic American or Bermudan options as well as European options. We focus on European options for which the Black-Scholes equation provides a closed form solution, against which our results can be compared for accuracy.
Black and Scholes [9, 10] proposed a second-order partial differential equation which models the variation of the price of a European vanilla option, with contractual strike price P, over the time to expiry T in years, assuming that
• the underlying asset price (spot price) S follows a log-normal distribution,
• the volatility σ of S is constant and
• the risk free interest rate, or rate of return, r, expressed with continuous compounding is constant.
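Under these conditions an analytic solution for the option price V is

$$V = S\,[N(d_1) - (1 - p)] - P e^{-rT}\,[N(d_2) - (1 - p)] \qquad (1)$$

In this equation p = 1 for a Call option and p = 0 for a Put option and d1 and d2 are defined by

$$d_1 = \frac{\ln(S/P) + (r + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \qquad d_2 = d_1 - \sigma\sqrt{T} \qquad (2)$$

where N(x) is the cumulative distribution function (CDF) of the standard normal distribution.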
Moving beyond the limitations of the Black-Scholes model, the price of an option on any given date can be modeled using stochastic calculus, essentially simulating the path of the underlying variables over a set of paths within a time window. Analytical solutions for the stochastic equations are not generally possible so that a variety of computational numerical solution methods have been developed. European vanilla options can also be priced using these numerical methods. We apply two models, with distinct computational characteristics, as described next.
Monte Carlo (MC)
At the center of a Monte Carlo simulation of option pricing is a rate-limiting for-loop, an aspect that it shares with HPC applications in many fields. For a Put option the formula is [11]:

$$V_{put} = e^{-rT}\,\frac{1}{N}\sum_{i=1}^{N}\max\left(0,\; P - S\,e^{(r - \sigma^2/2)T + \sigma\sqrt{T}\,x_i}\right) \qquad (3)$$

In equation 3, $x_i$ (i = 1 . . . N) are a set of random numbers drawn from the standard normal distribution. The formula for a Call option is similar to the Put option. Given that σ, r and T are constant within the context of the loop, the implementation of equation 3 creates a compute bound process dominated by the exponential function, with relatively few load/stores to memory. One bottleneck is the control of numerical round-off error associated with a floating point summation process over a large number of terms [12, 13]. Another is the generation of sets of good quality pseudo-random numbers (PRN) because the computed price converges only slowly, as $O(1/\sqrt{N})$. This is significant because the operation count during execution of the for-loop code scales as O(N). In our work, uniform random numbers are generated using the 32-bit version of the Mersenne Twister algorithm [14], then transformed to a standard normal distribution using the Box-Muller formula.
Binomial Tree (BT)
The binomial tree option pricing model [15] builds a lattice of options prices with a root node at the starting date and multiple nodes at the end date, each date corresponding to a level in the lattice. Thus the binomial option pricing model decomposes the time variable into discrete steps. The price can be computed at any level in the tree, which corresponds to an intermediate date within the contract, thereby making the binomial model suitable for pricing American and Bermudan options. In general, at each time point i a vector of prices $S_{ij}$ is computed. Given price $S_{ij}$ and a pair of factors u and d representing the up and down movements, the two possible prices at the next level are $S_{(i+1)j} = u S_{ij}$ and $S_{(i+1)(j+1)} = d S_{ij}$, where the factors u, d are constant across the tree and are formally defined as:
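The defining expressions are not reproduced in this copy. A widely used choice, and an assumption here rather than something the text confirms, is the Cox-Ross-Rubinstein parameterization with N time steps of length Δt, together with the risk-neutral probability q of an up move that is used when discounting payoffs:

```latex
\Delta t = \frac{T}{N}, \qquad
u = e^{\sigma\sqrt{\Delta t}}, \qquad
d = e^{-\sigma\sqrt{\Delta t}} = \frac{1}{u}, \qquad
q = \frac{e^{r\,\Delta t} - d}{u - d}
```

With this choice an up move followed by a down move returns to the starting price, so the lattice recombines and level i has only i + 1 distinct nodes.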
The complete algorithm therefore has three steps:
• Given $S_0$, the current spot price, work forward from today, the date of computation, to the expiration date (timepoint N), applying the up and down factors at each step and thereby computing all final node prices $S_{Nj}$.
• At each final node of the tree (level N) compute the exercise value.
• Iterate backwards from the final nodes in the tree and for every intermediate node compute the option price assuming risk neutrality, which means that the price is computed as the discounted value of the future payoff.
The final step, a nested for-loop, dominates the time taken by the computation. The number of updates performed scales as $O(N^2)$, with each update consisting of two floating point multiplications and one floating point addition. This contrasts with the MC algorithm, where there is a need to repeatedly compute transcendental mathematical functions. The convergence of the method is a function of the number of timesteps chosen for the computations.
It is not necessary to hold all of the prices $S_{ij}$ in memory at the same time. Only the prices at timepoint i are needed. A single vector, whose elements are overwritten as the values are processed during the backwards iteration, suffices for this purpose. Thus the BT implementation is dominated by pointer de-referencing to generate array indices and by data move activity, in contrast to the sum reduction characteristics of the MC algorithm.
Common Code Base
We developed a common code base in the C language to be used on all platforms for the experiments. The heart of our system is an OptionPricer implementing the algorithms described above. We believe that this is the fairest way to perform a comparison between platforms, rather than to invest heavily in tailoring code for each platform. GCC is invoked with the same flags on all platforms: -O2 -Wall -g -fPIC.
The metrics that we report across all platforms reflect a fixed development effort (approx. 30 days) to create and test the code base without using any architecture specific optimizations, such as SIMD intrinsics. Better performance could likely be gained by investing a further, but a priori unquantifiable, amount of programmer time to optimize code for enhanced performance. We leave this as future work.
Note that our servers build on different hardware which by design exhibits significant differences in performance and power. For our comparison we define metrics which characterize application-specific performance and energy consumption in a platform independent manner. Moreover, we conduct experiments where we scale-out the microserver platform so that iso-performance and iso-power comparisons to the high-end server are possible.
EXPERIMENTAL SETUP
Our experimental setup consists of two platforms on which we execute the OptionPricer under various workloads, measuring energy consumption and performance in each case. Section 3.1 defines the metrics that we report, Section 3.2 describes the various platforms and Section 3.3 provides details of the methodology used to obtain the power readings.
Definition of Metrics
Option pricing in finance mainly takes place by consuming a live streaming data feed of market prices, often within the context of high frequency trading (HFT), and for pre-trade risk analytics. The execution time characteristics of option pricing are different from those of numerical simulation in computational science using HPC. One aspect is that scientific codes tend to have measurable setup and post-processing phases, often dominated by integer arithmetic and with different power profiles, as well as intense floating point loop based computations. By contrast, option pricing runs relatively small standalone kernels at very high frequency with little setup or post-processing. Option pricing on live market data feeds is in essence a form of event processing. Based on these distinctions we present three metrics used in this paper to quantify the viability of microservers in financial analytics:
Joules/option The energy consumed per execution of a pricing kernel is a fundamental metric given that this step is repeated at very high frequency throughout a trading session. In the case of an actively traded stock, with a high number of defined option contracts, this building block is repeated throughout the trading day without a break. Correspondingly, a reduction in this value can translate to significant energy efficiency for providers offering option pricing services within their portfolio.
Time/option In contrast to providers, end users, particularly those engaged in HFT, are sensitive to end-to-end latency, thereby constraining the elapsed time per option metric, which in turn defines the total time to price all contracts for a given stock. Option pricing shares this time-to-solution performance metric in common with HPC applications. Any change to the J/option metric, caused by introducing low power microservers, has to be balanced against any change in the Time/option metric, as providers compete on latency statistics.
QoS New prices may arrive at any time in a trading session. This means that any contracts not yet priced using the previous price update are abandoned and deemed to be failures. Related to the Time/option metric, but also dependent on market activity, we define a Quality of Service metric (QoS) as the ratio of successful to total requested option pricings. The QoS metric is an application-specific measure of how well option pricing performance requirements are met. It is useful for characterizing application-related performance and the scalability offered by deploying multiple nodes.
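As a minimal sketch of how the first and third metrics could be computed from raw measurements, assuming that energy is approximated as average power multiplied by elapsed time (the function and parameter names are illustrative and not taken from the OptionPricer code):

```c
#include <assert.h>

/* Energy per option in Joules.
 * Assumption: energy = average power (W) x elapsed time (s). */
double joules_per_option(double avg_power_w, double elapsed_s, long options_priced)
{
    assert(options_priced > 0);
    return avg_power_w * elapsed_s / (double)options_priced;
}

/* QoS: ratio of successful to total requested pricings, in [0, 1]. */
double qos_ratio(long successful, long requested)
{
    assert(requested > 0 && successful >= 0 && successful <= requested);
    return (double)successful / (double)requested;
}
```

For example, a hypothetical platform that draws 20 W on average and prices 617 contracts in 61.7 s would score 2 J/option.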
Hardware Platforms
We used two platforms, one state-of-the-art HPC server architecture based on an Intel processor and a microserver based on an ARM processor. In our experiments, version 4.7.3 of the GCC compiler was used for code generation on both platforms. Both platforms offer the possibility of scaling their frequency and voltage through a DVFS interface. The setting of these values affects the performance of the CPU. The maximum performance is obtained when these values are set to their upper limits, and correspondingly the lowest performance is obtained when they are set to their minimum. We call the former setting performance mode and the latter powersave mode.
Other details of the two platforms are as follows:
Intel is an x86-64 server with a quad-core Intel Xeon E5-2609v2 processor (IvyBridge architecture). A quad core is used in this work in order to be equivalent to the four ARM cores available on each of the Viridis nodes described below. This CPU does not have hyperthreading capabilities. However, even for processors with the hyperthreading capability, the recommendation for HPC is to have it disabled. In powersave mode the frequency is 1.2 GHz while in performance mode it is 2.5 GHz. The server is equipped with 16 GB of DRAM and runs the CentOS 6.5 operating system with kernel version 2.6.32.
Viridis The Viridis server is a 2U rack mounted server containing sixteen microserver nodes connected internally by a high-speed 10 Gb Ethernet network. This means that the platform appears logically as sixteen servers within one box. Each node is a Calxeda EnergyCore ECX-1000 comprising 4 ARM Cortex-A9 cores and 4 GB of DRAM running Ubuntu 12.04 LTS. The server can be configured with a different number of active nodes, which we denote by writing Viridis(x) where x is the number of active nodes. The performance mode frequency is 1.4 GHz and the powersave is 200 MHz.
Power Measurement
Power measurements are taken at two points along the path supplying current to the CPUs, as shown in Figure 1. Each measurement point exhibits different characteristics [16].
Figure 1: The path of the current supply to the CPU showing points at which we measured power.
The Power supply unit (PSU) converts the AC wall socket supply to DC, but can be up to 30% inefficient. The voltage regulator (VRM) stabilizes the DC supply before it reaches the CPU. The exact form of the current supply path differs from one platform to the next but to provide a fair basis for comparison we identified two distinct points on the path which are measurable on both platforms. In all cases a WattsupPro multimeter measures the power at the point before the PSU, which we label PRE-PSU, giving a value that corresponds to the true economic cost of operating each platform. Equating the Intel Xeon package approximately to the ARM SoC means that the PRE-VRM point is also a suitable place in the path at which to take comparative measurements. For the Intel server, PRE-VRM measurement is facilitated by reading the Running Average Power Limit (RAPL) counters while the same functionality on Viridis is available through the Intelligent Platform Management Interface (IPMI) counters. 
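RAPL exposes cumulative energy counters rather than instantaneous power, so a power reading has to be derived from two samples taken a known interval apart. The text does not say how the counters were sampled, so the following is only a sketch under assumptions: it takes two raw 32-bit counter values and a scale factor in Joules per count (on Intel hardware this would come from the RAPL unit register), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average power in Watts between two samples of a cumulative 32-bit
 * energy counter. Unsigned subtraction is modulo 2^32, so a single
 * counter wraparound between the two samples is handled correctly. */
double average_power_w(uint32_t start_count, uint32_t end_count,
                       double joules_per_count, double interval_s)
{
    assert(interval_s > 0.0 && joules_per_count > 0.0);
    uint32_t delta = end_count - start_count;
    return (double)delta * joules_per_count / interval_s;
}
```

The sampling interval has to be short enough that the counter cannot wrap more than once between samples, otherwise energy is silently under-counted.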
EXPERIMENTS AND RESULTS

Experiments performed
Stock price data was recorded for a full session of normal trading in July 2014 and subsequently used to perform all experiments. By normal we mean that no significant initial public offerings (IPOs) or other skewed trading patterns took place on the market that day. We compiled and then executed our OptionPricer on our laboratory servers and configured the installation to compute prices using MC and BT models for the 617 option contracts defined on the Facebook stock at the time. Each instance of the OptionPricer used four compute threads, thus fully subscribing the CPUs on both the Intel and Viridis platforms. Both models are numerical approximations to the Black-Scholes analytical model, converging to this value as N → ∞. We selected three values of N for the MC computation (0.5M, 1M and 2M), confirming that the relative error with respect to the Black-Scholes result decreased progressively to less than 0.1%. We found that achieving comparable results with the BT model required N equal to 4000, 5000 and 7000 respectively. Fixing on these values of N for all simulations therefore allows comparison of metrics both within and between the models.
For all experiments reported here, we replayed the collected data onto UDP multicast channels in our lab using tcpreplay. This enabled us to conduct two types of experiments:
Standalone kernel experiments For these, a single price update for one stock is pushed onto the multicast channel where it is read by each instance of the OptionPricer. All defined option contracts for that stock are then computed. Typically there are only a few hundred contracts open for one stock, meaning that these experiments are relatively short to run. These experiments allow precise investigation of the J/option and Time/option metrics.
Session-wide experiments
In these experiments we replay all stock price updates for a complete trading session onto a UDP multicast channel. These experiments allow us to analyze application performance using the QoS metric. Results also include energy consumption for the whole session using PRE-PSU power measurements, reported as the TCO in kWh. Moreover, we show data comparing iso-QoS and iso-power cross-platform configurations in terms of energy efficiency and performance. A cross-platform comparison is iso-QoS when the platforms compared have roughly the same value for the QoS metric, whereas iso-power means they have approximately the same power consumption.
Standalone kernels
In our first experiment we investigate the effect of performance and powersave modes on energy consumption. Table 1 shows average power, compute time and energy per option for each platform, executing the MC 1M and BT 5000 kernels, measuring the PRE-VRM power. Comparing the J/opt values between performance and powersave modes shows that performance mode is more energy efficient for both platforms and for any kernel. This is because computation latency under powersave execution increases disproportionately to power savings. Specifically, on the Intel server, the PRE-VRM power consumption in powersave mode is about 2/3 of that in performance mode. By contrast, computation latency doubles, thus negating any power savings. Note that in powersave mode the Intel CPU runs at 1.2 GHz, which is roughly half the 2.5 GHz performance mode frequency. This directly translates to a twofold increase in computation latency. On the Viridis, the increase in J/option in powersave mode is even more pronounced. Powersave power consumption is again about 2/3 of the power consumption in performance mode, but computation latency surprisingly increases by an order of magnitude. The Viridis' powersave frequency (200 MHz) is 1/7 of the performance mode frequency (1.4 GHz). However, this does not translate to proportional power savings because the ARM SoC includes other components which are not controllable through the DVFS interface. Following our findings and for the rest of our experiments, processors are set in performance mode, which is the most energy efficient.
Next we compare the platforms when they are operating in performance mode. The J/opt of the Viridis(1) configuration is on par with Intel when executing BT 5000. However, Viridis(1) has about 25% more energy consumption per option compared to Intel when executing MC 1M. This is because the MC calculation involves transcendental functions that execute algorithmically on ARM whereas Intel provides specialized instructions which assist in computing them. The MC performance gap between Viridis and Intel, indicated by the S/opt (seconds per option) metric, is greater than that of the BT kernel, hence the J/opt increases.
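The DVFS observation above can be checked with a short calculation. Energy per option is average power multiplied by time per option, so the energy ratio between the two modes is the product of the power ratio and the latency ratio. Here W denotes average power (to avoid confusion with the strike price P) and t the time per option; the Viridis figure takes "an order of magnitude" to mean roughly 10×, which is an assumption.

```latex
\frac{E_{\text{powersave}}}{E_{\text{performance}}}
  = \frac{W_{\text{powersave}}}{W_{\text{performance}}}
    \cdot \frac{t_{\text{powersave}}}{t_{\text{performance}}}

\text{Intel: } \tfrac{2}{3} \times 2 = \tfrac{4}{3} \approx 1.33
\qquad
\text{Viridis: } \tfrac{2}{3} \times 10 \approx 6.7
```

In both cases the ratio exceeds 1, so powersave mode costs more energy per option than performance mode.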
We now discuss energy efficiency in more detail, varying the number of iterations for the financial kernels and scaling out the Viridis to 16 nodes. Table 2 shows the results for the various configurations used. In this table we show the total elapsed time instead of S/opt. Elapsed time is a more representative metric for scale-out performance. S/opt quantifies the computational latency for pricing a single contract and remains constant for the same platform and kernel configuration, regardless of the number of active nodes. On the other hand, the total elapsed time decreases under scaling, roughly linearly for both our kernels, due to multi-node, parallel computation. Also, note that the J/opt values for Viridis(1) and Viridis(16) are about the same. Scaling out Viridis nodes indeed reduces the elapsed time linearly but power consumption increases linearly too since more nodes are active, thus the J/opt ratio remains constant.
Examining the MC algorithm, Intel has a lower J/opt value for all values of N. The most significant difference in energy efficiency occurs when N is set to 2M. In this case Viridis consumes about 40% more energy per option than Intel. As discussed earlier, the MC model is completely dominated by a for-loop using transcendental function evaluation and with minimal setup costs beforehand. The Intel computes these functions in hardware and performs faster, resulting in an overall energy metric of 1.37 J/opt compared to 1.96 J/opt for Viridis. Considering the BT model, the Intel and the Viridis have similar values for the J/opt metric when N is set to 4000 or 5000 iterations. Interestingly, for the largest iteration case, when N is 7000, Intel consumes roughly 70% more energy per option than Viridis. This can be explained by the fact that the BT model has two computational phases. The first phase performs setup computations in linear complexity, involving transcendental mathematical functions, which can run hardware accelerated on Intel. The second phase is the tree scanning phase, which has quadratic complexity, using only add and mul floating point operations. These operations execute natively on both platforms. In the first phase, the Intel does better than the Viridis in terms of J/opt, but as N increases the tree scanning phase dominates the execution time so that the Viridis outperforms the Intel in J/opt. The performance scalability of the Viridis for both kernels is evident. Scaling Viridis to 16 nodes results in a 14× to 16× reduction in the overall elapsed time while maintaining approximately the same J/opt metric in all cases.
The previous results report metrics using the PRE-VRM power and show that the BT model exhibits a better J/opt metric on the Viridis and the MC model on the Intel. Next we show results using PRE-PSU power measurements, which better reflect actual economic cost. Table 3 summarizes those results. Viridis(16) has a significantly better J/opt value compared to Intel for both kernels. Specifically, Viridis(16) consumes as little as about 60% of the energy per option of Intel. Notably, the poor figures for Viridis(1) reflect the fact that substantial power is consumed by components within the chassis regardless of the number of active nodes. In more detail, Viridis(16) PRE-PSU power consumption is only 1.75× the consumption of Viridis(1), despite a 16-fold increase in node occupancy in the box. This is because PRE-PSU power includes the constant power consumption of system components, such as the node interconnect, needed for multi-node integration on Viridis. Scaling out Viridis nodes amortizes this power consumption, suggesting that Viridis becomes more power efficient as the number of computation nodes increases.
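The amortization effect can be made concrete with a simple two-term power model. This is a sketch under the assumption that PRE-PSU power splits into a fixed chassis term and a per-node term that grows linearly with the number of active nodes n, which the measurements suggest but do not prove; the symbols W_fixed and W_node are illustrative, and the last line assumes an ideal 16× reduction in elapsed time.

```latex
W(n) = W_{\text{fixed}} + n\,W_{\text{node}}

\frac{W(16)}{W(1)} = 1.75
\;\Rightarrow\;
W_{\text{fixed}} + 16\,W_{\text{node}} = 1.75\,(W_{\text{fixed}} + W_{\text{node}})
\;\Rightarrow\;
W_{\text{node}} = \frac{0.75}{14.25}\,W_{\text{fixed}} \approx 0.053\,W_{\text{fixed}}

\frac{(\text{J/opt})_{16}}{(\text{J/opt})_{1}}
  = \frac{W(16)}{W(1)} \cdot \frac{t_{16}}{t_{1}}
  \approx 1.75 \times \frac{1}{16} \approx 0.11
```

Under this model about 95% of the Viridis(1) PRE-PSU power is fixed overhead, and energy per option at 16 nodes falls to roughly a ninth of the single-node value.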
Session-wide experiments
The previous section has quantified both the J/opt and the Time/opt metrics in an ideal case by considering a standalone price update. However, in a live market situation the question arises of whether this processing can keep pace with the rate of arrival of new stock prices. When a new stock price arrives, this event triggers a computation of all available option contracts. The strictest requirement is all-or-nothing: the computation of all options must complete before the next stock price update, otherwise it is deemed to have failed. However, this requirement is too stringent to meet. Indicatively, Figure 3 shows the percentage of successful all-or-nothing computations as a function of the cumulative number of price updates, sorted into bins of 0.25 second intervals. The data for our session-wide experiments are taken from a trading session of 6.5 hours where 10156 price updates occurred for the Facebook stock. Assuming the MC and BT kernels execute the least number of iterations, only Viridis(16) is able to partially meet the all-or-nothing requirement, having a total elapsed time of around 2.9 seconds.
Figure 3: All-or-nothing pricing vs. stock price update intervals
Nonetheless, there are trading scenarios where the financial user does not require this all-or-nothing approach. For example, a user may prefer to price a subset of options, such as short-term or long-term expiring ones, for which there is financial interest. Moreover, the all-or-nothing approach disregards the fact that some options were indeed correctly priced and can be used for trading. For these reasons we relax this requirement and define our QoS metric as the percentage of successfully priced option contracts over the total number of option contract pricings. Table 4 records the measurements for the QoS metric, the absolute number of successful option pricings and TCO for various platform configurations running the session-wide experiment for the MC 1M kernel. Results for the BT kernel are similar. Also, Table 4 includes two extrapolated entries for the Intel platform (marked by a star) assuming optimistically that power and performance scale linearly for Intel. We use the Intel(x) notation for denoting active nodes for the Intel too. These extrapolated entries provide estimates for comparing Intel to scaled-out Viridis. Note that PRE-PSU measurements for Intel do not include any power consumption of system components needed for integrating multiple nodes, such as the interconnect, as is the case for Viridis. Hence, linear scaling of performance and power for the Intel is an optimistic estimate.
From the measurements it follows that Viridis(16) has the best QoS metric, pricing 41.5% of the options in time. Looking at the scale-out Viridis configuration, the QoS metric scales linearly as more Viridis nodes are active, while power consumption scales at a significantly lower rate. As noted earlier, this is because PRE-PSU measurements for Viridis include the power consumed by system integration components. Comparing Viridis(16) to Intel(1), Viridis(16) has about double the energy consumption (TCO) but its QoS metric is 3× that of Intel(1).
Next we discuss the results on platform configurations that are iso-QoS, having similar values for the performance-oriented QoS metric, or iso-power, having similar power consumption. Observing the measurements, Viridis(4) is about iso-QoS to Intel(1), having a QoS of 10.4% as opposed to 13.2% for Intel(1). However, Viridis(4) has roughly 30% more power and energy consumption than Intel(1). Scaling out Viridis to 4 nodes is not sufficient to outperform Intel. Using the extrapolated entries, Intel(2) is about iso-power with Viridis(16), which has the highest QoS value. Interestingly, Intel(2) has a much lower QoS value of 26.5% timely priced options compared to the 41.5% value of Viridis(16). This result suggests that under the same power budget, the scaled-out Viridis can outperform Intel. Moreover, Intel(3) is about iso-QoS to Viridis(16) but it consumes roughly 30% more power and energy. This suggests that Viridis is more energy efficient than Intel when the performance objective is fixed.
RELATED WORK
Several recent studies explore the performance and power consumption of ARM processor models that hold promise for entering the server market [7, 8] and the HPC market [5]. We focus our study on the domain of financial real-time analytics applications and further provide a comparison between fully integrated ARM-based microservers in scale-out configurations and a high-end Intel server. Our study thus provides more insight into the viability of the ARM ecosystem for datacenters and high-performance computing. The work of Blem et al. [17] is the closest to ours, as it explores the performance and power consumption of several ARM and Intel processors. However, their study focuses on the energy and performance implications of the ISA choice on the different processors, rather than the use of the processors at scale to equip datacenters and their deployment in specific application domains. There is relatively little recent work on the measurement of energy consumption for specific numerically intensive kernels, particularly on server platforms. Alonso et al. [18] model the power and energy of a specific task-parallel implementation of Cholesky factorization. Dongarra et al. [19] explore the energy footprint of dense numerical linear algebra libraries on multicore systems. Kozin [20] analyzed the execution of several computational chemistry codes on Nehalem servers and on a heterogeneous system composed of one Nehalem server and one Tesla GPU system, reporting that idle power was a significant component of overall power consumption. Our work evaluates the power and energy of option pricing kernels both in standalone mode and in the context of end-to-end market sessions to obtain a complete view of the implications of the choice of server processors on total cost of datacenter ownership.
THE NANOSTREAMS PROJECT
The work reported in this paper has been carried out within the wider context of our NanoStreams project. The project bridges the performance gap between microservers and large servers by enhancing microservers with application-specific, energy-efficient and programmable accelerators. The project is building a heterogeneous microserver with a host SoC and an analytics accelerator SoC, with a total power budget under 10 Watts, where a performance-equivalent system with state-of-the-art server-class processors would consume about 170 Watts. NanoStreams achieves its goals by adopting a scale-out approach where multiple microservers are densely replicated and packaged to build systems with performance equivalent to that of large-scale servers but a dramatically smaller form factor. A central feature of this is a co-designed software stack providing elastic and scale-free co-execution of parallel workloads, with ease of programming by developers as a central concern. In this paper we have demonstrated the viability of microservers based on the now discontinued Calxeda ECX-1000 SoC and the Cortex-A9 core. We will be evaluating more recent ARM-based SoCs based on 64-bit cores, with GPU or FPGA accelerators, in future work.
CONCLUSIONS
In this paper we have presented a methodology for comparing server platforms for real-time financial analytics. Our methodology considers performance, energy efficiency, QoS, and the impact of energy consumption on total cost of ownership. We applied two option pricing models on real market data. We defined new platform-independent metrics (Joules/Option, iso-QoS, iso-power) to permit fair comparisons of the servers under different scenarios of interest to the datacenter platform provider and the end user. We also analyzed the computation profile of two important financial kernels, the Monte Carlo and Binomial Tree models, in relation to each server's specific features.
Our results show that microservers based on ARM processors are viable tools for numerically intense, pre-trade risk computations. Scaled-out microserver configurations improve throughput linearly, resulting in higher quality of service, while power consumption scales modestly due to the integrated microserver form factor. Based on our experimental results with real market data for a trading session, microservers promise up to 30% less energy compared to a standard HPC server while providing the same performance.
ACKNOWLEDGEMENTS
The NanoStreams project is funded by the European Commission under its Seventh Framework Programme as contract number 610509. This work was also partially supported by the UK Engineering and Physical Sciences Research Council, under grants EP/K017594/1, EP/L000055/1 and EP/L004232/1.
