Algorithmic Trading: A brief, computational finance case study on data
  centre FPGAs by Inggs, Gordon
GORDON INGGS, Circuit and Systems Research Group, Department of Electrical and Electronic
Engineering, Imperial College London
FPGAs will increasingly be deployed at scale, driven by the need for more power-efficient computation
and by improved high level synthesis tool flows, creating a new category of device: data centre FPGAs.
One method for using these FPGAs is to identify the proportion of a given workload that would benefit
from being implemented upon the available FPGAs, while minimising off-chip communication. When
implementing these tasks, care should be taken in identifying which parallel execution mode, task or
pipeline parallelism, should be used. A case study of computational finance tasks, a benchmark
workload of Heston and Black-Scholes-based options implemented using OpenCL and OpenSPL, illustrates
the benefit of this method of using data centre FPGAs. These devices deliver latency close
to that of workstation grade GPUs while requiring considerably less energy, resulting in 30% more floating
point operations per Joule of energy consumed.
ACM Reference Format:
Gordon Inggs, 2016. Algorithmic Trading: A brief, computational finance case study on data centre FPGAs
ACM V, N, Article A (January YYYY), 13 pages.
DOI:http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
1.1. Why FPGAs are coming to a data centre near you
I argue that Field Programmable Gate Arrays (FPGAs) will be increasingly deployed
at scale, i.e. in large data centres and available from Infrastructure-as-a-Service (IaaS)
providers, due to recent trends in both computing hardware and software.
Firstly with respect to hardware, the ramifications of the end of Dennard scaling are
on-going. While the turn to multicore Central Processing Unit (CPU) architectures has
continued power efficiency gains, there has been a significant slowdown in the rate of increase
of computational power efficiency since 2000 [Koomey et al. 2011].
As a result, alternative computational architectures such as Graphics Processing Units
(GPUs) and FPGAs are proliferating. These architectures provide orders of magnitude
better power efficiency than even the latest multicore CPUs. This power efficiency comes
by virtue of architectural specialisation on parallel execution. In the GPU case, this is in
the form of Single Instruction, Multiple Data (SIMD) parallelism, while FPGAs can provide
fine-grained pipeline parallelism by creating custom architectures.
The performance and economies of scale attached to GPUs make them attractive for data
centre deployment, and indeed, several IaaS providers offer GPU resources. However, GPUs
are typically power intensive. So, while offering high throughput performance, this comes
at increased power consumption relative to a server grade CPU, and more importantly,
Author’s addresses: Gordon Inggs, Circuits and Systems Research Group, Department of Electrical and
Electronic Engineering, Imperial College London, London, United Kingdom.
ACM Journal Name, Vol. V, No. N, Article A, Publication date: January YYYY.
arXiv:1607.05069v1 [cs.DC] 10 Jan 2016
increased heat generation which increases the data centre infrastructure costs. FPGAs,
by contrast, are much less power intensive, and can often be passively cooled, even when
deployed at scale.
Secondly, with respect to software, a significant development is the widening pool of
programmers that can potentially use FPGAs thanks to High Level Synthesis (HLS)
tools. As I discussed in previous work [Inggs et al. 2014], it is now quite possible to use
HLS to produce FPGA implementations that are competitive with other HPC approaches.
In the past, users targeting FPGAs were frequently referred to as designers, as the
process of using the devices was much more akin to hardware design. However, with the
recent advances in HLS, using FPGAs is becoming increasingly indistinguishable from
programming, albeit with very long compile times. Hence, in this note, I refer to users of
FPGAs as programmers.
1.2. Data Centre FPGAs
In the previous subsection I argued that it is increasingly attractive to deploy FPGAs en
masse to data centres. IaaS providers could then make these FPGAs available as virtualised
compute resources, similar to their current multicore CPU and GPU offerings.
I suggest that the availability of FPGAs in the cloud will result in a distinct category
of data centre FPGAs. Arguably, the offerings from the two largest FPGA vendors, Xilinx
and Altera, are already bifurcated between large devices intended for high performance
computing, such as the Virtex and Stratix lines of chips, and smaller, embedded application-
focused devices, such as the System-on-Chip Zynq and Cyclone lines.
In this note, I consider data centre FPGAs to have the following three characteristics:
(1) Size - being one of the largest devices available for that process technology.
(2) Hosted - programmed and controlled by, and capable of communicating with, a
conventional host CPU. A single host CPU may manage multiple data centre FPGAs.
(3) Scalable - part of a modular system which can be replicated many times over.
Note that the characteristics above do not preclude the FPGA from being able
to communicate with other systems besides its host CPU.
1.3. Questions for Data Centre FPGAs
I believe there to be two key research questions for data centre FPGAs.
Firstly, what should the FPGAs be used for, relative to other data centre computing
resources such as multicore CPUs and GPUs?
Secondly, how should the FPGA resources be used, given the opportunity for different
forms of parallel execution?
Both questions are the subject of much ongoing research. In this note, however, I
largely focus on the second question.
1.4. Contributions
In this note, I make the following three distinct contributions in determining how data
centre FPGAs should be used:
(1) A methodology for using data centre FPGAs, both in terms of what portion of the
task should be implemented on the FPGA as well as how the work should be implemented
upon the device.
(2) A case study, applying the methodology I outline to the real world problem domain of
option pricing from computational finance.
(3) An evaluation of this methodology, using the case study. I describe the efficiency of
data centre FPGAs from three leading providers, using two heterogeneous programming
standards. I also compare these devices to a multicore Intel CPU and GPUs from both
major vendors, AMD and NVIDIA.
Fig. 1: Proposed data centre FPGA method. [Flowchart with the stages: Receive workload;
Profile Workload; Assess for FPGA (Suitable / Not suitable); Check Implementation
availability (Yes / No); Implement on FPGA; Execute on CPU/GPU; Execute on
CPU/GPU/FPGA; Return results.]
1.5. The rest of the note
The remainder of this note describes my suggested approach to using data centre FPGAs,
and a preliminary evaluation of the approach that I outline. In the next section, I describe
the approach.
2. USING DATA CENTRE FPGAS
2.1. What data centre FPGAs should be used for
I propose the following method for using data centre FPGAs, derived from the model for
using heterogeneous computing systems of [Braun et al. 2001], as illustrated in Figure 1:
— Receive Workload : the workload of computational tasks is specified by the programmer,
possibly using a general purpose programming language.
— Profile Workload : the computational tasks are analysed with respect to the different
parallel execution modes. This could be done either by running a subset of the tasks,
or by a static analysis comparing the tasks to heterogeneous computing benchmarks [Wang
et al. 2014], such as Rodinia [Che et al. 2009].
— Assess for FPGA: based upon the workload profiling, an assessment needs to be performed
to decide whether FPGAs should be employed, and if so, which tasks or what proportion
of the tasks should be executed on the FPGAs.
— Execute on CPU/GPU : if the workload is found not to be suitable for FPGA execution,
or if it still needs to be implemented upon the FPGA, some of the tasks can be executed
upon available CPU or GPU resources, provided doing so does not exceed programmer
constraints such as cost.
— Check implementation availability : if FPGAs are being used, implementations of the task
for FPGAs need to be found.
— Implement on FPGA: if there is no FPGA implementation available, one needs to be
generated. The implementation should seek to achieve the programmer's objectives, such
as latency minimisation, as efficiently as possible.
— Execute on CPU/GPU/FPGA: when the workload is executing on FPGAs, other available
resources should also be used, exploiting the complementary strengths of all, as I have
described in other work [Inggs et al. 2015].
— Return results: the results are returned, in the form expected by the programmer.
The degree of automation in this method is left to the discretion of the system implementer.
2.2. How to make the most of data centre FPGAs
In the method outlined above, there are many open questions, such as the best method for
profiling workloads, how to partition tasks between heterogeneous computing platforms,
or even the overall question of how such a method should be abstracted to programmers.
In this note however, as noted in Section 1.3, I am primarily concerned with proposing
this method, and considering the implementation of tasks upon FPGAs.
As demonstrated in my previous work [Inggs et al. 2014], it is increasingly possible to use
HLS flows to implement tasks upon FPGAs, using open standards such as OpenCL [Stone
et al. 2010] and OpenSPL [OpenSPL Consortium 2013]. However, these standards only
guarantee functional correctness, leaving any optimisation for the platform up to the programmer.
Furthermore, due to the inherent flexibility of FPGAs, the programmer is faced with a
choice between different forms of parallelism. I suggest that the programmer should evaluate
many possible architectures, and select the one which offers the highest power efficiency, i.e.
the most computational work for the energy expended.
3. DERIVATIVES PRICING CASE STUDY
3.1. Background
3.1.1. Option Pricing. Computational finance is an important activity in modern commerce.
The problems in the area are concerned with the modelling of uncertainty or risk. Derivatives
pricing is one of the largest activities in this area, with ≈ $100 trillion of derivatives products
currently active. Derivative pricing is also computationally intensive, and as a result is a
major consumer of high performance computing, including multicore CPUs, GPUs and
increasingly, FPGAs.
An example of a derivative is an option contract. An option is an agreement where a
holder pays a premium to the writer in order to obtain rights with regard to an underlying,
an asset such as a stock or commodity. This right allows the holder either to buy or to sell the
underlying at a defined strike price at a defined exercise time.
The holder has bought the right to exercise the transaction if they so choose, and is in
no way obligated to do so. In derivatives pricing, the intrinsic value of the option is the payoff,
the difference between the strike price and spot price of the underlying at the exercise time,
or zero, whichever is higher [Hull 2011].
Fig. 2: Overview of Monte Carlo derivatives pricing. [Plot of Price against Time, showing
the Spot Price, the Mean and Distribution of the simulated paths, the Strike Price, the
Expiration Time and the resulting Payoff.]
3.1.2. Monte Carlo-based Option Pricing. The popular Monte Carlo technique for option pricing
uses random numbers to create scenarios or simulation paths for the underlying based
upon a model of its spot price evolution. The average outcome of these paths is then used
to approximate the payoff [Hull 2011], i.e.
\[ V_t = e^{-r(T-t)} \int_{w} V(w)\, dP(w) \approx e^{-r(T-t)} \frac{1}{N} \sum_{i=0}^{N-1} V(S_i), \]
where V_t is the current value of the option, e^{-r(T-t)} the discount factor, P(w) the probability
space defined by the underlying asset, N the number of simulation paths and S_i the price of
the asset in simulation path i. I have provided an illustration of Monte Carlo option pricing
in Figure 2.
Although computationally expensive, the Monte Carlo pricing technique is robust, capable
of tolerating underlying models with many more stochastic variables than competing
methods [Hull 2011].
3.1.3. Computational Implementation. The Monte Carlo option pricing algorithm can be expressed
using the MapReduce computational design pattern [Dean and Ghemawat 2004].
The simulation paths comprise the map operation, while the reduction is the average of
the payoffs. I have described this pattern in C code in Listing 1, with MAP and REDUCE
labels.
Listing 1: Monte Carlo Option Pricing as MapReduce
MAP:
for(i=0;i<PATHS;++i){
  state = path_init(seed++);                        /* new random seed for each path */
  for(j=0;j<PATH_POINTS;++j) state = path(state);   /* evolve the underlying */
  value[i] = payoff(state);                         /* payoff of this path */
}
REDUCE:
result = 0;
for(i=0;i<PATHS;++i) result += value[i]/PATHS;      /* average of the payoffs */
This expression of the algorithm highlights another advantage of the Monte Carlo approach:
it is extremely amenable to parallel execution, as each simulation, i.e. each iteration
of the outer loop in the MAP section, can be performed in parallel. In fact, it is considered
to be the canonical “Embarrassingly Parallel” algorithm [Asanovic et al. 2006].
3.2. Data centre FPGA Derivatives Pricing
The C description in Listing 1 is useful when deciding how Monte Carlo derivatives pricing
should be mapped to a data centre FPGA.
The majority of the computational work is clearly in the MAP code section. In particular,
the path function call requires the generation of multiple Gaussian random numbers. This
suggests that the FPGA should be specialised for this section.
A second consideration is minimising communication between the data centre FPGA and
the host CPU. Between the MAP and REDUCE sections, only the resulting value of each
simulation needs to be communicated. If the code were segmented in any other way,
considerably more communication would be required. Hence, the FPGA
should be used for the MAP code section, while the host CPU should be used for the
REDUCE code section.
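To make the communication argument concrete, the following back-of-the-envelope calculation uses the evaluation parameters given in Section 4.1.2 (10 million simulation paths of 4096 path points each). It assumes, purely for illustration, that each communicated value is a 4-byte single precision number, and that the alternative segmentation would transfer one value per path point. Neither assumption is stated in this note.

```latex
\begin{align*}
\text{MAP on FPGA, REDUCE on host:}\quad
  & 10^{7}\ \text{paths} \times 4\ \text{B}
    = 4 \times 10^{7}\ \text{B} = 40\ \text{MB},\\
\text{split inside the inner loop:}\quad
  & 10^{7}\ \text{paths} \times 4096\ \text{points} \times 4\ \text{B}
    \approx 1.64 \times 10^{11}\ \text{B} \approx 164\ \text{GB}.
\end{align*}
```

Under these assumptions the MAP/REDUCE boundary moves roughly four thousand times less data than a split at the path point level.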
3.3. Optimising FPGA Derivatives Pricing
Having decided that it is the simulation paths, or MAP code section in Listing 1, that must
be implemented upon the FPGA, the next question is how the task should be implemented
upon the device. Below I describe two optimisations that might be applied, in terms
of two leading programming standards supported by HLS flows, namely OpenCL and OpenSPL.
3.3.1. Task Parallelism. I have illustrated task parallelism in Listing 2 by introducing a
third, parallel loop bounded by P. Each iteration of this outer loop can be performed completely
independently.
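Listing 2: Identifying Task Parallelism
for(p=0;p<P;++p){                          /* parallel loop: one iteration per task */
  offset = p*PATHS/P;                      /* first path owned by task p */
  for(i=0;i<PATHS/P;++i){
    state = path_init(seed + offset + i);  /* per-path seed, so tasks share no state */
    for(j=0;j<PATH_POINTS;++j)
      state = path(state);
    value[offset + i] = payoff(state);
  }
}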
In OpenCL, task parallelism is inherent to the standard. The programmer expresses their
tasks as instances, or work-items, of programs, or kernels. Architecturally, the degree of task
parallelism available is captured by the number of compute units available in the device.
Hence, the Monte Carlo code in Listing 2 could be used without the outer, parallel loop,
with the number of work-items set to P.
In OpenSPL, task parallelism is expressed using an architectural loop, which creates
multiple copies of the loop body. Hence code very similar to Listing 2 must be used, with the
outer loop being the architectural description.
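The independence of the tasks can be checked in plain C. The sketch below is not OpenCL or OpenSPL code: it only illustrates that, when each task derives its seeds from its own offset and writes to its own slice of the output, the tasks can run in any order and still reproduce the single-task result. The path_init, path and payoff bodies are toy integer stand-ins, and run_task is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for the path functions (hypothetical, for illustration only). */
static uint32_t path_init(uint32_t seed) { return seed * 2654435761u + 1u; }
static uint32_t path(uint32_t state)     { return state * 1664525u + 1013904223u; }
static double   payoff(uint32_t state)   { return (double)(state >> 8) / 16777216.0; }

/* Run the paths owned by task p out of P tasks.
 * The task reads only its arguments and writes only its own slice of value[]. */
void run_task(int p, int P, int paths, int path_points, uint32_t seed, double *value)
{
    assert(P > 0 && p >= 0 && p < P && paths % P == 0);
    int chunk = paths / P;
    int offset = p * chunk;
    for (int i = 0; i < chunk; ++i) {
        uint32_t state = path_init(seed + (uint32_t)(offset + i));
        for (int j = 0; j < path_points; ++j)
            state = path(state);
        value[offset + i] = payoff(state);
    }
}
```

Running run_task once with P set to 1, or four times with P set to 4 in any order, fills value with identical contents. That property is what allows the outer loop to become a set of work-items or replicated hardware.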
Table I: Configuration of Experimental Data Centre FPGA Platforms
Designation  Name (FPGA)                               Communication Technology  Programming Standard (tool)
P385-D5      Nallatech P385-D5 (Stratix V 5SGSD5)      PCIe                      OpenCL (Altera OpenCL SDK 15.0)
C5-SoC       Altera Cyclone 5 SoC (Cyclone V 5CSXC6)   AXI                       OpenCL (Altera OpenCL SDK 15.0)
Max3         Maxeler Max 3424A (Virtex 6 XC6VSX475T)   PCIe                      OpenSPL (Maxeler MaxCompiler 14.1)
Max4         Maxeler Max 4 (Stratix V 5SGSD8)          Ethernet                  OpenSPL (Maxeler MaxCompiler 14.1)
3.3.2. Pipeline Parallelism. In the naive formulation of the code, there is already ample opportunity
for pipeline parallelism, as each computational operation could be viewed as a
stage in the pipeline. However, the iterations of the inner loop bounded by PATH_POINTS are
data dependent, and hence would stall any pipeline generated in this naive fashion.
Hence, in order to further extend pipeline parallelism, I have unrolled the inner loop in
my code, as demonstrated in Listing 3.
Listing 3: Doubling the potential Pipeline Parallelism
for(i=0;i<PATHS;++i){
  state = path_init(seed++);
  for(j=0;j<PATH_POINTS;j+=2){   /* two path points per iteration; PATH_POINTS assumed even */
    state2 = path(state);
    state = path(state2);
  }
  value[i] = payoff(state);
}
The core OpenCL standard does not provide a means to express pipeline parallelism,
beyond the manual unrolling performed in Listing 3. However, vendors such as Altera provide
source code pragmas which allow for loop unrolling.
Similarly, OpenSPL does not provide a native means to express pipeline parallelism,
beyond the manual form described in Listing 3.
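When unrolling by hand, the loop bounds are easy to get wrong, so it is worth checking that the unrolled loop performs exactly the same number of path steps as the original. The following C sketch compares a rolled loop with a version unrolled by a factor of two, including a remainder step for odd trip counts. The path function is a toy integer stand-in, and evolve_rolled and evolve_unrolled2 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for one step of the path evolution (illustration only). */
static uint32_t path(uint32_t state) { return state * 1664525u + 1013904223u; }

/* Reference: one path step per loop iteration. */
uint32_t evolve_rolled(uint32_t state, int path_points)
{
    for (int j = 0; j < path_points; ++j)
        state = path(state);
    return state;
}

/* Unrolled by two: each iteration exposes two chained path steps. */
uint32_t evolve_unrolled2(uint32_t state, int path_points)
{
    assert(path_points >= 0);
    int j = 0;
    for (; j + 1 < path_points; j += 2) {
        uint32_t state2 = path(state);
        state = path(state2);
    }
    if (j < path_points)      /* remainder step when path_points is odd */
        state = path(state);
    return state;
}
```

Both functions return the same state for any non-negative number of path points. The unrolled body only gives the tool a longer chain of operations to schedule per iteration.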
4. EVALUATION
4.1. Experimental Setup
In this subsection, I describe the experimental platforms used, the tasks used in the
evaluation, and finally how the results were measured.
4.1.1. Experimental Platforms. An overview of the experimental platforms used is given in
Table I, with the details of the FPGA devices given in Table II.
The three reference platforms used are detailed in Table III.
4.1.2. Option Pricing Tasks. An overview of the five option pricing tasks is given in Table IV.
These tasks are drawn from the Kaiserslautern Option Pricing benchmark [de Schryver
et al. 2011], as well as the work from Imperial College London on pricing Black-Scholes
Model-based Asian options [Tse et al. 2011].
Table II: Experimental FPGA Resources
FPGA                  Vendor  CMOS Size (nm)  Targeted Clockrate (MHz)  LookUp Tables (LUTs)  Flipflop Registers (FFs)  Block RAMs (BRAMs)  DSPs
Stratix V 5SGSD5      Altera  28              250                       457k                  690k                      2014                1590
Cyclone V 5CSXC6      Altera  28              250                       110k                  41.51k                    557                 112
Virtex 6 XC6VSX475T   Xilinx  40              200                       298k                  595k                      1064                2016
Stratix V 5SGSD8      Altera  28              180                       695k                  1050k                     2567                1963
Table III: Comparison of Reference Platforms
Platform              CMOS Size (nm)  Clockrate (GHz)  Memory (GB)  Threads  Tool
Intel Core i7-2600S   32              2.8              16           8        GCC 4.8
AMD Firepro W5000     28              0.825            2            768      AMD OpenCL SDK 2.9
NVIDIA Quadro K4000   28              0.81             3            768      NVIDIA OpenCL SDK 7.0
Table IV: Overview of Experimental Option Pricing Tasks
Designation  Underlying     Option                  Complexity (FLOP/Simulation)
he-eu        Heston         European                323590
he-ba        Heston         Barrier                 327686
he-do        Heston         Double Barrier          331780
he-di        Heston         Digital Double Barrier  331781
bl-as        Black Scholes  Asian                   147462
Each task was performed with 10 million simulation paths, with 4096 path points in each
path.
4.1.3. Result Measurement. Two metrics were used in this study:
— Latency : the latency reported is wall-clock time, i.e. the absolute time passed from task
initialisation to the result being returned, as reported by an external time reference. In
all cases the system time is used, which is set using the Network Time Protocol.
— Energy : the energy figures reported are based upon total system power, as measured
using an Olson inline power meter. The power meter was polled regularly, with the time
since the last measurement used to calculate the energy consumed in the interval.
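The energy calculation described above can be sketched as a simple accumulation over the polled samples. The code below assumes that each reading is taken to hold for the whole interval since the previous reading. This is one plausible reading of the description, not a confirmed detail of the measurement scripts, and the power_sample type and energy_joules function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One reading from the power meter (hypothetical layout). */
typedef struct {
    double time_s;   /* timestamp of the reading, in seconds */
    double power_w;  /* total system power at that time, in watts */
} power_sample;

/* Accumulate energy in joules: each reading is multiplied by the time
 * elapsed since the previous reading. The first sample only sets t0. */
double energy_joules(const power_sample *s, size_t n)
{
    double energy = 0.0;
    for (size_t k = 1; k < n; ++k) {
        double dt = s[k].time_s - s[k - 1].time_s;
        assert(dt >= 0.0);            /* samples must be in time order */
        energy += s[k].power_w * dt;
    }
    return energy;
}
```

For example, readings of 80 W at 0 s, 80 W at 1 s and 100 W at 3 s give 80 x 1 + 100 x 2 = 280 J. A trapezoidal rule would be an equally reasonable choice when the polling interval is coarse.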
4.2. Experimental Results
The latency results for the experimental tasks given in Table IV, run upon the experimental
platforms described in Tables I and III, are given in Table V. The energy results are given
in Table VI.
The resource use for the FPGA platforms in Table I is given in Table VII.
Results of note are the relatively high base power consumption of the Maxeler Max4 platform,
which is approximately 240W when idling, compared to the 69W idling power of the
other platforms. Another consideration is the relatively small size of the Altera C5-SoC FPGA,
which did not allow for any optimisations to be applied.
Table V: Full Experimental Latency Results in seconds. Platform and Task designations are
given in Tables I and IV. base refers to the baseline implementations, tp and pp to the task
and pipeline parallel implementations. The lowest value for each task is marked with *.
Platform        |      he-eu       |      he-ba       |      he-do       |      he-di       |      bl-as
                | base    tp    pp | base    tp    pp | base    tp    pp | base    tp    pp | base    tp    pp
AoC P385-D5     |  180    24    20 |  178    27    22 |  176    26    22 |  169    26    22 |  189    13    11
AoC C5-SoC      |  393     -     - |  347     -     - |  347     -     - |  370     -     - |  343     -     -
Maxeler Max3    |  212    25    32 |  213    25    30 |  213    24    30 |  213    25    30 |  217    12    23
Maxeler Max4    |  235    24    43 |  236    22    43 |  236    20    43 |  236    24    43 |  243    13    49
Intel i7-2600S  |    -  1295     - |    -   611     - |    -   716     - |    -   742     - |    -   517     -
AMD W5000       |    -    8*     - |    -   161     - |    -    95     - |    -   111     - |    -    6*     -
NVIDIA K4000    |    -    14     - |    -   16*     - |    -   17*     - |    -   17*     - |    -    12     -
Table VI: Full Experimental Energy Results in kilojoules. Platform and Task designations
are given in Tables I and IV. base refers to the baseline implementations, tp and pp to the
task and pipeline parallel implementations. The lowest value for each task is marked with *.
Platform        |      he-eu       |      he-ba       |      he-do       |      he-di       |      bl-as
                | base    tp    pp | base    tp    pp | base    tp    pp | base    tp    pp | base    tp    pp
AoC P385-D5     | 13.1   2.0   1.7 | 12.8   2.2  1.9* | 12.7   2.1  1.9* | 12.2   2.2  1.9* | 13.6   1.1   0.9
AoC C5-SoC      |  6.7     -     - |  5.9     -     - |  5.9     -     - |  6.3     -     - |  5.6     -     -
Maxeler Max3    | 14.9   1.9   2.5 | 14.9   2.0   2.4 | 14.8   2.0   2.4 | 14.8   2.0   2.4 | 15.0   0.9   1.8
Maxeler Max4    | 58.3   6.2  11.1 | 59.1   5.7  10.9 | 58.8   5.3  11.0 | 59.1   6.2  11.0 | 60.5   3.4  12.3
Intel i7-2600S  |    -  74.5     - |    -  70.5     - |    -  71.7     - |    -  70.1     - |    -  64.2     -
AMD W5000       |    -  0.8*     - |    -  16.1     - |    -   9.9     - |    -  11.4     - |    -  0.6*     -
NVIDIA K4000    |    -   1.7     - |    -   2.2     - |    -   2.4     - |    -   2.4     - |    -   1.6     -
4.3. Discussion
4.3.1. Using Data Centre FPGAs Efficiently. In Figures 3, 4 and 5 I have plotted the latency
improvement of the P385-D5, Max3 and Max4 platforms over a sequential CPU, as
a function of power and device resource use, for both the task and pipeline parallelism
optimisations. I have not plotted the C5-SoC platform, as no optimisation could be supported
on the platform.
In all cases, the baseline implementations show some improvement on the sequential CPU,
despite the clockrate of the FPGAs being an order of magnitude less than that of the reference
platform. I posit that this improvement is due to the inherent opportunity for pipelining
in the application task, as well as the specialisation of the FPGA
architecture.
The results for the Nallatech P385-D5 platform, given in Figure 3, which uses the Altera
OpenCL SDK, show that the pipeline parallelism optimisation results in a more efficient
improvement over the sequential implementation. This is unexpected, given the apparently
task parallel nature of the OpenCL standard.
However, the Altera OpenCL SDK transforms the task parallelism of OpenCL into
pipeline parallelism, hence any improvement in the opportunity for pipeline parallelism
is realised. By contrast, task parallelism replicates compute units, apparently adding
additional overhead.
Table VII: Full Experimental resource use as percentages. Platform and Task designations
are given in Tables I and IV. base refers to the baseline implementations, tp and pp to the
task and pipeline parallel implementations.
Platform      Resource |    he-eu     |    he-ba     |    he-do     |    he-di     |    bl-as
                       | base  tp  pp | base  tp  pp | base  tp  pp | base  tp  pp | base  tp  pp
AoC P385-D5   LUT      |   22  58  49 |   23  62  60 |   23  62  61 |   23  63  61 |   20  72  53
              FF       |   15  45  39 |   16  47  46 |   16  48  46 |   16  48  46 |   13  53  36
              BRAM     |   25  54  44 |   25  54  51 |   25  54  52 |   25  54  52 |   23  67  37
              DSP      |    6  34  27 |    6  37  46 |    6  37  46 |    6  37  46 |    3  39  34
AoC C5-SoC    LUT      |   36   -   - |   38   -   - |   38   -   - |   38   -   - |   29   -   -
              FF       |   24   -   - |   25   -   - |   25   -   - |   25   -   - |   17   -   -
              BRAM     |   36   -   - |   37   -   - |   37   -   - |   37   -   - |   30   -   -
              DSP      |   79   -   - |   88   -   - |   88   -   - |   88   -   - |   46   -   -
Maxeler Max3  LUT      |   13  79  69 |   13  81  70 |   13  81  71 |   13  81  71 |    8  80  61
              FF       |   13  90  75 |   13  91  78 |   13  91  78 |   13  91  78 |    8  91  65
              BRAM     |    6  50  40 |    6  50  44 |    6  50  44 |    6  50  44 |    2  43  33
              DSP      |    4  16  14 |    4  16  15 |    4  16  15 |    4  16  15 |    3  18  12
Maxeler Max4  LUT      |   10  82  57 |   10  82  57 |   10  82  57 |   10  82  57 |    6  85  45
              FF       |   10  89  57 |   10  90  57 |   10  90  57 |   10  89  57 |    6  84  43
              BRAM     |    3  36  24 |    3  36  24 |    3  36  24 |    3  36  24 |    1  34  19
              DSP      |   10  91  49 |   11  91  50 |   11  92  50 |   11  91  50 |    7  96  43
Fig. 3: Nallatech P385-D5 Mean Latency Improvement as a function of Device Use and
Power. [Two panels, each plotting Acceleration over Sequential CPU for the Task Parallel
and Pipeline Parallel implementations: (a) Device Use, against Device Resource Use (%);
(b) Power, against Average Power Use (W).]
The results for the Maxeler platforms, given in Figures 4 and 5, which use the OpenSPL
standard, show that task parallelism optimisations result in a more efficient use of the device
and power, although the benefit over pipeline parallelism optimisations is less pronounced
than in the P385-D5 case.
Fig. 4: Maxeler Max3 Mean Latency Improvement as a function of Device Use and Power.
[Two panels, each plotting Acceleration over Sequential CPU for the Task Parallel and
Pipeline Parallel implementations: (a) Device Use, against Device Resource Use (%);
(b) Power, against Average Power Use (W).]
Fig. 5: Maxeler Max4 Mean Latency Improvement as a function of Device Use and Power.
[Two panels, each plotting Acceleration over Sequential CPU for the Task Parallel and
Pipeline Parallel implementations: (a) Device Use, against Device Resource Use (%);
(b) Power, against Average Power Use (W).]
Similar to the P385-D5, this result is somewhat unexpected, given the inherent orientation
of the OpenSPL standard to dataflow architectures, and hence pipeline parallelism. I believe
this is due to the increased potential for resource use within pipeline stages that increasing
task parallelism allows.
4.3.2. Using Data Centre FPGAs. Figure 6 compares all of the experimental platforms in
absolute terms. In terms of latency, as given in Figure 6a, the FPGA platforms are generally
competitive with the high performing GPU platforms. However, in terms of average power and
energy use, Figures 6b and 6c, data centre FPGAs demonstrate their advantage.
The utility of data centre FPGAs is clearly illustrated when considering the computational
efficiency of all the platforms, as given in Figure 6d, where the larger FPGAs provide many
more operations per unit of energy.
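As a rough illustration of how an operations-per-energy figure can be obtained from the reported data, the C sketch below divides the total floating point work of a task (complexity per simulation from Table IV, multiplied by the number of simulation paths) by the measured energy from Table VI. That this is how the efficiency values were derived is an assumption on my part, and mflop_per_joule is a hypothetical helper name.

```c
#include <assert.h>

/* Millions of floating point operations per joule for one task run.
 * flop_per_simulation: task complexity (Table IV)
 * simulations:         number of simulation paths executed
 * energy_kj:           measured energy in kilojoules (Table VI) */
double mflop_per_joule(double flop_per_simulation, double simulations, double energy_kj)
{
    assert(energy_kj > 0.0);
    double total_flop = flop_per_simulation * simulations;
    double energy_j = energy_kj * 1000.0;
    return total_flop / energy_j / 1.0e6;
}
```

For the he-eu task on the P385-D5 with the pipeline parallel implementation (323590 FLOP per simulation, 10 million paths, 1.7 kJ), this gives roughly 1900 MFLOP/J, which is of the same order as the mean value reported for that platform.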
A further point to consider is that the Nallatech P385-D5 platform could accommodate a
further 3 similar boards, whereas the Max4 host system could accommodate a further 7.
Doing so would significantly improve the power efficiency of both, as the energy cost of
the host system would be further amortised across the FPGAs. By contrast, the GPU host
systems could at most accommodate only one more GPU.
Fig. 6: Comparison of platforms in Tables I and III, running the tasks in Table IV. [Four bar
charts on logarithmic axes: (a) Mean Latency, (b) Mean Average Power, (c) Mean Energy and
(d) Mean Performance Efficiency. The values printed on the bars are listed below.]
Platform            (a) Latency (s)  (b) Power (W)  (c) Energy (J)  (d) Performance (MFLOP/J)
AoC P385-D5 (pp)    18               85             1581            1770
AoC C5-SoC (base)   359              16             6039            463
Maxeler Max3 (tp)   21               80             1709            1637
Maxeler Max4 (tp)   20               261            5242            534
Intel CPU (tp)      736              95             70126           39
AMD GPU (tp)        38               101            3898            718
NVIDIA GPU (tp)     14               137            2056            1361
5. CONCLUSION
In this brief note, I have described a method for using data centre FPGAs. I have illustrated
this method by applying it to a case study from computational finance, and evaluating it
upon four data centre FPGA platforms.
As a result of this work, I conclude the following. Firstly, the potential for significantly
improved power efficiency demonstrated by the data centre FPGAs in practice motivates
their existence. However, platforms should comprise several large FPGAs, as the
host system power needs to be amortised across the computational power of the FPGAs.
Secondly, FPGA optimisations orthogonal to what is inherent in the standard appear
to yield the greatest improvements over a baseline implementation. For example, if a
standard is inherently task parallel, such as OpenCL, then optimisations that maximise
pipeline parallelism should be employed. While counter-intuitive, I believe this is due to HLS
tools aggressively optimising for the paradigm of the standard supported, hence requiring the
programmer to explicitly flag other potential areas of optimisation.
Future Work
An obvious direction for future work would be to apply this methodology to other workloads,
such as image processing or web scale workloads.
More ambitious future research would consider the further automation and abstraction of
this method, making data centre FPGAs available to a wider audience.
ACKNOWLEDGMENTS
Contributions from Dr David B. Thomas, Mr Shane Fleming, Prof George Constantinides and Prof Wayne
Luk have been instrumental in this work. Future publications based upon this note will list them as co-
authors.
Funding support has been generously provided by the Oppenheimer Memorial Trust and the South
African National Research Foundation. I would also like to thank Maxeler, Nallatech, Altera and Xilinx
University Programs for supporting this work in the form of equipment and software donations.
REFERENCES
OpenSPL Consortium. 2013. OpenSPL: Revealing the Power of Spatial Computing. Technical Report.
http://www.openspl.org/wp-content/uploads/OpenSPL-WP1.pdf
Krste Asanovic, Bryan Christopher Catanzaro, David A Patterson, and Katherine A Yelick. 2006. The
Landscape of Parallel Computing Research: A View from Berkeley. Technical Report
UCB/EECS-2006-183. EECS Department, University of California, Berkeley.
Tracy Braun, Howard Siegel, and Anthony Maciejewski. 2001. Heterogeneous Computing: Goals, Methods,
and Open Problems. In High Performance Computing, Burkhard Monien, Viktor Prasanna, and Sriram
Vajapeyam (Eds.). Lecture Notes in Computer Science, Vol. 2228. Springer Berlin/Heidelberg, 307–318.
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin
Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In IEEE International
Symposium on Workload Characterisation. IEEE, 44–54. DOI:http://dx.doi.org/10.1109/IISWC.2009.5306797
C. de Schryver, I. Shcherbakov, F. Kienle, N. Wehn, H. Marxen, A. Kostiuk, and R. Korn. 2011. An
Energy Efficient FPGA Accelerator for Monte Carlo Option Pricing with the Heston Model. In
2011 International Conference on Reconfigurable Computing and FPGAs (ReConFig). 468–474.
DOI:http://dx.doi.org/10.1109/ReConFig.2011.11
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In
Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation -
Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, USA, 10–10.
John C. Hull. 2011. Options, Futures and Other Derivatives (8th ed.). Pearson.
Gordon Inggs, Shane Fleming, David B. Thomas, and Wayne Luk. 2014. Is high level synthesis ready for
business? A computational finance case study. In International Conference on Field-Programmable
Technology (FPT 2014). 12–19. DOI:http://dx.doi.org/10.1109/FPT.2014.7082747
Gordon Inggs, David B. Thomas, George Constantinides, and Wayne Luk. 2015. Seeing Shapes in Clouds: On
the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service. In International
Workshop on FPGAs for Software Programmers (FSP 2015).
Jonathan G. Koomey, Stephen Berard, Marla Sanchez, and Henry Wong. 2011. Implications of historical
trends in the electrical efficiency of computing. IEEE Annals of the History of Computing 33, 3 (Mar
2011), 46–54. DOI:http://dx.doi.org/10.1109/MAHC.2010.28
J.E. Stone, D. Gohara, and Guochun Shi. 2010. OpenCL: A Parallel Programming Standard for
Heterogeneous Computing Systems. Computing in Science & Engineering 12, 3 (May-June 2010), 66–73.
DOI:http://dx.doi.org/10.1109/MCSE.2010.69
Anson H.T. Tse, David B. Thomas, K. H. Tsoi, and Wayne Luk. 2011. Efficient reconfigurable
design for pricing asian options. SIGARCH Comput. Archit. News 38, 4 (Jan. 2011), 14–20.
DOI:http://dx.doi.org/10.1145/1926367.1926371
Zheng Wang, Dominik Grewe, and Michael F. P. O’Boyle. 2014. Portable mapping of data parallel programs
to OpenCL for heterogeneous systems. ACM Transactions on Architecture and Code Optimization 11,
4 (Dec 2014), 1–10.
