Optimising financial computation for reconfigurable hardware by Jin, Qiwei
Imperial College London
Department of Computing
Optimising Financial Computation for
Reconﬁgurable Hardware
Qiwei Jin
Submitted in part fulﬁlment of the requirements for the degree of
Doctor of Philosophy in Computing of the Imperial College
December 2013

Declaration
This thesis is a presentation of my original research work. The contributions of others are involved,
every effort is made to indicate this clearly in the references to the literature and acknowledgment of
collaborative research.
Signature: ...........................................
Date: ...................................................
i
ii
Copyright Declaration
The copyright of this thesis rests with the author and is made available under a Creative Commons
Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or trans-
mit the thesis on the condition that they attribute it, that they do not use it for commercial purposes
and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must
make clear to others the licence terms of this work.
iii
iv
Abstract
This thesis proposes novel methodologies for design, optimisation and generalisation of reconﬁg-
urable hardware based ﬁnance computation. The applications of the proposed methodologies to nu-
merical methods which are commonly used in the ﬁnance industry, such as Monte Carlo and Finite
difference are studied in detail. These studies show reconﬁgurable hardware can effectively improve
performance and energy efﬁciency in ﬁnance computation. There are three contributions. First, an
application independent Monte Carlo framework for interest rate derivatives payoff evaluations based
on the HeathJarrowMorton (HJM) mathematical Framework. By identifying three levels of functional
specialisations in the model, the framework is able to retain good performance while supporting mul-
tiple applications. In addition, a process is proposed for the Monte Carlo framework to identify the
optimal reduced precision data representation, in order to utilise hardware resource better and retain
output numerical accuracy. The automatically generated Field-Programmable Gate Array (FPGA)
implementations show signiﬁcant speedup and energy saving over comparable Central Processing
Unit (CPU) and Graphical Processing Unit (GPU). Second, a novel framework for accelerating op-
tion payoff evaluation based on ﬁnite difference method. The parallelism of the proposed architectures
is exploited based on two levels of computational granularities. The implementations are generated
based on a high level description. Signiﬁcant speedup and energy savings are archived comparing our
FPGA designs over both CPU and GPU designs. Third, a novel performance optimisation process
based on dynamic reconﬁguration for stencil computation. By optimally adjusting the underlying
numerical procedure and making use of carefully chosen coefﬁcients for constant multipliers, both
the hardware resource consumption per kernel and the amount of computation needed per problem
are reduced, and the numerical accuracy requirements are also met. Signiﬁcant speedup is shown by
comparing the optimised dynamic design with the unoptimised dynamic design and the original static
design.
v
vi
Acknowledgements
I would like to express my greatest gratitude to my supervisors Professor Wayne Luk and Dr. David
Thomas. It is because of their help, support and patience that my doctoral thesis can be ﬁnished.
Their wise guidance and advice are what is directing me forward in my research towards this thesis.
I sincerely thank Dr. Abib Bocresion, Dr. Tony Chau and Professor Wayne Luk for generously spon-
soring my PhD. In addition Dr. Hans-Peter Bermin and Dr. Stephen Weston devoted their valuable
time to provide me with invaluable insights and guidance.
I would like to tribute special thanks Dr. Tobias Becker for providing ideas and guidance for Chapter 5.
He helped deﬁne the original formulations in Section 5.3. In addition I would like to thank Mr. Gary
Chow Chun Tak for the inspiring discussions we had during our PhDs, and Miss Diwei Dong for
tutorials over mathematical models and numerical methods.
I also express my gratitude to fellow colleagues in Custom Computing Group of Imperial College
London: Dr. Kuen Hung Tsoi, Dr. Chi Wai Yu, Dr. Adrien Le Masle, Dr. Brahim Benkaoui, Dr. Van
Fu, Prof. Qiang Liu, Dr. Timothy Todman, Mr. Philip Potter and Mr. Xinyu Niu for their time of
discussions and experience sharing.
Special thanks to Mr. Nick Ng for proofreading this thesis.
The support of UK EPSRC, FP7 EPiCS and REFLECT projects, the HiPEAC NoE, MAXELER
Technologies, Celoxica and Xilinx are gratefully acknowledged.
vii
viii
Dedication
To my parents: –
who bring me to this world.
To my grandparents: –
who take care of me in my childhood.
To my wife: –
who accompanies me when I am alone.
To my friends: –
who grow up with me and take care of my family when I am away.
ix
x
Terminologies
ASIC - Application Speciﬁc Integrated Circuit
CPU - Central Processing Unit
Customisation - That format, locality and execution of data can be customised to an
application in reconﬁgurable hardware
DSP - Digital Signal Processor
EFD - Explicit Finite Difference
EFD grid setting - Such setting speciﬁes the numbers of price steps and time steps for an
EFD grid
Financial computing - The analytics and computing applications applied in the ﬁnance industry
FPGA - Field Programmable Gate Array
Generalisation - That designs in reconﬁgurable hardware must be general enough to
support a set of domain speciﬁc problems, and that the designs must be
ﬂexible enough to cope with possible requirements change in the future
in a timely manner
GPU - Graphics Processing Unit
HDL - Hardware Description Language
I/O - Input / Output
LUT - Look Up Table
MC - Monte Carlo
“Nice” constants - The constants that yield small ﬁxed point constant multipliers, measured
in number of consumed LUTs, compared to the largest constant
multiplier in the same number format
Number of
computational steps
- The number of data elements to be processed in a ﬁnancial computation
problem
Number of price steps - The number of discretised points in the asset price space
Number of time steps - The number of discretised points in the time space
PDE - Partial Differential Equation
RAM - Random Access Memory
VHDL - Very High Speed Integrated Circuit Hardware Description Language
xi
xii
Contents
Declaration i
Copyright Declaration iii
Abstract v
Acknowledgements vii
Terminologies xi
Contents xiii
List of Tables xix
List of Figures xxi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Demand from Finance Industry . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Evolution in Computational Technologies . . . . . . . . . . . . . . . . . . . 3
xiii
xiv CONTENTS
1.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Prelude . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Objectives and Related Chapters . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Research Approach and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Thesis Organisation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Background 13
2.1 Payoff Evaluation of Financial Options . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Asset Price Dynamics and Payoff Evaluation . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Black Scholes PDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Exotic Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Interest Rate Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 HJM Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Interest Rate derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Finite Difference Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 Monte Carlo Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
CONTENTS xv
2.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Computational Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.1 CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.2 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.3 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Customisation and Generalisation in Reconﬁgurable Hardware . . . . . . . . . . . . 33
2.7 Optimisation Techniques for Reconﬁgurable Hardware . . . . . . . . . . . . . . . . 34
2.7.1 Bitwidth Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7.2 Dynamic Reconﬁguration . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8 Hardware Description Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3 Multi-level Customisation 42
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 HJM Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Multi-level Customisation Framework . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Application Specialisation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
xvi CONTENTS
4 Explicit Finite Difference Method 64
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Option Payoff Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.1 Categorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.2 Existing Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.3 The Finite Difference Method . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Parallel Architecture Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 Categorisation of EFD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Theoretical workﬂow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.3 Hardware Independent Architecture . . . . . . . . . . . . . . . . . . . . . . 79
4.3.4 Design Automation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Implementation of the Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.1 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.2 GPU Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5.1 Computational Performance Analysis . . . . . . . . . . . . . . . . . . . . . 96
4.5.2 Energy Consumption Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.5.3 Parallel Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5.4 Performance Trend Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
CONTENTS xvii
5 Dynamic Constant Optimisation 109
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Explicit Finite Difference Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.3 Dynamic Constant Specialisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.1 Specialisation Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.4 Optimising Dynamic Constant Specialisation . . . . . . . . . . . . . . . . . . . . . 121
5.4.1 General Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4.2 Optimisation approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.5.1 Dynamic Constant Specialisation . . . . . . . . . . . . . . . . . . . . . . . 132
5.5.2 Optimising Dynamic Constant Specialisation . . . . . . . . . . . . . . . . . 134
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6 Conclusion and Future Work 140
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2 Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.2.1 Fulﬁll Computational capacity and Energy Efﬁciency Demand from the Fi-
nance Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.2.2 Novel Perspectives in Optimising Acceleration Techniques Based on Recon-
ﬁgurable Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.2.3 Generality and Performance for Finance Applications . . . . . . . . . . . . . 144
xviii CONTENTS
6.2.4 Further Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.1 Finite Difference Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3.2 Monte Carlo Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.3.3 Financial Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Bibliography 151
List of Tables
2.1 General comparison between different computational devices. . . . . . . . . . . . . 28
3.1 Parameters in our framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Explanation of input and output ﬁles in our framework as illustrated in Figure 3.3. . . 49
3.3 Volatility structures used in the HJM framework . . . . . . . . . . . . . . . . . . . . 49
3.4 Example interest rate products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5 Floating Point Resource Comparison: wE = 8, wF = 53 (double precision) and
wE = 8, wF = 17 (reduced precision) on a Virtex-6 SX475T FPGA. Note that the
numbers are based on place and route results . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Fixed Point Resource Comparison: wI = 9, wF = 23 (high precision ﬁxed point)
and wI = 9, wF = 19 (reduced precision ﬁxed point) on a Virtex-6 SX475T FPGA,
compared to the corresponding wE = 8, wF = 17 ﬂoating point kernels in Table 3.5.
Note that the numbers are based on place and route results . . . . . . . . . . . . . . 60
3.7 Comparison of MC simulations using double precision GPP(SW), reduced precision
FPGA and single precision GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 Various option types and their corresponding payoff evaluation methods . . . . . . . 66
4.2 Existing PDE solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
xix
xx LIST OF TABLES
4.3 Example Option Pricing problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Resource consumption of Solutions with a single Fine Core. . . . . . . . . . . . . . 96
4.5 European, American and Asian option performance comparison between various hard-
ware platforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 Speed and energy comparison grid for single precision American option implementa-
tions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.7 Performance Comparison Between Different Conﬁgurations of up to 32 Fine Cores.
Note that the FPGA performance reported here is based on the place and route result. 108
4.8 Comparing predicted result with real result from Table 4.4 . . . . . . . . . . . . . . 108
4.9 Resource Consumption for Each Floating Point Operators . . . . . . . . . . . . . . . 108
5.1 Cycle time and area of arithmetic operators. Area is measured in LUT/ﬂip-ﬂop pairs. 132
5.2 Cycle time, area and number of processing elements for the static and dynamic core.
Shown are the estimated values from our model (est.) and values for a real design. . . 132
5.3 Estimated speedup of the dynamic design using (8,46) ﬁxed point arithmetic. The
estimates are based on various conﬁguration times tr. . . . . . . . . . . . . . . . . . 133
5.4 Comparison between the two approaches discussed in Section 5.4.2. The target Eu-
ropean option has the following κ: (70, 70, 0.05, 1.0, 0.3). Note that the LUTs result
is based on place and route results . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.1 Summary of the key results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
List of Figures
1.1 Thesis organisation structure. Note that the grey boxes and the white boxes indicate
the related objectives and contributions for each chapter respectively. . . . . . . . . . 10
2.1 Payoff of a knock-in barrier put option. . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 A rough categorisation of popular numerical methods. . . . . . . . . . . . . . . . . . 23
2.3 A high level view of NVIDIA GPU architecture. . . . . . . . . . . . . . . . . . . . . 31
2.4 A typical CUDA execution strategy. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 A high level architecture of an FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 The evolution of the forward rate curve from t = 0 to t = tmax in one Monte Carlo
path, giving ti = Ti = ti + T ′i−1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Proposed framework for the HJM model, dark grey indicates stable part in the kernel,
while white indicates ﬂexible parts. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Proposed workﬂow for our Monte Carlo framework. Refer to Table 3.2 for details
about the input and output ﬁles (numbered). . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Domain speciﬁc language Example 1. . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Domain speciﬁc language Example 2, extending Example 1. . . . . . . . . . . . . . 53
3.6 p-value in log scale for Bond Option based on ﬂoating point number representation. . 56
xxi
xxii LIST OF FIGURES
3.7 p-value in log scale for Swaption based on ﬂoating point number representation. . . . 57
3.8 p-value in log scale for Bond Option based on ﬁxed point number representation. . . 58
3.9 p-value in log scale for Swaption based on ﬁxed point number representation. . . . . 58
3.10 Mean and standard deviation of p-value of 10 randomly generated bond options based
on the experiment illustrated in Figure 3.6. Note that the left y axis is for the mean
of p-value and it is in log scale, and the right y axis is for the standard deviation of
p-value and it is in normal scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 A 5 × 6 ﬁnite difference grid, the grey elements show an EFD stencil consists of vi,j
and the three values it depends on at time step i and asset price step j. Note that
(x)+ ≡ max(x, 0) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Explicit Finite Difference Grid for Asian Option Pricing. . . . . . . . . . . . . . . . 73
4.3 Explicit Finite Difference Method: Four Categories. . . . . . . . . . . . . . . . . . . 75
4.4 Workload distributions of the four explicit ﬁnite difference categories in a heteroge-
neous system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5 One column view of the grid shown in Figure 4.2 at time step i, and the original
column is split into NM equally sized sub-columns each with size Msub = J + 2.
This entire column is updated along the time line, synchronisation occurs between
adjacent sub-columns just before stepping from time i to i− 1. . . . . . . . . . . . . 78
4.6 Generic architecture for computing the ﬁnite difference model. . . . . . . . . . . . . 80
4.7 Proposed realisation of ﬁne-grained parallelism, based on Figure 4.5. . . . . . . . . . 82
4.8 Simple inwards interleaving accessing pattern working with a pipeline. . . . . . . . . 83
4.9 A detailed architecture showing how intra-column data dependency is handled in the
EFD option solver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
LIST OF FIGURES xxiii
4.10 Generating Coarse Cores by customising the number of Fine Cores targeting a given
ﬁnite difference model. DAG stands for Directed Acyclic Graph. . . . . . . . . . . . 85
4.11 Example architectural design for the block Fine Core in Figure 4.9 and Figure 4.12.
The solid black boxes denote inputs and output. . . . . . . . . . . . . . . . . . . . . 86
4.12 Detailed architecture for the connection between Data Module and a Fine Core. The
Fine Core is automatically generated from an update function shown in Table 4.3. . . 87
4.13 The data ﬂow of the hardware implementation on FPGA for evaluating American
option payoffs. Note that for clarity the data module in this ﬁgure is represented in
partial of what is shown in Figure 4.12, although the actual implementation follows
Figure 4.12. The registers in grey hold constant parameters for an option; the bold
squares are ports connecting to peer Fine Cores to swap overlapped elements. . . . . 92
4.14 The register states of Figure 4.13 in an arbitrary node evaluation during an EFD run. 93
4.15 Energy consumption against computational time to compute 6K×30K single preci-
sion American option grids on different devices. . . . . . . . . . . . . . . . . . . . . 99
4.16 Measured execution times of portfolios with up to 64 American Options, solved by
designs under different conﬁgurations (each containing 32 Fine Cores). . . . . . . . 102
4.17 Estimated trend of resource consumption of the single precision European Option
solver implementation in Log scale, based on result in Table 4.4. Note that the dotted
lines are the amount of each resource available on a Virtex-4 VLX160 FPGA. . . . . 104
5.1 A 5 × 6 ﬁnite difference grid, the grey elements illustrate a three-element stencil,
with the value of fi,j depending on three other values in time step i + 1. Note that
(x)+ ≡ max(x, 0) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.2 Generic architecture for computing the ﬁnite difference model, this Figure also ap-
pears in Figure 4.6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3 Hardware architecture of the block Fine Core in Figure 5.2 for pricing American
Options. The coefﬁcients α, β and γ are constant for a particular option to be priced.
Solid black boxes denote registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4 Dynamic EFD kernel for American option pricing. The reconﬁgurable area is marked
in grey. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5 Relationship between the stencil parameters and Δt and ΔZ. Note that user can vary
Δt and ΔZ to obtain different stencil parameters. . . . . . . . . . . . . . . . . . . . 125
5.6 Relative error against Δt and λ, where λ = ΔZ/Δt. Note that dt and Δt are used
interchangeably here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.7 The relationship between values of the coefﬁcients {α, β, γ} and λ (left y-axis),
where λ = ΔZ/Δt; and the corresponding relative error between the EFD result
and the Black Scholes formula result (right y-axis). Note that Relative Error =
|EFDResult−BlackScholesResult|; and in this case the European option has the
following parameters: S = 70, K = 70, t = 3.0, r = 0.05, σ = 0.9,Δt = 0.01. . . . 128
5.8 Estimated and real performance of the static and reconﬁgurable American option pric-
ing design for various numbers of coarse cores. Dynamic1 is based on (8,46) ﬁxed
point numbers; Dynamic2 is based on (7,16) ﬁxed point numbers. Note that the es-
timated performance is calculated by the method proposed in Section 5.3, and real
performance is based on place and route results. . . . . . . . . . . . . . . . . . . . . 135
5.9 Comparison of LUT usage per kernel between the standard and optimised EFD kernel,
based on place and route results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.10 Performance comparison of the optimised and unoptimised American option pricing
design for various numbers of coarse cores. Note that this is based on place and route
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
xxiv
6.1 A splined volatility curve at expiry time t, as part of a volatility surface at a particular
tenor T . Note that the entire volatility cube is a four dimensional hypercube consists
of strike price, tenor, time to maturity and volatility. . . . . . . . . . . . . . . . . . . 150
xxv
xxvi
Chapter 1
Introduction
1.1 Motivation
The ﬁnance industry is a place where cutting edge technologies are being applied in practice. Al-
though in the past two decades the ﬁnance industry has gone from blooming to recessionary, the
constant demand for higher computational capacity has never stopped. This demand has been one of
the major drives for innovations in computational technologies.
This thesis meets the demand in the following way: Firstly, designing reconﬁgurable hardware ar-
chitectures based on multi-level optimisation provides both generality and performance which are
demanded from the ﬁnance industry for adopting reconﬁgurable hardware to its complex infrastruc-
tures. Secondly, the general purpose ﬁnite difference and Monte Carlo architectures described in this
thesis provide viable solutions to fulﬁll the high computational capacity and high energy efﬁciency
demanded by the industry. Lastly, the design optimisation methodologies proposed, including optimi-
sations based on reduced precision data representation and dynamic constant reconﬁguration, provide
novel perspectives in accelerating algorithms using reconﬁgurable hardware.
1
2 Chapter 1. Introduction
1.1.1 Demand from Finance Industry
Before 2008, the ﬁnance industry sought innovation in the complexity of its products. For example,
structured products such as Collateralized Debt Obligations (CDOs) [1] are invented which turns a
portion of high risk credit products into investment-grade securities [2]. Complex products usually
have no closed form solution, therefore computationally intensive numerical methods such as Monte
Carlo methods and ﬁnite difference methods are required to evaluate them.
Ever since the ﬁnancial crisis in 2008, the ﬁnancial companies have been demanding even higher
computational power to understand their risk positions in the market, and to meet regulatory require-
ments. While vanilla products have been more favourable to complex products, legacy portfolios still
exist and the inherent complexity of these portfolios has become a computational burden when their
risk positions are calculated daily or even at real time. On the other hand, vanilla products are traded
with increasing volumes [3]. Extra computational power is required to handle the ever increasing
number of trades efﬁciently.
In the modern ﬁnance industry, the evaluation times for ﬁnancial products are crucial, for both risk
control and market making purposes. As the demand for computation increases [4], large computer
farms are built to cope with computational intensive tasks such as valuation and risk calculation for
portfolios and complex products under different scenarios. It is in the industry’s interest to gain com-
petitive advantage by reducing pricing latency and increasing pricing throughput, in a cost effective
manner.
The sustainability of the ever growing energy consumption to run these farms on a 24/7 basis has
also become a major cost and environmental issue. A study has estimated that 18,118 billion kWh
of electricity was consumed by data centers alone on this planet in year 2010, including cooling
and power distribution losses [5]. With the number of CPU cores doubling everything year in large
investment institutions1, increasing computational capacity by installing more traditional servers is
no longer a sustainable solution.
As a result, there is a demand from the ﬁnance industry for a sustainable computational solution
1Private correspondence with J.P. Morgan.
1.1. Motivation 3
which is both energy efﬁcient and capable of delivering high computational throughput, and the FPGA
technology is a good ﬁt to meet this demand.
1.1.2 Evolution in Computational Technologies
Since 1959, Computational technology had been driven by Moore’s Law [6], which indicates that
the number of transistors on a chip doubles approximately every two years. Before 2005, single core
CPUs beneﬁted fromDennard scaling [7] which allows their performance to grow exponentially every
year. After 2005, Dennard scaling ended [8] though Moore’s Law still stands, CPUs gain performance
by increasing the number of CPU cores instead of the working clock frequency. However, increasing
the core counts on a chip leads to “dark silicon” [9], which is a phenomenon happens when the
power density of a chip is too high to allow all its components to be powered at once. Due to this
limitation, it is predicted that the processors in 2024 will be about 7.9 times faster than the processors
in 2008, which is far away from the 32 times prediction according toMoore’s law. All these challenges
have driven the computational technology to be more efﬁcient and diversiﬁed, with each component
optimised for its task [10].
One possible answer is to use processors or accelerators that are optimised for a particular application.
By specialising processors or accelerators, their execution efﬁciency can be greatly improved com-
pared to general purpose processors based on the Von Neumann architecture [11]. However, there is
usually a trade-off between specialisation and generality, it is up to the system designer to decide the
right combination of the two and based on which to ﬁnd an optimal solution for a particular problem
domain. For problem domains such as ﬁnance computation, there is great research interest to spe-
cialise accelerators for a subset of problems in the problem domain, while allowing enough generality
in the specialised accelerators so that they are also applicable to the variants of these problems.
1.1.3 Summary
From the above discussion it can be seen that the application of reconﬁgurable hardware in ﬁnance
industry is a combinatorial result of the demand of high performance and high efﬁciency computa-
4 Chapter 1. Introduction
tional solution from the industry and the availability of new technologies which diverts us from the
traditional performance gain model relying on Dennard scaling. This thesis is going to propose some
novel methodologies for design, optimisation and generalisation of reconﬁgurable hardware based
ﬁnance computation. In the next section we shall outline the objectives of this thesis.
1.2 Objectives
Financial computing refers to the analytics and computing applications applied in the ﬁnance in-
dustry. Due to the obscure and volatile nature of the market, ﬁnancial computing is usually both
complex and time-critical. Financial computing for reconﬁgurable hardware involves using reconﬁg-
urable hardware as accelerators to improve the performance of ﬁnancial computing, so as to increase
computational speed, reduce power consumption [12], and create new business for the ﬁnance indus-
try [13].
1.2.1 Prelude
Reconﬁgurable hardware such as Field-Programmable Gate Arrays (FPGAs) have applications in
traditional ﬁelds such as ASIC prototyping [14]; as well as new ﬁelds such as network processing
and communication [15], cryptography [16] and bioinformatics [17], which are usually memory and
computationally intensive. FPGA-based systems have demonstrated beneﬁts in processing power and
energy efﬁciency over systems in other areas such as physical simulation [18] and Markov Chain
Monte Carlo based problems [19], amongst which ﬁnance computation is one of the more intensively
studied areas [20] [21], where the industry is heavily involved [12]. For example, FPGA based sys-
tems are adopted in production environment of large investment banks such as J.P. Morgan Chase to
rapidly compute value and risk for credit derivatives portfolios [22].
The main advantage of incorporating an FPGA based system comes from the customisability and
reconﬁgurability in its architecture. Customisability allows the use of tailored data formats for a par-
ticular task, so as to obtain higher performance over standard IEEE data formats such as single and
1.2. Objectives 5
double precision ﬂoating point numbers; reduced precision custom data formats usually allow datap-
aths to run at higher clock frequencies compared to datapaths with standard data formats. Customised
datapaths also consume less hardware resources, allowing higher degree of parallelism given the same
amount of reconﬁgurable hardware resource.
Reconﬁgurability means datapaths can be reconﬁgured to perform different tasks at run time, allowing
less general components to be adopted in the design. Less general components, for example constant
operators such as constant multipliers, usually yield higher clock frequency and lower hardware re-
source consumption compared to general purpose components, as a result, higher level of parallelism
and higher performance can be achieved.
Although customisability and reconﬁgurability can yield higher performance, they are not without
side effects. For example, reduced precision custom data formats affect the numerical accuracy of
the results; and run-time reconﬁgurations have overhead which can offset the performance gain. It is
therefore important to ﬁnd the optimal way to make the best use of customisability and reconﬁgura-
bility for reconﬁgurable hardwares.
Graphical Processing Units (GPUs) are also used extensively for high performance computing. The
main advantage of GPUs is the massive level of parallelism they can provide. Compared to CPUs
which typically have 2 to 16 cores per chip, the latest GPUs have thousands of cores. For instance,
an NVIDIA Kepler GK110 GPU has 2880 IEEE 754 single precision cores or 960 IEEE 754 double
precision cores [23]. Although being less energy efﬁcient compared to reconﬁgurable hardware,
GPUs can provide signiﬁcant speedup for various applications over CPUs, with examples in ﬁnance
computing such as [24].
Numerical methods are used for the payoff evaluation of ﬁnancial derivatives. Popular numerical
methods include tree-based (binomial, trinomial and multinomial trees), ﬁnite difference, numerical
integration, Fast Fourier Transform, and Monte Carlo. Monte Carlo methods differs from the rest in
the way that it starts from the spot price and work forwards in the time line to the derivative expiry
using a large number of randomly generated paths; while the rest works from expiry and propagates
derivative prices backwards in time to the present.
6 Chapter 1. Introduction
1.2.2 Objectives and Related Chapters
The objectives of this thesis are based on the aforementioned circumstances:
1. Customisation in multiple levels: reconﬁgurable architectures must be designed in a way that
the design complexity is hidden by multiple customisation levels, so that complex designs can
be extended easily for new functionality, without signiﬁcant adverse impact on performance.
A domain speciﬁc language must be provided, so that the architecture is usable by users who
are not hardware experts. The result must be quantiﬁed in terms of performance and energy
efﬁciency comparison over different computational platforms.
2. Architectural generalisation for ﬁnancial payoff valuation: the architectures should be ﬂex-
ible enough to support the payoff evaluation of more than one type of derivatives. Numerical
methods such as lattice based methods and Monte Carlo methods must be supported in a way
that as many derivative types as possible are supported by their corresponding architecture.
Optimisation techniques must be applied based on the underlying numerical method and the
accelerator. A detailed comparison in terms of performance and energy efﬁciency must be
carried out between different hardware platforms.
3. Design optimisation based on customisability and reconﬁgurability: the use of custom nu-
merical representation and reconﬁgurable constant operators must be optimised systematically.
A procedure to analyse the error introduced by custom numerical representations so as to search
for an optimal numerical representations must be provided. Similarly, a numerical algorithm
based procedure to automatically generate optimised constant components for run time recon-
ﬁgurable datapaths must be provided. The effectiveness of using custom numerical representa-
tion and run time reconﬁguration must be quantiﬁed in terms of performance gain.
In this thesis, ﬁrstly, a customisable Monte Carlo framework for automated generation of highly
efﬁcient curve based payoff evaluation accelerator, based on the Heath-Jarrow-Morton (HJM) math-
ematical framework is proposed [25] (Chapter 3, fulﬁlls objectives 1 and 2). By identifying three
levels of functional specialisations in the mathematical framework, a manually optimised component,
1.2. Objectives 7
a templated component and a programmable component are adopted in the hardware framework, to
retain good performance while supporting multiple applications. At the same time, a procedure to
search for the optimal custom data format is proposed to improve performance further.
Secondly, a novel reconﬁgurable hardware architecture to accelerate explicit ﬁnite difference method
for ﬁnance derivatives payoff evaluations is presented [26, 27] (Chapter 4, fulﬁlls objective 2). Ex-
plicit ﬁnite difference method is categorised in terms of the change of stencil coefﬁcients over space
and time; and each category is discussed in terms of workload distribution between the host and the
accelerator. Parallel scalability is also discussed in depth to make use of ﬁne grained and coarse
grained parallelism. Furthermore, the FPGA designs are compared to the corresponding GPU and
CPU designs: the performance and energy efﬁciency of each design are studied and compared in
depth.
Finally, a novel methodology to optimise the upper bound of reconﬁgurable area as well as reducing
the amount of computation per problem [28, 29] (Chapter 5, fulﬁlls objective 3). The proposed opti-
misation methodology adjusts stencil based numerical procedures from algorithm level to make use
of hardware resource friendly constants which yields constant multipliers consuming less hardware
resources, so as to minimise the amount of hardware resource consumption per kernel and the amount
of computations per problem. The performance gain is studied by comparing optimised dynamic
designs to unoptimised dynamic designs.
1.2.3 Summary
This section illustrates the objectives of this thesis and the chapters which fulﬁll those objectives. In
particular, Chapter 3 covers customisation in multiple levels and architectural generalisation; Chap-
ter 4 discusses architectural generalisation in detail; and Chapter 5 addresses design optimisation
based on customisability and reconﬁgurability. In the next section we discuss the research approach
and contributions.
8 Chapter 1. Introduction
1.3 Research Approach and Contributions
This research focuses on a generalisation approach for ﬁnancial computing on reconﬁgurable hard-
ware. As a result the architectures in Chapter 3 and Chapter 4 can be used to support a wider range
of ﬁnancial derivatives. The concept of multi-level optimisation and the optimisation technique in
Chapter 3, and the methodology to allow parallel scaling based on ﬁne and coarse grained parallelism
in Chapter 4, are general enough to be applied to problems beyond ﬁnance computation. In addi-
tion, the concept of multi-level optimisation can be applied to other acceleration technologies such as
GPUs. In Chapter 5 the reconﬁguration based optimisation technique can also be extended to stencil
computations outside the ﬁnance domain.
In each chapter, the proposed methodologies are discussed in depth and their effectiveness are sup-
ported by detailed experimental results. Different computational devices such as CPUs and GPUs are
used to contrast the FPGA results based on measured computing performances and energy consump-
tions where possible. These comparative results are of high research interest as they indicate strength
and weaknesses of each computational platform, which serve an important part in the decision making
process and the analytics infrastructure designing process for ﬁnancial institutions.
Two most widely used numerical procedures for derivatives payoff evaluations are addressed in this
thesis: Explicit Finite Difference method and Monte Carlo method. Explicit Finite Difference method
is general enough to be applied to any PDE based derivatives payoff evaluation problem, and Monte
Carlo method is very useful when a closed form solution does not exist and when the derivative is
based on multiple underlyings hence not computationally feasible to be calculated by other numerical
methods.
The main contributions to the aforementioned objectives are:
• Multi-level customisation and optimisation: An application independent Monte Carlo frame-
work for interest rate derivatives payoff evaluations based on the HJM model. By identifying
three levels of functional specialisations in the model, we allow a manually optimised compo-
nent, a templated component and a user programmable component in the framework, so as to
1.4. Thesis Organisation Structure 9
retain good performance and to support multiple applications. In addition, a process is pro-
posed to identify the optimal reduced precision data representation to ensure better utilisation
of hardware resources without adversely affecting result accuracy. The automatically generated
FPGA implementations are over 10 times faster and over 20 times more energy efﬁcient than a
four core CPU. (Chapter 3)
• Architectural generalisation for ﬁnancial payoff valuation: A new methodology for accel-
erating option payoff evaluation for ﬁnite difference methods. The parallelism of the proposed
architectures is customised and generated automatically from a high level description of the
ﬁnite difference models, and the architectures are highly pipelined and capable of evaluating
the payoff of multiple options concurrently. The approach is illustrated by showing three case
studies involving European options, American options and Asian options. Other option types
such as Barrier and Bermudan options are also supported as they are special cases of the afore-
mentioned three option types. The performance result show the FPGA is over 22 times faster
and 25 times more energy efﬁcient compared to a four core CPU. (Chapter 4)
• Design optimisation based on customisability and reconﬁgurability: An innovative dy-
namic constant optimisation procedure which reduces the upper bound of the hardware resource
consumption per kernel, and the amount of computation per problem. The procedure adjusts
stencil based numerical procedures optimally from the algorithm level to ﬁnd constants coef-
ﬁcients which consume less hardware resources. Such procedure also reduces the amount of
computation needed per problem with given accuracy requirement. The result shows that, an
unoptimised dynamic design with partial reconﬁguration is 5.6 times faster over a static design
on FPGA; and an optimised dynamic design is 50% faster than the corresponding unoptimised
dynamic design in terms of execution time. (Chapter 5)
1.4 Thesis Organisation Structure
This thesis is structured as shown in Figure 1.1.
10 Chapter 1. Introduction
???????????????????????
?????????????????????
???????????
?????????????????????????????????
???????????
????????????????????? ???????
??????????????????????????????????????
???????????????????????????????????????????????????????????
??????????? ???????
???????????
?????????????????????????
?????????????????????????????????? ???????
Figure 1.1: Thesis organisation structure. Note that the grey boxes and the white boxes indicate the
related objectives and contributions for each chapter respectively.
Chapters 1 and 2 introduces this thesis and provides background and related work in ﬁnance computa-
tion for reconﬁgurable hardware. Chapter 3 proposes a novel multi level customisation methodology
to generate HJM based Monte Carlo solvers automatically for reconﬁgurable hardware, and based
on that a precision optimisation process to ﬁnd optimal reduced precision. Chapter 4 presents a new
reconﬁgurable architecture and design methodology for option payoff evaluation based on Explicit
Finite Difference method. Chapter 5 describes an innovative optimisation process for applications
based on stencil computation by optimally adjusting the underlying numerical procedure and making
use of carefully chosen coefﬁcients for constant multipliers. Finally, Chapter 6 concludes the thesis
and discusses possible future works.
1.5 Publications
Journal Papers
1.5. Publications 11
1. Qiwei Jin, David B. Thomas and Wayne Luk, “Customising Parallel Architectures for Explicit
Finite Difference Option Payoff Evaluation Models”, submitted to IEEE Transactions on Par-
allel and Distributed Systems, 2013.
2. Xinyu Niu, Qiwei Jin and Wayne Luk, “A Self-Aware Tuning and Evaluating Finite-Difference
Method for Reconﬁgurable Systems”, To Appear in ACM Transactions on Reconﬁgurable Tech-
nology and Systems, 2013.
3. Qiwei Jin, David B. Thomas, Wayne Luk and Benjamin Cope, “Exploring Reconﬁgurable Ar-
chitectures for Tree-based Option Pricing Models”, ACM Transactions on Reconﬁgurable Tech-
nology and Systems, vol.2, pp.21:1-21:17, Sept. 2009.
Conference Papers
1. Qiwei Jin, Tobias Becker, David B. Thomas and Wayne Luk, “Optimising explicit ﬁnite differ-
ence option pricing for dynamic constant reconﬁguration”, In Proc. International Conference
on Field Programmable Logic and Applications (FPL), pp.165-172, 2012.
2. Xinyu Niu, Qiwei Jin, Wayne Luk, Qiang Liu and Oliver Pell, “Exploiting run-time recon-
ﬁguration in stencil computation”, In Proc. International Conference on Field Programmable
Logic and Applications (FPL), pp.173-180, 2012.
3. Qiwei Jin, Diwei Dong, Anson H.T. Tse, Gary C.T. Chow, David B. Thomas Wayne Luk and
Stephen Weston, “”Multi-level Customisation Framework for Curve Based Monte Carlo Fi-
nancial Simulations”, International Symposium on Applied Reconﬁgurable Computing (ARC),
pp.187-201, 2012 (Best Paper Award).
4. Anson H.T. Tse, Gary C.T. Chow, Qiwei Jin, David B. Thomas and Wayne Luk, “Optimising
Performance of Quadrature Methods with Reduced Precision”, International Symposium on
Applied Reconﬁgurable Computing (ARC), pp.251-263, 2012.
5. Gary C.T. Chow, Anson H.T. Tse, Qiwei Jin, David B. Thomas, Philip Leong, and Wayne Luk,
“A mixed precision Monte Carlo methodology for reconﬁgurable accelerator systems”, In Proc.
12 Chapter 1. Introduction
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pp.57-
66, 2012.
6. Tobias Becker, Qiwei Jin, Wayne Luk and Stephen Weston, “Dynamic Constant Reconﬁgu-
ration for Explicit Finite Difference Option Pricing”, In Proc. International Conference on
Reconﬁgurable Computing and FPGAs (ReConFig), pp.176-181, 2011.
7. Qiwei Jin, David B. Thomas and Wayne Luk, “Exploring reconﬁgurable architectures for ex-
plicit ﬁnite difference option pricing models”, In Proc. International Conference on Field Pro-
grammable Logic and Applications (FPL), pp.73-78, 2009.
Short Papers
1. Xinyu Niu, Thomas C.P. Chau, Qiwei Jin, Wayne Luk and Qiang Liu, “Automating Resource
Optimisation in Reconﬁgurable Design”, In Proc. ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA), (to appear), 2013.
2. Qiwei Jin, David B. Thomas and Wayne Luk, “On Comparing Financial Option Price Solvers
on FPGA”, In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines
(FCCM), pp.89-92, 2011.
3. Qiwei Jin, David B. Thomas and Wayne Luk, “Unifying Finite Difference Option-Pricing for
Hardware Acceleration”, In Proc. International Conference on Field Programmable Logic and
Applications (FPL), pp.6-9, 2011.
4. Qiwei Jin, David B. Thomas and Wayne Luk, “Automated Application Acceleration using Soft-
ware to Hardware Transformation”, In Proc. International Conference on Field-Programmable
Technology (FPT), pp.411-414, 2009.
Chapter 2
Background
This chapter discusses and summarises the background knowledge and related works in reconﬁg-
urable ﬁnancial computing. Section 2.1 gives an overview of basic payoff evaluations of ﬁnancial
options; Section 2.2 discusses the modelling of stock-based underlyings and the payoff functions of
exotic stock options; Section 2.3 introduces the modelling of interest rates and the payoff evalua-
tions of interest rate derivatives; Section 2.4 discusses popular numerical methods used for option
payoff evaluation and related research in the reconﬁgurable hardware domain; Section 2.5 compares
different computational devices applicable to ﬁnance applications and discusses their strengths and
weaknesses; Section 2.6 discusses the trade-off between customisation and generalisation in recon-
ﬁgurable hardware; Section 2.7 presents bitwidth optimisation and dynamic conﬁguration as optimi-
sation techniques for reconﬁgurable hardware and previous work on these topics; and Section 2.8
provides background knowledge of the hardware description languages used in this thesis.
2.1 Payoff Evaluation of Financial Options
An option is a contract that gives the holder (party A) the right to buy or sell some asset (the underlying
asset) to the writer (party B) at a ﬁxed priceK (the strike price). There are two basic types of options:
a call option allows party A to buy the asset from party B, while a put option allows party A to sell
the asset to party B.
13
14 Chapter 2. Background
The important factor is that the option provides a right, not an obligation: party A can decide whether
or not to exercise the option (i.e. to buy or sell the asset at price K). For example, a European put
option will only be exercised at the pre-deﬁned option expiry time t, if K is greater than St, where St
is the market price of the asset at t. When the strike price K is greater than St, party A can buy the
asset from the market at a lower price and immediately sell the asset to realise a proﬁt of K − St. If
K < St then party A will choose to leave the option to expire and will neither gain nor lose money.
In contrast, party B has no control over the option, so in the ﬁrst case B will lose K − St, and in the
second case B will neither gain nor lose. Because party A only stands to gain, and B only stands to
lose, B must be offered some compensation. The point of an option pricing model is to determine
how much A should pay B in order to create the option contract, or equivalently how much A can
charge a third party for the option at a later date.
Formally, the payoff of a European put option at expiry is
vput = max(K − St, 0) (2.1)
and the payoff of a European call option at expiry is
vcall = max(St −K, 0) (2.2)
2.2 Asset Price Dynamics and Payoff Evaluation
It is important to model the movement of the underlying asset price (usually stock price) S, in order
to determine the payoff of an option.
2.2.1 Black Scholes PDE
A common assumption is that the underlying asset price follows a geometric Brownian motion [30]:
dS
S
= μdt+ σdW (2.3)
2.2. Asset Price Dynamics and Payoff Evaluation 15
where W follows a Wiener process, σ is the volatility of the underlying asset, and μ is its expected
rate of return. In a discretised format, the rate of change in underlying asset price S follows a normal
distribution:
ΔS
S
∼ φ(μΔt, σ
√
Δt) (2.4)
where ΔS is the change in the asset price S during time Δt and φ(m, s) is a normal distribution with
mean m and standard deviation s. A common practice is to use risk neutral probability measure,
which assumes that the market price of risk is zero, and as a result, the expected rate of return μ of
the asset becomes risk free interest rate r [30]:
dS
S
= rdt+ σdW. (2.5)
Applying Ito’s lemma [30] to Equation 2.5 we obtain
d ln(S) = (r − 1
2
σ2)dt+ σdW. (2.6)
It follows that
St+Δt = Ste
((r− 1
2
σ2)Δt+σ
√
ΔtN) (2.7)
Where St is the underlying asset price at time t and St+Δt is the underlying asset price at time t+Δt.
Equation 2.7 utilises the lognormal property of the stock price dynamics, in which Δt does not have
to be a inﬁnitesimally small number to ensure the accuracy of the result; it is therefore commonly
used in Monte Carlo methods to reduce the amount of calculation without affecting result accuracy.
It can be seen from Equation 2.1 and Equation 2.2 that the payoff v of an option is a function on S
and t (hence v(S, t)). Applying Ito’s lemma to Equation 2.5 we get:
dv = (rS
∂v
∂S
+
∂v
∂t
+
1
2
σ2S2
∂2v
∂S2
)dt+ σS
∂v
∂S
dW. (2.8)
16 Chapter 2. Background
Eliminating the stochastic term containing dW in Equation 2.8, by creating a delta hedge portfolio,
which longs ∂v/∂S unit of underlying asset and shorts one option, we get the Black Scholes partial
differential equation [31]:
∂v
∂t
+ rS
∂v
∂S
+
1
2
σ2S2
∂2v
∂S2
= rv (2.9)
The Black Scholes PDE is useful for calculating payoffs for stock based options. In addition, the
famous Black Scholes formula, which is a closed form solution to calculate the expected payoff for
European options, is derived by solving equation 2.9.
2.2.2 Exotic Options
Exotic options are derivatives which has features making the payoff evaluation more complex than
vanilla options such as European options. Exotic options usually have no closed-form solution. In
many cases exotic options are path dependent, which means that the payoff of the option on expiry
depends on the entire price path of the underlying stock. Examples of exotic stock Options include
American options, Bermudan options, Asian options and Barrier options.
American options and Bermudan options
American options are one of the simplest kind of exotic option. An American option differs from a
European option in the way that it allows its owner to exercise the option at anytime before expiry.
The early exercise feature gives the owner extra ﬂexibility, hence an American option is worth at least
as much as the corresponding European option.
Similar to American options, Bermudan options also have the early exercise feature. Instead of allow-
ing the owner to exercise the option at anytime before expiry, Bermudan options can only be exercised
at a predeﬁned set of discretely spaced times [30].
2.2. Asset Price Dynamics and Payoff Evaluation 17
Arithmetic Asian Options
The payoff of an arithmetic Asian Option is calculated as the arithmetic average of the underlying
asset prices over the life time of the option [32]. Arithmetic Asian Options makes it harder for the
option writer to manipulate the option payoff by intervening the market price of the underlying asset
at option expiry, as the payoff depends on the entire asset price path over time, instead of just the price
at expiry.
The payoff function of an arithmetic Asian put Option is shown in Equation 2.10, where Si is the
price of the underlying asset at the ith time step.
vput = max
(
K − 1
n+ 1
∑n
i=0
Si, 0
)
(2.10)
Barrier Options
Barrier options are a kind of exotic option whose payoff depends on whether the price S of the
underlying asset reaches a predeﬁned barrier level B during a certain period of time. Barrier options
are less expensive than regular options such as European options due to the barrier limitation. There
are two kinds of barrier options: knock-out options and knock-in options. A knock-out option is active
when the option comes into existence and becomes worthless as the underlying asset price moves
across the predeﬁned knock-out barrier. Similarly, a knock-in option is worthless at the beginning
and becomes active as the underlying asset price moves across the knock-in barrier. Figure 2.1 shows
the payoff of a knock-in barrier put option whose payoff was zero at the beginning and becomes
max(0, K − S) as the underlying asset price crosses the barrier B from underneath. The payoff of a
barrier option also depends on the entire path of the underlying asset price over time; the barrier rule
come into effect as soon as the barrier is crossed [30].
Barrier options can be further categorised into four sub-types depending on the initial status of the
option. An example of up-and-in barrier option is shown in Figure 2.1 where the spot price of the
underlying is below the barrier and the option becomes active (knocked-in) once the asset price moves
up across the barrier level. Similarly, the other three types are: up-and-out, down-and-out and down-
18 Chapter 2. Background
??
??
????
???
??
??
???
??
???
??
???????????????????????????
???????
???????
?????????
??????? ?
??????????????????????????
Figure 2.1: Payoff of a knock-in barrier put option.
and-in.
The barrier B can be either continuously monitored or discretely monitored. In practice the knock-in
or knock-out events for a discretely monitored barrier option are checked on a daily basis or monthly
basis, which makes the options less sensitive to the underlying asset price movements.
The payoff of discretely monitored barrier options with a moving barrier are mathematically difﬁcult
to evaluate, where a different barrier Bi is imposed at a particular time ti. No analytical solution has
been found for discrete barrier options with moving barrier.
2.2.3 Summary
So far we have outlined some basic concepts for option payoff evaluation and ﬁnancial mathemat-
ics. In particular, geometric Brownian motion describes mathematically the fundamental assumption
about asset price movements; the Balck Scholes PDE is derived from the Brownian motion descrip-
tion based on Ito’s lemma and delta hedging. The payoff descriptions of exotic options are then
discussed. The next section introduces the HJM model for interest rate derivatives.
2.3. Interest Rate Dynamics 19
2.3 Interest Rate Dynamics
Interest rate modelling is one of the most important ﬁelds in mathematical ﬁnance research to price
ﬁxed income products. In the past two decades this ﬁeld has evolved from modelling a single instanta-
neous interest rate [33] [34] to modelling the dynamics of an entire forward rate curve (also known as
the full term structure of interest rates) [35]. Curve based modelling has the advantage that all longer
rates are ready to use which makes it easy to derive payoff functions for interest rate derivatives based
on zero bond (also known as zero coupon bond) prices.
2.3.1 HJM Framework
The Heath-Jarrow-Morton (HJM) Framework [35] is a general mathematical framework (a class of
models) for modelling instantaneous forward interest rate curve. It differs from short rate models in
the way that it models the full dynamics of the entire forward interest rate curve, as opposed to a single
point on the curve which is the short rate r(t), where r(t) = f(t, t). The HJM framework describes
the no-arbitrage condition that a term structure model must follow, by identifying the relationship:
B(t, T ) = exp(−
∫ T
t
f(t, u)du) (2.11)
and the zero bond process
dB(t, T )
B(t, T )
= r(t)dt+ σ(t, T )dW (t) (2.12)
where B(t, T ) is the price of a zero coupon bond maturing at time T , as seen from time t. The
equation of the HJM framework is derived from Equations 2.11 and 2.12:
df(t, T ) = μ(t, T )dt+ σ(t, T )dW (t) (2.13)
μ(t, T ) = σ(t, T )
∫ T
t
σ(t, u)du (2.14)
20 Chapter 2. Background
where f(t, T ) is the instantaneous forward rate at time T as seen from time t and 0 ≤ t ≤ T ; σ(t, T )
is the forward volatility column vector of size N , whereN is the number of factors in the framework;
W (t) is a random variable under standard normal distribution. For convenience we call t the time
and T ′ the time offset from time t, with T = t + T ′, therefore f(t, T ) ≡ f(t, t + T ′). The number
of random sources in the forward rate, which is usually determined by empirical results, or numerical
analysis of historical data by principle component analysis [36], deﬁnes the number of factors in the
HJM framework. Models developed according to the general HJM framework are non-Markovian,
therefore they cannot be solved by PDE-based numerical methods. However researches have proven
that it is possible to express an HJM model by a ﬁnite state Markovian system [37] [38]
However, modelling each curve has a complexity of O(n2) in the number of time-steps, compared to
the conventional single-point modelling (such as stock option payoff evaluation) which has a com-
plexity of O(n).
2.3.2 Interest Rate derivatives
An interest rate derivative is a derivative where the underlying asset is “the right to pay or receive
a notional amount of money at a given interest rate” [30]. There are many kinds of interest rate
derivatives being traded in the interest rate derivatives market, we discuss those relevant to this thesis.
Bond and Bond Option
A bond is a debt security under which the bond issuer borrows money from the bond holders. The
issuer may have to pay interest to the holders at a series of predeﬁned times, these are called coupons.
At the maturity date of the bond, the principle is payable to the holder. Zero coupon bond (sometimes
called Zero Bond) is a special type of bond used for the payoff evaluation of interest rate derivatives.
Since only the principle is payable at maturity T , the price of the zero coupon bond B(t, T ) at time
t (t ≤ T ) can be used to calculate the time value between t and T , otherwise known as the discount
factor. Equation 2.15 calculates the payoff of a zero coupon bond, where f(t, u) is the instantaneous
continuously compounded rate contracted at time t for riskless borrowing or lending at time u ≥
2.3. Interest Rate Dynamics 21
t [39].
B(t, T ) = exp
(
−
∫ T
t
f(t, u)du
)
(2.15)
A bond option gives the owner the right but not the obligation to buy or sell a bond at a strike price
K at a maturity time T in the future. Note that two maturity times are involved in this situation: the
option matures at t and the bond at T . The payoff function of a bond call option is the following:
vcall = max(B(t, T )−K, 0). (2.16)
Constant Maturity Swap and its Derivatives
A swap is a contract that exchanges interest rate payments on a predetermined notional principal
between two differently indexed legs over a period of time. A constant maturity swap (CMS) differs
from a regular ﬁxed-to-ﬂoat or ﬂoat-to-ﬂoat swap. Instead of resetting the ﬂoating leg periodically to
LIBOR or other short term rate, CMS resets its ﬂoating leg to a swap rate of ﬁxed maturity, such as a
10-year swap rate. Equation 2.17 shows the payoff function of a CMS.
Y (t, T ) =
1−B(t, T )∑T
a=tB(t, a)
(2.17)
A swaption is an option on a swap. For example, an call option with strike K can be created on a
CMS, leading to the following payoff function, where (x)+ ≡ max(x, 0).
(Y (t, T )−K)+
∑T
a=t
B(t, a) (2.18)
Similarly, an option can be created to speculate the spread between two CMSs with different matu-
rity time T1 and T2 for the ﬂoating leg. The payoff function of a CMS spread option is shown in
22 Chapter 2. Background
Equation 2.19
(Y (t, T1)− Y (t, T2)−K)+ (2.19)
2.3.3 Summary
In this section have discussed the basics for exotic options and interest rate derivatives, which are used
in the content chapters. In the next section we shall discuss numerical methods that are commonly
used in ﬁnance computation.
2.4 Numerical Methods
Numerical methods are applied if a closed-form solution does not exist to evaluate the payoff of a
ﬁnancial instrument. Popular numerical methods include tree-based (binomial, trinomial and multi-
nomial trees), ﬁnite difference, numerical integration, Fast Fourier Transform (FFT), and Monte
Carlo (MC). As shown in Figure 2.2, numerical methods can be roughly divided into two categories:
lattice methods, which work backwards from the exercise time in the future to the current time, using
a lattice which describes the movement of the instrument payoff over space (underlying asset prices)
and time; and Monte Carlo methods which work forward from the current time to the exercise time,
using a large number of random paths. In this section we brieﬂy discuss these methods and their
corresponding hardware acceleration research, then we look into the details of ﬁnite difference and
Monte Carlo methods which are relevant to this thesis.
• Tree-Based Methods discretise time and capture the price variation over time of the underly-
ing asset. The tree-based model is introduced by [40], which is a two-state lattice also known
as the binomial tree method. Variations of tree based methods include trinomial tree (three
state lattice) [41] and multinomial tree [42]. Instead of using two states in the lattice, three
or more states are used to capture random movements of the underlying asset price. The re-
search in accelerating tree-based method includes my undergraduate ﬁnal year project [43] to
2.4. Numerical Methods 23
?????????????????????????
?????????????
????????????????????????
??????????? ?
?????????????????
???????????????????????????????
?? ???????????????????? ?
???? ????????????????? ???
??????????
???????????
???????
??
?????
??
Figure 2.2: A rough categorisation of popular numerical methods.
achieve high throughput for binomial and trinomial trees by a fully pipelined implementation
on reconﬁgurable hardware; a GPU based implementation for binomial trees which processes
multiple time steps in parallel by reducing time dependency in the tree [44]; and a binomial tree
implementation on FPGA with four parallel datapaths [45].
• Numerical Integration Method is sometimes called Quadrature method. By utilising numerical
integration to capture the asset price dynamics, the numerical integration method can be seen as
analogous to a multinomial lattice [46]. It has been demonstrated by [46] that the method can
be used for various options based on the Black Scholes PDE, such as discrete barrier, Bermudan
and American options. FPGA based research for the numerical integration method includes an
implementation [47] and an optimisation technique to ﬁnd optimal data representation format
for reconﬁgurable hardware [48].
• Fast Fourier Transform (FFT) Method is an efﬁcient tool to compute weighted sum of vari-
ables [49] in the form of
wk =
N∑
j=1
e−i
2π
N
(j−1)(k−1)xj, for k = 1...N (2.20)
where x is an input vector of size N , and w is the result vector. The FFT algorithm reduces the
number of multiplications involved from N2 to N log2N . With a given characteristic function,
the payoffs of ﬁnancial derivatives can be written in the above form which allows efﬁcient
computation. Although the method is less obvious mathematically, the application of FFT based
24 Chapter 2. Background
derivative payoff evaluation on reconﬁgurable hardware can be achieved by utilising dedicated
FFT kernels such as [50] to achieve high computational performance.
• Finite Difference Method is used to solve partial differential equations (PDEs). The method is
very popular due to: (a) the method has a sound theoretical and mathematical basis; (b) it is
ﬂexible enough to be applied to many problems in many scientiﬁc domains, such as Laplace
Heat Equation in thermodynamics [51], Maxwell’s Wave Equation in electromagnetism [52]
and the Black-Scholes equation in ﬁnance [30]; (c) it has a regular computational pattern and
can be implemented easily in software [53]. There are existing works to study FPGA accel-
eration of ﬁnite difference methods, such as [54] and [55]. However these works are highly
domain speciﬁc and not generic for ﬁnancial option pricing. In this thesis, a generic hardware
architecture of Finite Difference Method for option payoff evaluation is presented in Chapter 4;
and an automated procedure to optimise the architecture for dynamic constant reconﬁguration
is presented in Chapter 5.
• Monte Carlo Method relies on sampling a large number of randomly generated paths to com-
pute a result with desired accuracy. Since each sampling is independent from the others, the
Monte Carlo Method is easily parallelisable and is suitable for hardware accelerators such as
FPGAs and GPUs. Earlier work involving the Monte Carlo Method include: an architecture
with a pipelined datapath and an on-chip instruction processor has been reported for speeding
up the Brace, Gatarek and Musiela (BGM) interest rate model for derivatives evaluation [20];
an FPGA-based stream accelerator with higher performance than GPUs and Cell processors has
been proposed for evaluating European options [56]; a platform independent domain speciﬁc
language has been invented to produce optimised pipelined designs with thread level paral-
lelism for Monte Carlo simulations from a high level abstraction [21]; and a control variate
Monte Carlo design for Asian options is presented [57]. More recent work tends to tackle more
sophisticated Monte Carlo simulation, such as a generic architecture for a particular Markov
Chain Monte Carlo method to draw sample from multimodal distributions [19], and an im-
plementation for American options evaluation using least-squares Monte Carlo method [58].
Optimisation for customised data format has also been studied, for example an adaptive FPGA
architecture for Monte Carlo simulations is reported which conﬁgures the system number repre-
2.4. Numerical Methods 25
sentation at run time based on the Kolmogorov-Smirnov test [59], and mixed precision optimi-
sation methodology for Monte Carlo simulations has been proposed [60]. However, nothing has
been reported with a systematic approach to generate designs based on a high level description
for sophisticated curve based ﬁnancial Monte Carlo simulations. In this thesis, a multi-level
customisation framework is proposed for curve based Monte Carlo ﬁnancial simulations in
Chapter 3: based on a high level domain speciﬁc language, high throughput implementations
are generated automatically, utilising optimised number representation format.
2.4.1 Finite Difference Method
Finite Difference Method approximates derivatives in partial differential equations by linear combi-
nations of function values at the grid points [61]. There are three different types of approximations
for derivatives: forward difference, backward difference and central difference. In this section we
give example to derive these approximations for ﬁrst order derivatives. Higher order derivatives can
be approximated in a similar way.
Assuming the ﬁrst order derivative is ∂v/∂x, where v is the unknown function. To approximate the
derivative, variable x is discretised into N equally sized intervals in range [Xmin, Xmax], where Xmin
andXmax are the minimal value and maximum value of x. The size of intervalΔx = (Xmax −Xmin)/N .
For simplicity we have vi = v(xi) and the nth order derivative at xi to be (∂nv/∂xn)i.
The approximation is based on Taylor Expansion:
v(x) =
∞∑
n=0
(x− xi)n
n!
(
∂nv
∂vn
)
i
(2.21)
Applying Equation 2.21 for xi+1 = xi +Δx and xi−1 = xi −Δx, we get:
vi+1 = v(xi +Δx) = vi +Δx
(
∂v
∂x
)
i
+
(Δx)2
2
(
∂2v
∂x2
)
i
+
(Δx)3
6
(
∂3v
∂x3
)
i
+ . . . (2.22)
vi−1 = v(xi −Δx) = vi −Δx
(
∂v
∂x
)
i
+
(Δx)2
2
(
∂2v
∂x2
)
i
− (Δx)
3
6
(
∂3v
∂x3
)
i
+ . . . (2.23)
26 Chapter 2. Background
Forward difference approximation is calculated based on Equation 2.22:
(
∂v
∂x
)
i
=
vi+1 − vi
Δx
− Δx
2
(
∂2v
∂x2
)
i
− (Δx)
2
6
(
∂3v
∂x3
)
i
− . . . = vi+1 − vi
Δx
+O(Δx) (2.24)
Backward difference approximation is calculated based on Equation 2.23:
(
∂v
∂x
)
i
=
vi − vi−1
Δx
+
Δx
2
(
∂2v
∂x2
)
i
− (Δx)
2
6
(
∂3v
∂x3
)
i
− . . . = vi − vi−1
Δx
+O(Δx) (2.25)
Central difference approximation is calculated as following:
(
∂v
∂x
)
i
=
vi+1 − vi−1
2Δx
− (Δx)
2
6
(
∂3v
∂x3
)
i
+ . . . =
vi+1 − vi−1
2Δx
+O(Δx)2 (2.26)
In this thesis forward difference approximation is used for the time domain and central difference
approximation is used for the asset price domain. A complete list of approximations used in this
thesis is shown as the following:
(
∂v
∂t
)
i
=
vi+1 − vi
Δt(
∂v
∂S
)
i
=
vi+1 − vi−1
2ΔS(
∂2v
∂S2
)
i
=
vi+1 − 2vi + vi−1
(ΔS)2
2.4.2 Monte Carlo Method
Monte Carlo (MC) Method is commonly used to solve scientiﬁc and engineering problems. The
method is based on the central limit theorem and the law of large numbers. The main idea of MC is to
take a large number of random samples of a variable of interest, then take an average of the sampling
results. In a scenario where N mutually independent, identically distributed random numbers Xi are
sampled in a Monte Carlo simulation, where X is the variable of interest, and i = 1, 2, . . . , N ; if the
expected value of X exists and the sample average is X¯ =
∑N
i=1Xi, the weak law of large numbers
2.4. Numerical Methods 27
states that, for ε> 0, the sample average approaches the expected value μ for large N, as shown in
Equation 2.27, where p(x) is the probability of x [62].
lim
N→∞
p
(|X¯ − μ| > ε) = 0 (2.27)
Moreover, if σ is the standard deviation of a normal distribution N(0, σ2), and z is a real number, the
central limit theorem states that the distribution of the standard error is normal when the number of
simulations N approaches inﬁnity, as shown in Equation 2.28 [63].
lim
N→∞
p
(√
N
σ
(X¯ − μ) ≤ z
)
=
1√
2π
∫ z
−∞
e−z
2/2dz (2.28)
This means that the rate of convergence of the MC method isO(√N), and that adding one signiﬁcant
ﬁgure of accuracy requires increasing N by a factor of 100.
2.4.3 Summary
In this section we discussed various numerical methods used in ﬁance computation. In particular
Finite Difference method and Monte Carlo method are illustrated in detail. Both methods are used
commonly in the ﬁnance industry. Generally speaking, Finite Difference method provides good ac-
curacy and convergence rate, however the amount of computation required by the method grows
exponentially with number of variables, therefore it is usually adopted when the number of variables
are small. On the other hand, the Monte Carlo method converges a lot slower compared to Finite
Difference method, but the amount of computation required by the method do not grow as fast with
increasing number of variables. Therefore it is usually used when the number of variables are large.
There are other pros and cons for each method such as the time direction of computation. Finite Dif-
ference method calculates backwards in time, therefore it can handle path dependent payoff functions
easily. Monte Carlo method calculates forward in time, therefore it requires approximation methods
such as least squire to handle path dependency. In practice one need to take into account numerical ac-
curacy, computational complexity, as well as the underlying computational platform to decide which
numerical method is most suitable for the calculation. In the next section we describe and compare
28 Chapter 2. Background
the computational platforms used in this thesis.
2.5 Computational Platforms
This section introduces computational devices commonly used in both research and industry; these
includes CPU, GPU and FPGA. Table 2.1 is a general comparison of these devices in terms of com-
monly used metrics. Amongst these metrics, throughput and energy consumption are two of the more
important ones in inter-device comparisons; they are also application dependent. In this thesis every
effort had been made to obtain a fair comparison between these devices in different case studies of
ﬁnancial computing. Individual performance and energy consumption comparisons are reported in
each chapter, based on these case studies and experimental results.
CPU GPU FPGA
Clock freq. (MHz) high (2000-3000) high (1000-2000) low (100-200)
Power consumption (Watt) high (100-150) high (> 200) low (< 20)
Parallelism (Num. Cores) low (4-8) high (100s) High (depends on the size of FPGA)
Pipelining (Num. clk. cycles) low (< 10) medium (> 10) high (100s)
Reconﬁgurability low low high
Instruction set ﬁxed ﬁxed ﬂexible
Floating-point num. format double / single double / single customisable
Table 2.1: General comparison between different computational devices.
2.5.1 CPUs
Central Processing Units (CPUs) are the most pervasively used computational device. CPUs are de-
signed for serial processing with limited amount of vector processing capacity. The computational
algorithms are coded into programmes which are compiled by compilers into CPU instructions. Mod-
ern CPUs usually have four main steps in an instruction cycle to execute the instructions:
• fetch: the next instruction is fetched from the program memory;
• decode: the decoder interprets the instruction into opcode (operation type) and operands (mem-
ory location, value or other additional information);
2.5. Computational Platforms 29
• execute: the instruction is executed by the relevant function units of the CPU to perform the
actions required by the instruction
• writeback: the result is stored in registers or memory
A management and scheduling unit in CPU is used for branch prediction, instruction ordering and
execution. Although the clock frequency of CPU is high, memory access and the execution cycles are
often the performance bottlenecks. The power consumption of CPUs are also high due to high clock
frequency.
Earlier generations of CPUs usually have only one processing core, and the performance was im-
proved by: (a) increase the running clock frequency (each CPU operation takes less time to complete),
(b) execution optimisation (do more work per clock cycle) and (c) cache improvement (increase the
size of cache and introduce multiple cache levels to fetch data faster). Amongst which (a) is usually
used as the CPU’s performance indicator. Computer programmes can automatically beneﬁt from this
type of performance improvement by just adopting a new CPU platform, with minimal changes to the
software implementation.
This was the case until the power dissipation in the CPU becomes signiﬁcant and the thermal power
becomes unmanageable. Thermal power can be calculated by Equation 2.29, where C is the capaci-
tance of transistors, f is frequency and V is voltage.
Pthermal = CV
2f (2.29)
Although techniques such as clock gating, smaller transistors and lower-voltage structures are used
to save energy, it is hard to improve the performance of a CPU core further by increasing the clock
frequency while keeping its thermal density in a manageable level.
According to Moore’s Law which states that the number of transistors which can be manufactured
on a single die will double every 18 months, an alternative way to improve the CPU performance is
to allow multiple CPU cores with lower clock frequency on the same chip. This however requires
changing in software implementation to make use of the multi-core architecture.
30 Chapter 2. Background
2.5.2 GPUs
Graphics Processing Units (GPUs), over the past few years, have evolved from a ﬁxed-function
special-purpose processor to a full-ﬂedged parallel programmable processor with additional ﬁxed-
function special-purpose functionality [65]. Most silicon area on the GPU chip is dedicated to ﬂoating
point units which include texture, scalar and vector processors for graphics computations. Massive
instruction level parallelism can be achieved thanks to the many-core GPU architecture. Latency of
the operations in GPUs are hidden by thread-level parallelism. For example, threads are grouped into
warps and are executed in batches, since the number of threads are usually much greater than them
number of processing cores, the cores in the GPU are utilised most of the time. The clock frequency
and power consumption of GPUs are usually high.
Recently general-purpose computing on graphics processing units (GPGPU) became popular, as the
application programming interfaces (APIs) and the corresponding programming languages for de-
veloping GPGPU applications are more sophisticated and are easier to use. There are two popular
GPGPU programming ecosystems: CUDA and OpenCL.
Compute Uniﬁed Device Architecture (CUDA) from NVIDIA uses a “C” like programming lan-
guage [66]. The developer programme a special “kernel” function which is executed on each process-
ing core in the GPU. The “kernel” function can access memory and computational elements in the
device. A typical GPU-assisted execution ﬂow has four main steps:
• Copy the data to be processed from host memory to GPU memory.
• Execute kernel function using GPU so that the desired computation is carried out
• Wait for all threads to ﬁnish executing
• Copy the result from GPU memory back to host memory.
Figure 2.3 shows a high level view of NVIDIA GPU architecture. The host can execute kernel func-
tions on the GPU’s Streaming Multiprocessors (SMs) and access the GPU’s global memory via a PCI
Express link directly. Each SM has a number of cores which are basic computational units in GPU.
2.5. Computational Platforms 31
???? ????
???? ????
???? ????
???? ????
??? ???
?????????????
??????????
???????????????????
???? ????
???? ????
???? ????
???? ????
??? ???
?????????????
??????????
???????????????????
???
???? ????
???? ????
???? ????
???? ????
??? ???
?????????????
??????????
???????????????????
?????????????
???
????
???????????
????
Figure 2.3: A high level view of NVIDIA GPU architecture.
The cores in the same SM can communicate to each other via shared memory, which has a typical us-
able size of 16-48 kilobytes per SM (the rest reserved for L1 cache) and an access latency comparable
to registers. SMs communicate via the global memory which is larger in size (2-6 gigabytes) but also
has a much higher access latency (hundreds of clock cycles).
A typical execution strategy of CUDA is shown in Figure 2.4. The entire computation is organised as
a user deﬁned computational grid, which comprises a number of thread blocks. Each thread block is
executed in one SM, as a result, the threads in the block can communicate via shared memory. When
executing, the threads in a thread block are organised into warps which are executed in batches; the
threads in the same warp are guaranteed to be executed in parallel [67].
OpenCL (Open Computing Language) is developed by the Khronos Group which was founded in
2000 by companies including ATI, Intel and NVIDIA [68]. Unlike CUDA which is exclusive to
NVIDIA devices, OpenCL is a more general purpose parallel programming language targeting many
CPUs, GPUs and other platforms from different vendors. It provides a common language, program-
ming interfaces, and hardware abstractions which enable developers to accelerate applications with
task-parallel or dataparallel computations in a heterogeneous computing environment consists of host
CPU and any number of attached OpenCL devices [69]. So far OpenCL has been adopted by Intel,
AMD, NVIDIA and ARM.
32 Chapter 2. Background
???????
?????
?????
???????
??????
?????
???????
??????
?????
???????
?????
?????
???????
??????
?????
???????
??????
?????
??????????????????
??????????????????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????
???????
?????????????????
??????????????????
????????
????????????????
???????????????
?????????????
Figure 2.4: A typical CUDA execution strategy.
Research has been done to compare OpenCL to CUDA. For programming languages, it is usually the
case that generality implies performance penalty. Therefore it is not surprising to ﬁnd out that CUDA
is faster than OpenCL in both data transferring and execution when nearly identical kernels are run on
the same GPU [70]. A similar observation has been conﬁrmed by [71], which claimed that for most
applications, CUDA performs at most 30% better than OpenCL. However, [71] has also shown that
with proper manual tuning to the code, OpenCL can achieve similar performance to CUDA.
2.5.3 FPGAs
Field-programmable gate arrays (FPGAs) are integrated circuits designed to be conﬁgured by the
user after manufacturing. The conﬁguration, which is usually speciﬁed by a hardware description
language (HDL), is synthesized, placed and routed by the FPGA vendor tool chain. A bit-stream
ﬁle is generated by the tool chain to conﬁgure the FPGA device in the end. FPGAs can be used
as alternatives to application-speciﬁc integrated circuits (ASICs) to reduce nonrecurring engineering
cost, due to the ability to update its functionality, and to partially reconﬁgure a design.
An FPGA mainly contains the following components: logic blocks which are programmable logic
components; and routing components which interconnect logic blocks. Figure 2.5 shows a high level
architecture of an FPGA.
2.6. Customisation and Generalisation in Reconﬁgurable Hardware 33
????
?????
?????
????? ???????????????
Figure 2.5: A high level architecture of an FPGA.
2.5.4 Summary
In this section provides a high level picture of how CPU, GPU and FPGA works and a general com-
parison in terms of clock frequency, power consumption and parallelism which is summarised in
Table 2.1. Next section discusses customisation and generalisation in reconﬁgurable hardware.
2.6 Customisation and Generalisation in Reconﬁgurable Hard-
ware
The HiPEAC (European Network of Excellence on High Performance and Embedded Architecture
and Compilation) road map [10] indicates that there is a trade-off between generality and efﬁciency
for computational technologies in general. This is usually true as more work needs to be done per
task by the underlying platform in order to support the extra generality.
34 Chapter 2. Background
For reconﬁgurable hardware this trade-off is a unique advantage over general purpose processors.
Unlike general purpose processors, whose datapaths are ﬁxed, reconﬁgurable hardware can have arbi-
trary datapath tailored for an application. Such customisation includes aspects such as format, locality
and execution of data. We now examine these aspects in the context of ﬁnance applications (customi-
sation).
Data format is important for ﬁnance applications as it determines the accuracy of the ﬁnal result.
Since reconﬁgurable hardware usually provide problem dependent solutions, it is possible to optimise
the data format based on the constraint that the result accuracy requirement is met. This is discussed
in more detail in the next section. Data locality is also vital for the performance of reconﬁgurable
hardware. Due to its long pipelines with latencies of tens of clock cycles, any pipeline drain results in
a signiﬁcant performance penalty. For optimal solution data needs to be organised such that the design
is fully pipelined (temporal parallelism). Data execution involves organising computation in a way
that the maximal parallelism is achieved for a particular reconﬁgurable device (spacial parallelism),
so that the available computational resource is used optimally.
There are however demands from the ﬁnance industry that requires the design to be more ﬂexible (gen-
eralisation). For example, a numerical method should support the payoff evaluation of multiple
derivatives, the problem requirements may change over time, and the problem is usually time crit-
ical. Such requirements indicate that designs in reconﬁgurable hardware must be general enough to
support a set of domain speciﬁc problems, and that the designs must be ﬂexible enough to cope with
possible requirements change in the future in a timely manner.
In this thesis, Chapter 4, Chapter 5 and Chapter 3 discusses generalisation and customisation of
reconﬁgurable architectures.
2.7 Optimisation Techniques for Reconﬁgurable Hardware
The reconﬁgurability of FPGAs differentiates them from other types of accelerators such as GPUs, in
a way that the circuit can be ﬁne-tuned and adapted to a particular application given a speciﬁc user
requirement. As a result, a higher proportion of silicon in an FPGA will be doing useful work towards
2.7. Optimisation Techniques for Reconﬁgurable Hardware 35
the user requirement compared to a GPU or a CPU. Two techniques are commonly used to exploit
reconﬁgurability: (a) customisable data operations and (b) dynamic reconﬁguration. We discuss these
two techniques in this section.
2.7.1 Bitwidth Optimisation
For reconﬁgurable hardware, the ability to adopt customisable data operations such as reduced pre-
cision operations can be exploited to achieve higher performance for many applications. In general
reduced precision number operators for FPGAs have higher clock frequencies and consume less hard-
ware resources and energy, compared to standard IEEE single precision and double precision ﬂoating
point operators. As a result, given the same amount of hardware resources, a higher degree of paral-
lelism can be achieved by reduced precision operators.
However, reduced precision data formats comes at a cost of numerical accuracy. In this thesis the
numerical error caused by using number representation formats of lower accuracy is called ﬁnite
precision error, given that a standard precision is present. Finite precision error is usually caused by
rounding intermediate or ﬁnal values in the datapath. As a result, reducing the number of bits in a
data representation usually increases ﬁnite precision error and reduces result accuracy, and vice versa.
Research has proved that using reduced precision data formats can increase both performance and
energy efﬁciency. For example, by using appropriate word-length for internal variables, a range of
adaptive ﬁlters and polynomial evaluation circuits can achieve on average: 66% reduction in area,
87% in power, and 20% in speedup, over common alternative design strategies [72]. Therefore it is
important to research and make use of bitwidth optimisations for different applications.
General Optimisation Approach
It is a common approach to model both the numerical accuracy of the output result and the area-delay
of a datapath, in terms of the precision of the data formats used in the datapath. Optimisation is
done by minimising the area-delay product while meeting the minimal accuracy requirement. Ex-
isting work include simulation [73], interval arithmetic [74], backward propagation analysis [75],
36 Chapter 2. Background
afﬁne arithmetic [76] and polynomial algebra [77]. Recent work utilise mixed precision ﬂoating point
numbers for collision detection algorithm [78].
Monte-Carlo-Speciﬁc Optimisation Approaches
As discussed in Section 2.4.2, Monte Carlo method is based on the summation of results from a large
number of randomly generated paths. This is intrinsically advantageous to reduced precision data
formats as the ﬁnite precision errors from each path of the Monte Carlo simulation may have different
signs and will cancel out each other in the summation process.
There are two types of approaches to manage ﬁnite precision errors in Monte Carlo Methods. The
ﬁrst approach lets the user choose the appropriate data format in the application, therefore the user is
responsible to determine whether the ﬁnite precision error in the output results meet the requirements.
Examples of this approach include: an architecture for radiative heat transfer simulation which can be
targeted to implementations using different ﬂoating point datapaths [79]; and implementations for a
MultiFactor Gaussian Copula CDO pricing algorithm based on IEEE single/double precision ﬂoating
point numbers as well as integer representation [80].
The second approach lets the user specify the error tolerance level for the ﬁnal result and appropriate
number representation format is chosen by an optimisation procedure to make sure the user speci-
ﬁcation is met in the way that the error bound of the result is smaller than the error tolerance, and
the performance is optimised. One possible method is to obtain the error bound by recording the
maximum relative error of the sampling path, such as [81]. The maximum relative error can also be
calculated analytically using interval arithmetic [82] and afﬁne arithmetic [83]. However, this method
usually over-estimates the ﬁnite precision error as it does not consider the ﬁnite precision error for
each individual path may cancel each other out in the summation process.
It is also possible to use empirical methods to obtain the effect of ﬁnite precision error to the ﬁnal
result. In [84], 100K N(0, 1) normally distributed random samples are generated by reduced preci-
sion random number generator and the normality of the random samples is tested by the chi-square
(χ2) test and the Kolmogorov-Smirnov (K-S) test [85]. The random number generator is then con-
2.7. Optimisation Techniques for Reconﬁgurable Hardware 37
nected to an application with reduced precision ﬁxed point datapath, which has been tested for error
bound separately by range analysis. The error requirement of the output result is assumed to be met
automatically. This assumption may not be valid since the components of the application are tested
separately by different methods, there is no guarantee that the error bound of the combined result is
contained under the error tolerance level.
Methods to adjust data format at runtime have been studied as well. For example, a design based on
statistical analysis is proposed to compare three empirical cumulative distribution functions (EDFs)
computed by many reduced precision datapaths and two high precision datapaths. Kolmogorov-
Smirnov test is used to compute two distance scores between the high precision EDFs and the low
precision EDF. The scores are used to control the reduced precision datapath at runtime to ensure to
error tolerance requirement is met [59].
In addition, a mixed precision ﬂoating point framework has been designed to simulate the value of
ﬁnite precision error directly; the value is then used to correct the reduced precision output result, so as
to improve the result accuracy. The method works under the assumption that the ﬁnite precision error
follows a normal distribution in the Monte Carlo simulation, which is showed empirically in [60].
In this thesis two novel optimisation approaches are discussed. In Chapter 5, optimisation is done by
dynamic constant reconﬁguration for Explicit Finite Difference method. An error function is obtained
empirically and used as a directive to execute the algorithm optimally. A Monte Carlo simulation
representative of real world scenario is then used to check the result precision requirement is met.
In addition, Chapter 3 proposes the use of Welch’s t-test to check that the reduced precision result is
statistically indifferentiable from the reference high precision result. This allows the use of reduced
precision data format in complex Monte Carlo simulations, and overcomes the limitations of [84].
2.7.2 Dynamic Reconﬁguration
Dynamic reconﬁguration is also called run-time reconﬁguration. It is another technique exclusive
to reconﬁgurable hardware for performance and productivity improvement. The technique can be
applied to design process as well as application optimisation.
38 Chapter 2. Background
In the design process, dynamic partial reconﬁguration is used to accelerate design validation by
reusing ﬂoorplans and placing frequently changing modules into separate, partially reconﬁgurable
regions during design time [86].
Dynamic reconﬁguration is also used to optimise application area. For example, real-time frame
processing for image streams used in mobile robotics can beneﬁt from dynamic reconﬁguration since
only a small subset of tasks have to be executed concurrently and it is not necessary to implement all
tasks at the same time on the FPGA [87]; similar technique is applicable to sorting architectures [88].
Temporal partitioning is studied to represent tasks using data ﬂow graphs (DFGs), the DFG is then
partitioned under resource constraints [89], so that large designs can ﬁt into a smaller area. In addition,
techniques such as Integer Nonlinear Programming (INLP) are used to minimise communication
between partitioned segments [90].
Designs with slowly varying inputs can also beneﬁt from run-time reconﬁguration. For example,
by applying LUT-level parameterisable reconﬁgurations, 40% reduction in FPGA resources can be
achieved for 32-tab FIR ﬁlters [91]; by utilising constant specialisation, a software deﬁned radio is up
to 49% more energy efﬁcient and up to 87% more area efﬁcient than a non-reconﬁgurable design [92].
Although existing works have shown signiﬁcant beneﬁts of applying dynamic reconﬁguration, the
main emphasis is on saving hardware resource rather than improving performance. In addition, none
of the ideas have been applied to ﬁnance applications.
In Chapter 5 of this thesis, dynamic constant reconﬁguration and an optimisation methodology to
further reduce the design area are discussed for ﬁnance applications using the explicit ﬁnite difference
method, which allows higher parallelism and better performance.
2.7.3 Summary
This section discusses relevant optimisation techniques applicable to reconﬁgurable hardware. In
particular, bitwidth optimisation utilises customised data format to save hardware resource and gain
higher parallelism, at a cost of numerical accuracy. Dynamic reconﬁguration exploits the time domain
to gain further performance; the reconﬁgurable hardware is conﬁgured at run time to accommodate
2.8. Hardware Description Languages 39
computational requests, so that more resource in the hardware is utilised for useful computation. The
next section introduces hardware description languages for reconﬁgurable hardware.
2.8 Hardware Description Languages
Hardware Description Languages (HDL) are essentially programming languages to programme re-
conﬁgurable hardware. The most well-known and widely used HDLs are VHDL and Verilog. Al-
though these are full-featured languages supporting hardware simulations and syntheses, they are
considered to be relatively low level descriptions which builds functions from the register-transfer
level. The application developer needs to divide an algorithm into smaller hardware components, and
explicitly describe the input, output and intermediate state of each component at each clock cycle.
The connections between the components are also described explicitly. As the complexity of algo-
rithms increases, low level descriptive languages makes it hard for human to keep track of the status
of each component in the system, preventing wide adoption of FPGAs in places such as the ﬁnance
industry. In fact traditional HDLs are very different from typical sequential software programming
languages such as “C++” and “Java”, which are widely adopted in existing systems and have a large
user base, makes them hard to be accepted by traditional programmers. To overcome this, researches
have been done to allow high level descriptions (i.e. “C” to gate descriptions) to be used for FPGA
programming. In this section we introduce some popular high level HDLs.
Handel-C is a behavioral language for FPGA design and is based on the ANSI-C programming lan-
guage. Handel-C is a superset of the ANSI-C language and contains additional constructs in exploit-
ing and abstracting the complexities when programming hardware. A Handel-C program requires
that a clock construct is used. This is usually set to the clock rate of the target device. Groups of
statements may be enclosed within PAR and SEQ code blocks, indicating that the statements should
execute in parallel (in the same clock cycle) or in a sequence (one after the other), respectively. The
Handel-C speciﬁcation also introduces the idea of channels: a link between parallel branches of code
to allow intercommunication [93]
The Data StreamManager (DSM) is a C/C++ API which handles the communication between the host
40 Chapter 2. Background
application and the Handel-C code running on the FPGA. It was designed by Celoxica to enable OS-
independent hardware/software co-design between applications written in C/C++ on a microprocessor
host and Handel-C on a reconﬁgurable hardware target [94].
HyperStreams is a high-level abstraction library in Handel-C. It can produce a fully pipelined hard-
ware implementation with automatic optimization of operator latency at compile time. HyperStreams
is useful when the target implementation involves a complex algorithm [95]. Unfortunately Hy-
perStreams is no longer available from Celoxica and Handel-C has been made less popular due to
business reasons.
MaxCompiler from Maxeler Technologies is a high-level application acceleration tool chain on a
Maxeler FPGA system. MaxJava, which is essentially a Java API adopting a streaming programming
model, is used for describing the computationally intensive parts of an application; these descriptions
are called kernels. MaxJava supports customisable data formats. A kernel manager is also deﬁned by
the user to conﬁgure FPGA speciﬁc properties. A hardware kernel is generated based on the MaxJava
descriptions for kernels and the kernel manager, and is run on the Maxeler FPGA system [96]. In this
thesis, Chapter 4 contains work based on the Celoxica technology; the rest of the chapters are based
on the Maxeler technology.
Much research has been done to simplify hardware-software co-design and develop frameworks to
integrate reconﬁgurable hardware and software design. A parallel programming library has been
proposed to realise C# programmes on FPGAs [97]. An IGOL (Imaging and Graphics Operator Li-
braries) framework is proposed for developing reconﬁgurable data processing application [98]. A
reﬂective component model is used to build a middleware platform [99]. A design methodology is
presented which enables designers to combine cycle-accurate descriptions with behavioral descrip-
tions [100]. A framework for developing FPGA-based conﬁgurable computing machines applica-
tion is discussed [101]. A high-level component-based methodology and design environment for
application-speciﬁc multi-core SoC architectures is presented [102]. A component-based electronic
system level design ﬂow which supports abstraction and reuse is presented based on the Gezel lan-
guage [103]
2.9. Summary 41
2.9 Summary
This chapter provides ﬁnancial, mathematics and hardware platform background for this thesis. The
basic payoff evaluation and modelling of ﬁnancial options mentioned in Section 2.1 and Section 2.2
are used in Chapter 4 and Chapter 5; The modelling of interest rates and the payoff evaluations of
interest rate derivatives introduced in Section 2.3 is used in Chapter 3. There are also discussions
about popular numerical methods, different computational devices and related research works and
ideas in Section 2.4, Section 2.5, Section 2.6 and Section 2.7.
Chapter 3
Multi-level Customisation
3.1 Motivation
In the ﬁnance industry numerical methods such as Monte Carlo simulations play an important role
for derivatives payoff evaluations, as complex mathematical models without closed form solutions
are created to accommodate the growing complexity of ﬁnancial products. Interest rate modelling is
one of the most important ﬁelds in mathematical ﬁnance research to price ﬁxed income products. In
the past two decades this ﬁeld has evolved from modelling a single instantaneous interest rate [33] to
modelling the dynamics of an entire forward rate curve [35]. Modelling each curve has a complexity
of O(n2) in the number of time-steps, compared to the conventional single-point modelling (such as
stock option payoff evaluation) which has a complexity ofO(n). In a large ﬁnancial institution where
overnight sensitivity tests and risk calculations are vital to control the institution’s risk exposure and
are required by regulators, curve based Monte Carlo modelling can consume over 30% of the total
computational capacity on the corporate compute grid with tens of thousands of CPU cores. With
computational requirements doubling every year, hardware accelerators such as FPGAs and GPUs are
increasingly being used to ofﬂoad computationally demanding tasks from CPUs, in order to improve
performance while reducing power consumption and data center space.
This chapter proposes a customisable Monte Carlo framework for the automated generation of highly
efﬁcient curve based payoff evaluation accelerator, based on the Heath-Jarrow-Morton (HJM) math-
42
3.2. HJM Monte Carlo Simulation 43
ematical framework. The main contributions are:
• a ﬂexible Monte Carlo framework with multiple levels of functional specialisations which can
be used to generate FPGA solutions for different applications without using a soft processor.
The framework is designed to be platform independent and easily extendible to support CPU
and GPU implementations (Section 3.3);
• a domain speciﬁc language to enable automatic generation of application-speciﬁc components
and to support architecture specialisation to a particular application (Section 3.4);
• a process to identify the optimal ﬂoating point data format on target reconﬁgurable hardware
for our architecture (Section 3.4);
• an evaluation of the proposed framework by comparing processing speed and energy efﬁ-
ciency to general purpose processors and graphics processing units over three case studies (Sec-
tion 3.5).
3.2 HJM Monte Carlo Simulation
FPGAs are increasingly being used for the acceleration of Monte Carlo models in ﬁnancial simu-
lations. For instance, a platform independent domain speciﬁc language has been invented to pro-
duce optimised pipelined designs with thread level parallelism for Monte Carlo simulations from a
high level abstraction [21]; an FPGA-based stream accelerator with higher performance than GPUs
and Cell processors has been proposed for evaluating European options [56]; an architecture with a
pipelined datapath and an on-chip instruction processor has been reported for speeding up the Brace,
Gatarek and Musiela (BGM) interest rate model for derivatives evaluation [20]; an American op-
tion valuator using least-squares Monte Carlo method has been implemented [58]; a control variate
Monte Carlo design for Asian options is presented [57], and a successful FPGA project in industry
has been reported for collateralised default obligation (CDO) pricing [12]. However, most of the
existing work seeks optimisations and generalisations for single-point simulations, while the more
44 Chapter 3. Multi-level Customisation
complex implementations usually involve a less efﬁcient FPGA-based softcore to handle general con-
trol functions [20]. Moreover, a highly optimised complex hardware design is usually less ﬂexible,
hence problematic when changes occur frequently. The appropriate balance of performance and pro-
grammability of designs remains a challenging problem. We use the Heath-Jarrow-Morton mathe-
matical framework to illustrate our approach to design space exploration for Monte Carlo designs
with complex control.
Algorithm 1 HJM Monte Carlo Algorithm: a Single Path
Input: f(0, T ) = initial forward curve, σ = volatility model
Output: f(t, T ) = forward surface
1: for t=0 to tmax do
2: for T ′=0 to T ′max do
3: Calculate Drift: obtain σ(t, T ) and get μ(t− δt, t+ T ′) using Equation 3.2
4: Update forward Surface: get f(t, t+ T ′) using Equation 3.1
5: Price Derivative State 1: Use f(t, t+ T ′) to price the target derivative
6: end for
7: Price Derivative State 2: Use result from State 1 to price the target derivative
8: end for
As discussed in Section 2.3, the Heath-Jarrow-Morton (HJM) Framework [35] is a general framework
for modelling instantaneous forward interest rate curve. It differs from short rate models in the way
that it models the full dynamics of the entire forward interest rate curve, as opposed to a single point
on the curve which is the short rate r(t). The equation of the framework is shown in Equation 3.1:
df(t, T ) = μ(t, T )dt+ σ(t, T )dW (t) (3.1)
μ(t, T ) = σ(t, T )
∫ T
t
σ(t, u)du (3.2)
where f(t, T ) is the instantaneous forward rate at time T as seen from time t and 0 ≤ t ≤ T ; σ(t, T )
is the forward volatility column vector of sizeN , where N is the number of factors in the framework;
W (t) is a random variable under standard normal distribution. For convenience we call t the time and
T ′ the time offset from time t, with T = t+ T ′, therefore f(t, T ) ≡ f(t, t+ T ′).
It can be seen that along each Monte Carlo path, a surface constructed by f(t, T ) is generated. Fig-
ure 3.1 shows the evolution of the forward rate curve over time for an arbitrary path of a Monte Carlo
simulation. A general Monte Carlo algorithm for the HJM model is shown in Algorithm 1. We now
3.2. HJM Monte Carlo Simulation 45
?????????? ?
???
????????????????
?? ?? ???
????????????
??
????????????
???? ???
????
??????
????
?
Figure 3.1: The evolution of the forward rate curve from t = 0 to t = tmax in one Monte Carlo path,
giving ti = Ti = ti + T ′i−1
explain Algorithm 1 in detail. At t0 we start with a instantaneous forward curve f(0, T ). For simplic-
ity we assume the forward curve is described by an array of 10 numbers, indexed by T . Therefore
f(0, 0) denotes the instantaneous forward rate at t0, and f(0, 1) denotes the same at t1, assuming we
are looking at the curve at t0. As simulation proceeds we are now at t1, and the curve is denoted
by f(1, T ). Because we have moved forward in time, the forward rate at t0 no longer concerns us,
therefore T now starts with t1. Therefore f(1, 1) denotes the instantaneous forward rate at t1, and
f(1, 2) denotes the same at t2, assuming we are looking at the curve at t1. In this way the algorithm
can simulate how the forward curve change with time, which is important in the payoff evaluation of
interest rate derivatives.
From line 5 of Algorithm 1, the forward curve f(t, T ) is used as a basic building block to eval-
uate interest rate products. The HJM framework is ﬂexible in two ways: (a) the user can choose
which volatility model to use (line 3 in Algorithm 1) and (b) the payoff evaluation function differs for
different ﬁnancial products (lines 5 and 7 in Algorithm 1). Based on different assumptions and appli-
cations, different volatility functions σ(t, T ) can be chosen. Table 3.3 shows some parameter settings
for Equation 4 under different volatility models. Forward curves generated by the HJM framework are
used to value different ﬁnancial products. Table 3.4 shows a non-exhaustive list of valuation functions
for different interest rate products.
46 Chapter 3. Multi-level Customisation
Table 3.1: Parameters in our framework
Model Parameters
d number of factors in the frame-
work
t the variable that tracks time in
the model
T another time in the future, given
current time is t
T ′ time offset from t to T , T = t +
T ′
f(t, T ) the instantaneous forward rate at
time T , as seen from time t
r(t) short rate at time t
σ(t, T ) forward volatility, a d-
dimensional column vector
W (t) d-dimensional standard random
process
Statistical Test Parameters
X1 mean of the reduced precision
result
X2 mean of the “true” result
σ1 standard deviation of the re-
duced precision result
σ2 standard deviation of the “true”
result
n1, n2 number of sampling in the sim-
ulation to get the reduced preci-
sion result and the “true” result
t the t-statistic to test whether the
population means are different
d.f. degrees of freedom in signiﬁ-
cance testing
wE,wF number of exponent bits and
mantissa bits in a ﬂoating point
number
wI, wF number of integer bits and frac-
tional bits in a ﬁxed point num-
ber
3.3 Multi-level Customisation Framework
Figure 3.2 shows our proposed framework for the HJM model, which is independent of the choice
of volatility structure and interest rate product. We deﬁne our framework based on a procedure for
developing evaluators of ﬁnancial product payoff, in which three levels of functional specialisations
can be identiﬁed:
• heavily specialised modules do not change with applications and are platform dependent;
• mediumly specialised modules change occasionally with applications and can be platform
dependent;
• lightly specialised modules are application dependent but platform independent.
The procedure has two phases: in the model developing phase platform experts develop heavily
specialised modules and deﬁne templates for mediumly specialised modules and optimise them for
3.3. Multi-level Customisation Framework 47
????????????????????
????????????????
???????????????
??????????????
?????????
?????????? ?????
??????????
????????????????? ?????
????????????????????
????????????????????? ??????
???
???
???????????
????????
?????????????? ?
???
????
?
???
???
?
Figure 3.2: Proposed framework for the HJM model, dark grey indicates stable part in the kernel,
while white indicates ﬂexible parts.
potential target platforms; in the payoff evaluator developing phase users choose a mediumly spe-
cialised module as a base component and a target platform, on which platform dependent ﬁnancial
product payoff evaluators are generated using a platform independent domain speciﬁc language and a
special compiler. Since the user only programmes the platform independent lightly-specialised mod-
ule in the framework, we have a clear separation of tasks in the ﬁnancial product payoff evaluator
development procedure. In addition, since we expect the model developing phase to be an one-off
effort and the payoff evaluator developing phase to be a continuous effort thereafter, we have created
an acceleration procedure in which platform dependent expertise is not required from application
developers (users).
The main part of the framework is the HJM payoff evaluation kernel, which consists of three compo-
nents: volatility logic (mediumly specialised), interest rate generator (heavily specialised) and payoff
evaluation logic (lightly specialised). Volatility logic, corresponding to line 3 of Algorithm 1, is
ﬂexible in the model developing phase and it is stable in the payoff evaluator developing phase. Pre-
deﬁned templates are used to allow limited ﬂexibility in volatility logic; this part is developed by
collaboration between platform experts and users. Platform experts need to understand the user’s re-
quirements which are usually platform dependent. The interest rate generator, corresponding to line 4
48 Chapter 3. Multi-level Customisation
???????????
?????????
???????????
?????????
??????????????
?????????
?????????????????
??????????????????
??
??
??
??
????
???
??
??
??
??
?
??
??
??
???
???
??
??
??
??
??
???
??
???
?
??????????????????????????????
????????
????????????????
?????????????????
??????????????
?????????????????????
???????????
?????????????????
?????????????????
????????????????????
?????????????????????????
????????
?????
??????????????????? ?????????????????
?????????????????????
?????????????????
????????????????????
??????????
Figure 3.3: Proposed workﬂow for our Monte Carlo framework. Refer to Table 3.2 for details about
the input and output ﬁles (numbered).
of Algorithm 1, is stable by nature and can be developed by platform experts who also deﬁne the
interface between the module and the payoff evaluation logic. On the other hand, the payoff evalua-
tion logic is prone to change and we expect many instances of payoff evaluation logic to be created
in the payoff evaluator developing phase over a long time. A platform independent domain speciﬁc
language is used to allow non-experts to use the accelerated framework easily.
Figure 3.3 shows our proposed workﬂow. The user begins the payoff evaluator developing phase by
deﬁning the choice of underlying platform, volatility model, accuracy requirement and parallelisation
requirement in a conﬁguration ﬁle. The choice of underlying platform deﬁnes whether the user wants
the application to run on FPGA, CPU, GPU etc.; volatility model deﬁnes the interest rate engine;
accuracy requirement deﬁnes word length of the datapath, and parallelisation determines the number
of parallel datapaths in the system. Based on the conﬁguration ﬁle, the interest rate engine generator
combines an appropriate interest rate generator template and a volatility module template to produce
an interest rate engine description. The engine description is a programming ﬁle that describes the
target platform. The user writes domain speciﬁc programmes to utilise the interest rates generated by
the engine and then builds appropriate designs to evaluate interest rate derivatives. The platform inde-
3.4. Application Specialisation Flow 49
Table 3.2: Explanation of input and output ﬁles in our framework as illustrated in Figure 3.3.
Idx1 Created By Phase2 Specialisation3 P.D. 4 Purpose
1 User M low No For non-expert users to target his/her
design to a particular platform
2 User & Expert M Medium Maybe Optimised volatility model design
3 Expert M High Yes Optimised interest rate generator de-
sign
4 User P low No Payoff evaluation logic design without
the need of knowing the underlying
platform
5 Expert M high Yes Platform dependent glue logic, e.g.
moving data around, etc.
6 Framework P high Yes Platform dependent programming ﬁle,
e.g. VHDL, etc.
1 These indices identity the six types of ﬁles in Figure 3.3
2 The phase that creates the ﬁle. M stands for the model developing phase and P stands for the payoff
evaluator developing phase
3 Level of functional specialisation
4 Whether the ﬁle is platform dependent
Table 3.3: Volatility structures used in the HJM framework
Volatility Structure σ(t, T ) μ(t, T )
Constant1 α 1
2
α2[T 2 − (T − t)2]
Exponential1 αe−β(T−t) α
2
β
(e−2β(T−t) − e−β(T−t))
Stochastic2 σ˜(t, T )f(t, T ) σ(t, T )
∫ T
t
σ(t, u)du
1 α and β are calibrated model constants
2 σ˜ is a stochastic volatility process
pendent user programming ﬁle is compiled to a payoff evaluation logic description programming ﬁle
for the target platform, by the payoff evaluation logic compiler. The HJM kernel generator combines
the interest rate engine description and payoff evaluation logic description to produce a complete pro-
gramming ﬁle, which is then used as input to the target platform tool chain to generate executables.
3.4 Application Specialisation Flow
In this section we discuss the specialisation process in our framework. We propose a “C” style control-
based domain speciﬁc programming environment to demonstrate the programmability of our frame-
50 Chapter 3. Multi-level Customisation
Table 3.4: Example interest rate products
Target Instrument Payoff Evaluation Function
Bond B(t, T ) = exp
(
− ∫ T
t
f(t, u)du
)
Bond Option1 (B(t, T )−K)+
CMS2 Y (t, T ) = 1−B(t,T )∑T
a=t B(t,a)
Swaption (Y (t, T )−K)+∑Ta=tB(t, a)
CMS S.O.3 (Y (t, T1)− Y (t, T2)−K)+
1 (x)+ ≡ max(0, x)
2 Constant Maturity Swap
3 CMS Spread Option
work. The programming environment has the following assumptions:
• The programming environment provides a set of environment parameters P generated by its
underlying framework at each iteration (one clock cycle for FPGAs, one for-loop iteration for
CPUs, etc.), which can be utilised by the developer (user). This means that data are provided
in a temporal manner as opposed to the conventional spatial manner. Instead of requesting
a piece of desired data, the developer waits until the data are provided by the programming
environment.
• Operator latency is implicit; results appear to be produced instantaneously. This allows the
developer to use the framework without expertise in the target platform.
• The user can create model input variables, intermediate variables, accumulation logic, control
logic and nothing else. The set of input variables V , once declared, is treated as a data environ-
ment, in other words P ′ = P ∪ V .
• The user speciﬁes outputs from intermediate variables, and output conditions.
The language is natively supported by CPU and GPU implementations and can be used to generate
control and datapaths in hardware. We give a simpliﬁed deﬁnition of the grammar in Listing 3.1 in
order to provide an overview of our domain speciﬁc language, it is not intended to be rigorous or
comprehensive.
We now demonstrate the application specialisation process for reconﬁgurable hardware. To begin
with, we map our language to the hardware architecture. We assume P = {fi,j, discounti, dt, i, j},
3.4. Application Specialisation Flow 51
V = {Imax, Jmax} and P ′ = P ∪ V , as shown in Figure 3.4, where i is the index for time step t,
the incrementation of i depends on index j which is the index for time offset T ′; fi,j is the discretised
instantaneous forward rate at time Tj , as seen from time ti; dt is the time difference between Tj and
Tj−1 or ti and ti−1; and discounti is the discount factor to present time.
Listing 3.1: Simpliﬁed grammar for the domain speciﬁc language used in our
framework. Note that the grammar is subject to extension and it is intended to
show the capability of the language, hence it may neither be rigorous nor be
comprehensive.
<c o n f i g u r a t i o n > : : = <s t a t emen t >+
<s t a t emen t> : : = <c a l c s t a t em e n t >|< i o s t a t em e n t >
<c a l c s t a t em e n t > : : = < i d e n t> = <e xp r e s s i o n>
| if ( <e xp r e s s i o n> ) { <c a l c s t a t em e n t >+ }
[ else { <c a l c s t a t em e n t >+ } ] ?
< i o s t a t em e n t > : : = input ( < i d e n t> )
| output ( <e xp r e s s i o n> )
<e xp r e s s i o n> : : = < l i t e r a l >
|< unary op> <e xp r e s s i o n>
|< e xp r e s s i o n> <b i n a r y op> <e xp r e s s i o n>
|< func> ( <e xp r e s s i o n> [ , <e xp r e s s i o n >]∗ )
< l i t e r a l > : : = < e n v l i t e r a l > | <u s e r l i t e r a l >
The top box in Figure 3.4 shows the programming environment in which one P ′ is provided in each
clock cycle. The user relies on the programming environment to provide a correct set of parameters,
and assumes that variable j counts inside each variable i. The bottom box in Figure 3.4 shows a fully
pipelined design generated from the description in the top box. The n-buffer under the accumulator is
to hide pipeline latency of the accumulator by keeping a history of n previous value. This effectively
increases total pipeline latency of the datapath by n, however it allows full pipelining in the datapath
without using low performance un-pipelined accumulator. Latency balancing buffers are omitted in
52 Chapter 3. Multi-level Customisation
? ? ???? ?? ??????????? ?? ????
??????? ?????????????????
??????? ????
???????? ?????????? ?????????????????
?????????? ????????? ?????????????
????????????????????????????
???????? ?????????????????
????????????????????????????????
? ? ???? ?? ??????????? ?? ????
?
???
?
???
??
??
???
?
???????
???
?
????????????
?????????????????????
??????????????
??????????
Figure 3.4: Domain speciﬁc language Example 1.
the ﬁgures for simplicity. Figure 3.5 shows a possible extension based on Example 1, in which annuity
is calculated based on prices of zero bonds maturing over a time period.
Reconﬁgurable hardware supports customised word length of its datapath in order to optimise hard-
ware utilisation based on an accuracy requirement. Previous research focuses on ﬁne-grained bitwidth
optimisation, such as simulation [73], interval arithmetic [74], backward propagation analysis [75],
afﬁne arithmetic [76] and polynomial algebra [77]. However most of them are not straightforward for
complex Monte Carlo problems where multiple levels of combinational logic consisting of ﬂoating
point datapath and accumulators are combined with complex control-ﬂow and feedback paths. We
propose a purely statistical method to determine the optimal data format for reconﬁgurable hardware.
The accuracy of a ﬁnancial instrument payoff calculated by a numerical method is usually affected by
two main factors:
3.4. Application Specialisation Flow 53
? ? ???? ?? ??????????? ?? ????
??????? ?????????????????
??????? ???? ???????? ?????????? ?????????????????
????????????????????? ??????????????
????????????????????????????
?????????????????????????????????????????
???????? ???????????? ??????????????????????????????????
?????????????????????????????????????????
???????? ????????????????????????????????????
? ? ???? ?? ??????????? ?? ????
?
???
?
???
??
??
???
?
???????
???
?
????????????
???
?
???
????????
Figure 3.5: Domain speciﬁc language Example 2, extending Example 1.
• Discretisation error: the error caused by transforming the model from a continuous mathemat-
ical space to a discretised computational space. In our case when the Monte Carlo method
is used, the discretisation error comes from insufﬁcient sampling of the underlying random
source.
• Finite precision error: the error caused by using number representations of insufﬁcient accuracy.
The error can be ampliﬁed or diminished by numerical operators in the datapath. In Monte
Carlo simulations we expect the ﬁnite precision error to be a normally distributed random factor,
according to the law of large numbers and the central limit theorem.
We therefore deﬁne the measure of accuracy to be the observed error due to the combination of
discretisation error and ﬁnite precision error. As discussed in Section 2.4.2, Monte Carlo is a statistical
method relying on the law of large numbers, it is therefore possible to use Welch’s t-test to assess the
statistical signiﬁcance of the error in the result [64]. The test determines whether there is any statistical
54 Chapter 3. Multi-level Customisation
evidence suggesting the Monte Carlo result is different from the “true result”, which is the result
calculated by a high precision datapath, e.g. double precision. We therefore set the null hypothesis
to be that the Monte Carlo result and the true result are equal, assuming we know the true result and
its standard deviation beforehand. We use Equation 3.3 to calculate the t-value and Equation 3.4 to
calculate the degree of freedom (d.f.). These two values can then be used to obtain the p-value via
the Students-t CDF. The deﬁnition of the variables are shown in Table 3.1. The experiment will run
the Monte Carlo simulation using a customised ﬂoating point data format with wE bits of exponent
and wF bits of mantissa. For the sake of simplicity from now on we deﬁne wE = 8 for all ﬂoating
point number formats. We use the custom data format to build the datapath for a given Monte Carlo
payoff evaluation simulation and run different experiments with n1 starting from a smaller number
and incrementing towards inﬁnity. During the experiments we monitor the p-value, and once the p-
value falls below a pre-selected statistically signiﬁcant threshold (e.g. p=0.05 for 5% signiﬁcance)
the simulator is considered to have failed. If the test does not fail on the custom data format, we can
conclude that the result from the custom datapath is not statistically different from the data format
used to obtain the “true result”.
t =
X1 −X2√
σ21/n1 + σ
2
2/n2
(3.3)
d.f. =
(σ21/n1 + σ
2
2/n2)
2
(σ21/n1)
2
/(n1 − 1) + (σ22/n2)2/(n2 − 1)
(3.4)
Welch’s t test is used in this thesis under the assumption that the parameters of a target option stays
relatively stable. An experiment is carried out in the result section to conﬁrm this assumption. In the
case which the parameters move volatilely a continuous monitoring system is required to correct the
ﬁnite precision error. Such system consists of additional full precision datapaths and error correction
logic, which is described in [60].
3.5 Results
In this section we discuss the applications of our framework over three case studies: bond option,
swaption, and CMS spread option (CMS S.O.). The details of these options are listed in Table 3.4.
3.5. Results 55
We use the MaxWorkstation reconﬁgurable accelerating system from Maxeler Technologies to eval-
uate our framework. It has one MAX3 card with a Xilinx Virtex-6 SX475T FPGA. The card is
connected to an Intel i7-870 CPU through a PCI express link with a measured bandwidth of 2 GB/s.
The general purpose processor (GPP) in our comparison is a 4-core Intel i7-870 CPU running at
2.93 GHz.
We use the Intel Compiler (ICC) and the Intel Math Kernel Library for our software implementations.
The SFMT random number generator and the Box-Muller transformation provided by Intel Vector
Statistical Library (VSL) are used for random number generation. We have optimised the software
implementations to the best of our knowledge, to ensure the comparisons are fair and accurate.
The FPGA implementations are generated based on: user programming ﬁles compiled automatically
by our payoff evaluation logic compiler, and conﬁguration ﬁles, volatility templates, glue logic and
interest rate generator template written and optimised by hand. We use the MaxCompiler as our
high level synthesis tool and our payoff evaluation logic compiler generates intermediate descriptions
compatible with the MaxCompiler based on our domain speciﬁc language. We use one CPU core to
drive the FPGA in our case studies. The payoff evaluation logic compiler is generated by ANTLR
parser generator [104]. The hand-optimised interest rate generator consists of a LUT Optimised
uniform random number generator [105], a wrapper to transform uniform random numbers to standard
normal random numbers [106] and a ﬂoating point exponential operator [107]; other components are
generated by Xilinx CoreGen.
We use Welch’s t-test described in Section 3.4 to determine the optimal ﬂoating point and ﬁxed point
number format to use on FPGA. The test is designed so that we have a set of applications A, let Ai
(i ∈ N) indicate the ith application in set A, and Ai,wF indicate a variation of application Ai using a
speciﬁc ﬂoating point data format with wF mantissa bits. We deﬁne a test result to be a tuple (Ai,wF ,
n), which is obtained from a Monte Carlo simulation of n paths using application Ai,wF . The test
result is then compared to a reference result obtained from (Ai,53, 109), which is a double precision
variation of Ai running one billion Monte Carlo paths. We assume that all reduced precision data
formats under consideration have 8 exponent bits (wE = 8) for ﬂoating point numbers and 9 integer
bits (wI = 9) for ﬁxed point numbers based on two’s complement number representation. We use
56 Chapter 3. Multi-level Customisation
????
????
????
????
????
????
??
??
??
??
?
??
??
?
??
??
?
??
??
?
??
??
?
??
??
??
??
??
??
?
??
??
??
?
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
?
??
??
??
??
?
??
??
??
?
?????????? ???????????????
??
??
??
??
??
??
??????
????
Figure 3.6: p-value in log scale for Bond Option based on ﬂoating point number representation.
Equation 3.3 to calculate the t-value and Equation 3.4 to calculate the degrees of freedom, from which
the p-value can be obtained from standard t-distribution tables. Setting the null hypothesis to state
that the result obtained from a reduced precision datapath is the same as the result obtained from a
double precision datapath, if p-value is smaller than or equal to the signiﬁcance level (p = 0.05), the
reduced precision result is then rejected, since there is enough information to tell that the reduced
precision result and the double precision result are from two different distributions. Otherwise the
observation is consistent with the null hypothesis and we consider the reduced precision result to be
not statistically different from the double precision result.
In the experiment we have A = { Bond, Bond Option, CMS, Swaption, CMS Spread Option},
8 ≤ wF ≤ 23 and 5000 ≤ n ≤ 2 × 109. We use the MPFR library [108] to build reduced precision
datapaths in order to carry out the experiments in a scalable way; however, because the calculations
mirror those performed in hardware, the results also apply to the FPGA datapath. In this research we
discuss two representative cases: Bond Option and Swaption. Figure 3.6 shows different p-values of
the bond option application based on ﬂoating point with different number of bits for mantissa over
various number of Monte Carlo runs. The data formats with wF ≤ 10 are ignored as they fail the
t-test when n = 5000. The 11-bit mantissa version fails when n = 106, which means there is no
statistically signiﬁcant evidence to tell the reduced precision result from the double precision result
3.5. Results 57
????
????
????
????
????
??
??
??
??
?
??
??
?
??
??
?
??
??
?
??
??
?
??
??
??
??
??
??
?
??
??
??
?
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
??
?
??
??
??
??
?
??
??
??
??
?
??
??
??
??
?
??
??
??
?
?????????? ???????????????
??
??
??
??
??
??????
????
Figure 3.7: p-value in log scale for Swaption based on ﬂoating point number representation.
when n < 106. Therefore if the user only intends to run Monte Carlo simulations of n < 106 paths,
he/she can get an answer not statistically signiﬁcantly different from the double precision result by a
reduced precision datapath. On the other hand, with a 17-bit mantissa we do not fail until 5 × 108
samples, which means the 17-bit version can produce a good enough result for n < 5 × 108. Other
variations with wF ≤ 17 all fail t-tests before n = 5× 108.
Figure 3.7 shows p-values for Swaption application based on ﬂoating point number representations,
which is an extreme case where the 17-bit mantissa version fails t-test when n = 3 × 108; it means
that we need to increase the mantissa bits if n exceeds 3× 108. However, this is the only application
where the 18-bit mantissa version fails t-test. On the other hand, we don’t see any 23-bit mantissa
version fails any t-test during our experiments. In ﬁnance industry Monte Carlo simulations usually
use n = 5× 104, which is well below the 3× 108 threshold, so ﬂoating point numbers with 17 to 23
bit mantissas should be good enough to produce reliable results.
The corresponding results for ﬁxed point number representations are illustrated in Figure 3.8 and
Figure 3.9. It can be seen that though ﬁxed point numbers requires more fractional bits to survive
the t-test when n = 5000, any ﬁxed point number representation with wF ≥ 19 only fails the t-test
when n ≥ 5× 108. This means ﬁxed point number representations with wF ≥ 19 can produce good
enough result for n < 5× 108.
58 Chapter 3. Multi-level Customisation
????
????
????
????
????
??
??
??
?
?????????? ???????????????
??
??
??
??
??????
????
Figure 3.8: p-value in log scale for Bond Option based on ﬁxed point number representation.
????
????
????
????
????
??
??
??
?
?????????? ???????????????
??
??
??
??
??????
????
Figure 3.9: p-value in log scale for Swaption based on ﬁxed point number representation.
In order to verify the assumption we made in using Welch’s t test, experiment illustrated in Figure 3.6
is repeated with 10 randomly generated bond options based on the original experiment. The param-
eters of the bond options are generated so that the absolute difference between the new value and
original value is less than 10% of the original value. The result is shown in Figure 3.10. It can be seen
that based on the samples we draw from the entire population, we are 99.7% conﬁdent that the result
of Welch’s t test will still hold given that the parameters of the bond option move less than 10% of the
original value. If in reality the parameters move more volatilely than 10%, a continuous monitoring
system is required to correct the ﬁnite precision error, such system is proposed by [60].
3.5. Results 59
?
????
???
????
???
????
???
????
???
????
????
????
????
????
????
??????????????????
??
??
??
??
??
???
?
?????????? ???????????????
??????????????
??????????????
??????????????
??????????????????
????????????????????????
????????????????????????
?????????????????????????
Figure 3.10: Mean and standard deviation of p-value of 10 randomly generated bond options based
on the experiment illustrated in Figure 3.6. Note that the left y axis is for the mean of p-value and it
is in log scale, and the right y axis is for the standard deviation of p-value and it is in normal scale.
Table 3.5 shows a resource consumption comparison between double precision and reduced precision
ﬂoating point implementations. It can be seen that when double precision is used in the datapath, all
implementations are bounded by Block RAM resource, since the designs require signiﬁcant amounts
of FIFO buffer to pipeline accumulators and to retime pipelines with long delay. We can only ﬁt 2-3
double precision cores on the FPGA and utilise around 20% of the logical hardware resource. On
the other hand, if we use reduced precision data format (wE = 8, wF = 17), given that the result is
statistically equivalent to the double precision result, we ﬁnd Block RAM resource usage reduced by
15 to 20 times, and the design is now bounded by logic resources instead. Note that in this case the
Block RAM resource usage reduction is optimised by tool chain and not by hand. This means that we
can utilise more area on the FPGA to do computation and expect a higher throughput. The top clock
frequencies also increase by about 1.4 times accordingly.
Further resource consumption reduction is achieved by utilising ﬁxed point number representations, as
shown in Table 3.6. The t-test result indicates that ﬁxed point number representations with wF ≥ 19
will produce acceptable result. If high precision ﬁxed point data format (wI = 9, wF = 23) is used,
each kernel consumes about 80% of bottleneck hardware resource compared to a (wE = 8, wF = 17)
ﬂoating point kernel. If reduced precision ﬁxed point (wI = 9, wF = 19) numbers are used, the
bottleneck resource reduction compared to the corresponding reduced precision ﬂoating point kernels
60 Chapter 3. Multi-level Customisation
Table 3.5: Floating Point Resource Comparison: wE = 8, wF = 53 (double precision) and wE =
8, wF = 17 (reduced precision) on a Virtex-6 SX475T FPGA. Note that the numbers are based on
place and route results
Bond Option Swaption CMS S.O. Device
Num. Mantissa Bits 53 17 53 17 53 17 -
LUT (%) 6.2 3.26 7.95 3.77 11.64 5.1 297600
FF (%) 4.04 2.18 5.43 2.5 7.94 3.33 595200
BRAM (%) 28.76 1.88 29.04 1.88 41.82 2.02 1064
DSP (%) 6.55 1.39 7.04 1.49 8.09 1.74 2016
Clock Freq. (MHz) 195 270 185 265 170 230 -
Normalised Area 15x 1x 15x 1x 20x 1x -
Normalised Freq. 1x 1.4x 1x 1.4x 1x 1.35x -
Table 3.6: Fixed Point Resource Comparison: wI = 9, wF = 23 (high precision ﬁxed point) and
wI = 9, wF = 19 (reduced precision ﬁxed point) on a Virtex-6 SX475T FPGA, compared to the
corresponding wE = 8, wF = 17 ﬂoating point kernels in Table 3.5. Note that the numbers are based
on place and route results
.
Bond Option Swaption CMS S.O. Device
Num. Fraction Bits 23 19 23 19 23 19 -
LUT (%) 2.44 2.2 2.79 2.55 3.72 3.36 297600
FF (%) 1.77 1.62 2.01 1.86 2.7 2.47 595200
BRAM (%) 2.44 2.02 2.44 2.02 2.61 2.17 1064
DSP (%) 2.58 2.43 2.77 2.61 3.23 3.04 2016
Clock Freq. (MHz) 265 270 260 265 240 240 -
Normalised Area 0.79x 0.75x 0.74x 0.69 0.73x 0.66x -
Normalised Freq. 0.98x 1x 0.98x 1x 1x 1x -
is even further: 25% resource reduction for the Bond Option kernel; 31% for the Swaption kernel and
34% for the CMS Spread Option kernel.
Table 3.7 contrasts GPP and GPU implementations with the FPGA implementations generated by our
framework. Each FPGA implementation uses about 80% of the total logic resource available on the
FPGA to avoid congestion in the place and route phase. The FPGA implementations are based on a
reduced precision ﬂoating point data format (wE = 8, wF = 17) and a reduced precision ﬁxed point
data format (wI = 9, wF = 19). The testing cases are Monte Carlo simulations of 100 million paths
for GPP and FPGA, and an 89.6 million paths Monte Carlo simulation for GPU. The GPU testing
case is designed to ﬁt the GPU parallelism granularity to ensure fair comparison. It can be seen that
the FPGA implementations are at least over 40 times faster than software implementations utilising
one of the four CPU cores. They are at least about 11 times faster than the corresponding software
implementations utilising all four CPU cores. It is not surprising to see that the FPGA CMS spread
3.5. Results 61
option implementation is the slowest, since it requires complex logic to sample two different sections
on the forward curve, which implies larger kernels, slower clock frequency and fewer parallel kernels
in the FPGA. The software implementation suffers less from the increase of complexity since the two
samplings are independent of each other and the instructions can be efﬁciently pipelined.
We use an Ethernet-connected power measuring socket from Oslon electronics to measure average
power consumption of the system, with a measuring resolution of 1 sample per second. As shown
in Table 3.7, it is not surprising to see that FPGA implementations are generally over 20 times more
energy efﬁcient than Quad-Core software implementations, given that all power readings include idle
power consumption of the system.
We now discuss our Graphics Processing Unit (GPU) benchmark to compare the FPGA implemen-
tations generated by our framework. The GPU is an NVIDIA Tesla C2070 device with 448 cores
running at 1.15 GHz and has a peak double precision performance of 515 GFlops. The benchmark
implementation on bond option is based on the standard parallel random number generator provided
by CURAND Library and nvcc compiler with maximum optimisation ﬂags turned on. The imple-
mentation is manually optimised so that access to off chip memory only occurs at the beginning and
at the end of the kernel launch. Warp level control is used to ensure the kernel only accesses on chip
cache during the execution without any bank conﬂict or branch divergence. As shown in Table 3.7,
the GPU implementation is about three times less energy efﬁcient and the corresponding FPGA im-
plementation is about 41% faster. Given the fact that both devices are using the 40nm technology, it
can be seen that the FPGA implementations are gaining speed advantage and energy efﬁciency from
customisable data format, fully pipelined datapath and lower clock frequency.
It is difﬁcult to make precise qualitative comparison between our approach and the traditional hand
written approach, in terms of development time and quality of code. However, when compared with
simple hand-written designs using a high-level programming language [43], which requires the user
to write hundreds of lines of code, our automated approach requires less than ten lines of code (Fig-
ure 3.4).
62 Chapter 3. Multi-level Customisation
Table 3.7: Comparison of MC simulations using double precision GPP(SW), reduced precision FPGA
and single precision GPU.
Bond Option Swaption CMS S.O.8
Device SW3 FPGA1 FPGA2 GPU7 SW FPGA1 FPGA2 SW FPGA1 FPGA2
Clock Freq. (MHz) 2930 160 150 1150 2930 150 150 2930 150 150
Num. of Cores 4 26 30 448 4 20 24 4 16 21
Num.Evaluations (B)6 177 177 177 154 177 177 177 177 177 177
Exe. Time (Seconds)11 476 50.3 41.1 50.5 738 61.2 51.1 822 73.75 60.1
Power Consumption (W)5 183 87 87 240 184 87 86 184 85 87
Energy Efﬁciency4 2.0 40.4 49.5 12.7 1.3 33.2 40.3 1.2 28.2 33.9
Speedup vs Single-Core9 4.0x 37.9x 46.3x 32.8x 4.0x 48.2x 57.8x 4.0x 44.6x 54.7x
Speedup vs Quad-Core10 1.0x 9.5x 11.6x 8.2x 1.0x 12.1x 14.4x 1.0x 11.1x 13.7x
Normalised Energy 19.9x 1.0x 0.8x 2.8x 25.5x 1.0x 0.8x 24.1x 1.0x 0.8x
1 Reduced precision ﬂoating point number representation ((wE = 8, wF = 17))
2 Reduced precision ﬁxed point number representation ((wI = 9, wF = 19))
3 The software utilises all 4 physical cores by process level parallelism
4 Measured in number of evaluations/second/Joule
5 The idle power consumption of the system is 80W
6 Number of point evaluations in the simulation, measured in billions of f(t, T ) calculation
7 The benchmark GPU is an NVIDIA Tesla C2070 device
8 CMS Spread Option
9 Speedup against one core of a quad-core CPU
10 Speedup against all four cores of a quad-core CPU
11 The execution time is measured ignoring data transfer time to GPU and FPGA
3.6. Summary 63
3.6 Summary
This chapter proposes an application independent Monte Carlo framework for interest rate derivatives
payoff evaluations based on the HJM model. By identifying three levels of functional specialisations
in the model, we allow a manually optimised component, a templated component and a user pro-
grammable component in our framework, in order to retain good performance and support multiple
applications. The framework is designed to be platform independent and easily extendible to support
CPU and GPU implementations. To specialise our framework to a particular application, we propose
a domain speciﬁc language for the user programmable component. We also propose a process for
the FPGA platform to identify the optimal reduced precision data representation format to ensure
maximum utilisation of hardware resource. We have shown that, by adopting optimal number rep-
resentation in the datapath, we can reduce the memory resource usage by 15 to 20 times, allowing
better utilisation of logic resource.
The designs generated by our framework for a Xilinx Virtex-6 SX475T FPGA are generally about
50 times faster than a single-core implementation on a i7-870 quad-core CPU at 2.93 GHz, are over
10 times faster and 20 times more energy efﬁcient than 4-core implementations on the same i7-870
quad-core CPU, and are three times more energy efﬁcient and over 40% faster than an NVIDIA Tesla
C2070 GPU at 1.15 GHz.
Chapter 4
Explicit Finite Difference Method
4.1 Motivation
In Chapter 2 we have discussed various numerical methods for option payoff evaluation. Amongst
which, the ﬁnite difference method [30] is a fundamental numerical procedure being used in situations
where no closed-form solution exists. In particular, it can be used to handle certain types of options,
such as American options, that cannot be easily modelled by Monte Carlo methods.
The ﬁnite difference method solves Partial Differential Equations (PDEs) by discretising the under-
lying variables in the PDE. The method is applied in many scientiﬁc domains, examples include the
Laplace Heat Equation in thermodynamics [51], Maxwell’s Wave Equation in electromagnetism [52]
and the Black-Scholes equation in ﬁnance [30].
In many ﬁnancial institutions, the ﬁnite difference method is used in calculating risk positions for
large portfolios, which can take many hours even on a large computer grid. The amount of computa-
tion needed in a ﬁnite difference problem increases cubically with accuracy, and exponentially with
the number of random variables [61]. Our aim is to explore acceleration technologies which can sig-
niﬁcantly reduce both execution time and size of computer grid required to execute the computation,
without affecting result accuracy.
This chapter shows how Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units
64
4.2. Option Payoff Evaluation 65
(GPUs) can provide viable methods for accelerating the ﬁnite difference methods as applied to dif-
ferent PDEs. We mainly discuss designs based on ﬁnite difference methods for European option,
American option and Asian option valuations. These designs are also ﬂexible enough to handle more
complex options, and other types of PDEs such as the Heat Equation. The main contributions are:
• a categorisation of explicit ﬁnite difference method for payoff evaluation of different options,
a theoretical workﬂow and a hardware independent parallel architecture based on the workﬂow
which supports those options (Section 4.3);
• a method for customising the parallel architecture to different option types; and an analytical
model to estimate the performance of the customised architecture, allowing performance tuning
before the actual implementation, with given hardware resource constraints (Section 4.3);
• three case studies to evaluate the performance and energy consumption of the customised im-
plementations in reconﬁgurable hardware, by contrasting them to alternative implementations
based on two NVIDIA GPUs and two general-purpose Intel processors from different technol-
ogy generations (Section 4.5);
• a discussion of energy consumption, parallel scalability and potential performance trend of the
proposed architecture (Section 4.5).
4.2 Option Payoff Evaluation
We now recap the ideas behind option valuation as discussed in Chapter 2. An option provides the
holder the right but not the obligation to buy from, or sell to, a third party a speciﬁed quantity of an
underlying asset at a predetermined price,K, known as the strike price. The option contract is written
by the seller and purchased by the buyer, whereby a put option enables the holder to sell at K, or a
call option gives the holder the right to buy at K. The rights conveyed by puts and calls will only be
exercised by the holder if the prevailing spot price S, is below K for a put (payoff is max(K − S, 0))
and above K for a call (payoff is max(S −K, 0)).
66 Chapter 4. Explicit Finite Difference Method
Table 4.1: Various option types and their corresponding payoff evaluation methods
Option Type Multi-Exe.1 Path Dep.2 EFD Impl. Cmplxty.3 MC Impl. Cmplxty. EFD Scal.4 MC Scal.
European Low Low
1 to 3 Vars 1to Many Vars
American   Low High
Asian  Low Low
Barrier  Low Low
Bermudan   Low High
1 Multiple exercise opportunities before option expiration
2 Path dependency, whether the option payoff depends on the entire path of the underlying asset price
movement before expiration
3 Implementation complexity, whether the method involves complex numerical operations such as matrix
multiplication
4 Scalability of the method, in terms of supporting multiple underlying assets (variables)
4.2.1 Categorisation
There are three main types of option payoff:
• European (single exercise opportunity at the maturity of the option),
• American (exercisable any time during the life of the option),
• Asian (single exercise opportunity at expiry, but payoff depends on the average underlying price
over a period of time).
American exercise options are generally worth more than European exercise options as they convey
more ﬂexibility around the timing of exercise. However, valuing the extra ﬂexibility involves dealing
with the path dependence that is introduced into the valuation model, which results in more complex
models. Similarly, Asian options also exhibits path dependency due to its special payoff function,
hence requiring a different class of valuation models compared to American options. In particular, it
is difﬁcult and computationally expensive to deal with path dependence using Monte Carlo models.
Finite difference techniques provide an alternative way of valuing American exercise options.
There are other types of exotic options such as barrier options and Bermudan options, which are
essentially variations of European options and American options with extra conditions in the payoff
function. Table 4.1 summarises and compares the option types covered in this chapter.
4.2. Option Payoff Evaluation 67
Table 4.2: Existing PDE solvers
Application Implementation
The Heat Equation [54] 24 bit Fixed Point
Maxwell’s Equation [55] 16 bit Fixed Point
Poisson Problem
(Iterative Reﬁnement) [111] Mixed 32/64 bit Floating Point
4.2.2 Existing Work
Hardware acceleration of ﬁnancial instrument payoff evaluation using Monte Carlo methods has been
studied intensively in the past few years. For instance, a platform independent domain speciﬁc lan-
guage has been invented to produce optimised pipelined designs with thread level parallelism for
Monte Carlo simulations from a high level abstraction [21]; an FPGA-based stream accelerator with
higher performance than GPUs and Cell processors has been proposed for evaluating European op-
tions [56]; an architecture with a pipelined datapath and an on-chip instruction processor has been
reported for speeding up the Brace, Gatarek and Musiela (BGM) interest rate model for derivatives
payoff evaluation [20]; an American option valuator using least-squares Monte Carlo method has
been implemented [58] and a control variate Monte Carlo design for Asian options is presented [57].
Some recent studies have focused on pipelined tree based methods [109] and quadrature methods [110].
Tree based methods are efﬁcient, and can handle certain types of options such as American options
that are difﬁcult to handle with Monte Carlo methods, while quadrature methods can provide more
accurate results than tree based methods in certain cases. However, it is mathematically more difﬁ-
cult to convert an abstract mathematical model to these forms. By comparison, the ﬁnite difference
method is well studied in the past few decades for its correctness, convergence and stability; it is also
mathematically easy to apply and computationally easy to implement. As a result it is widely used in
the industry.
Table 4.2 summarises previous FPGA acceleration of ﬁnite difference and iterative reﬁnement meth-
ods. However, these are highly domain speciﬁc and not suitable for ﬁnancial option payoff evaluation.
On the other hand, the work presented in this chapter is ﬂexible enough to be generalised to solve other
types of PDEs such as the Heat Equation.
68 Chapter 4. Explicit Finite Difference Method
4.2.3 The Finite Difference Method
The ﬁnite difference method solves the PDE of an option by discretising both time and the price of
the underlying asset S, and mapping both onto a two-dimensional grid. There are three main kinds
of ﬁnite difference methods in common use: implicit, explicit and Crank-Nicolson. We consider the
explicit mechanism as it is the most intrinsically parallelisable method of the three. The other two,
though more numerically stable (which is beneﬁcial for calculating risk sensitivities) involve solving
systems of M linear equations at each iteration [30]. In this subsection the explicit ﬁnite difference
method is introduced by two example PDEs.
Black Scholes PDE
The Black Scholes PDE with one variable (asset) has the following form:
∂v
∂t
+ (r − q)S ∂v
∂S
+
1
2
σ2S2
∂2v
∂S2
= rv (4.1)
where v denotes the price of the option, it is a function of S and t hence is sometimes noted v(S, t).
S denotes the value of the underlying asset, t denotes a particular time, r is the risk-free interest rate,
σ is the volatility of the underlying asset, and q is the dividend yield paid by the underlying asset. For
simplicity, in the rest of this chapter we assume that the underlying does not pay dividends therefore
q = 0.
Suppose the time to maturity for a put option is T . We discretise T by dividing it into N equally
spaced intervals: Δt = T/N , and calling the beginning of each interval a time step. As a result, a
total of N + 1 time steps are considered:
0, Δt, 2Δt, ..., (N − 1)Δt, T
We then determine an asset price Smax such that v(S, T ) < , and  is a sufﬁciently small number
close to zero. The asset price space is divided into M − 1 equally spaced intervals in the following
4.2. Option Payoff Evaluation 69
way, resulting in M underlying asset price steps:
S0 −
⌈
M − 1
2
⌉
ΔS, ..., S0 −ΔS, S0, S0 +ΔS, ..., S0 +
⌊
M − 1
2
⌋
ΔS
where
ΔS =
Smax − S0

(M − 1)/2 ,
S0 −
⌈
M − 1
2
⌉
ΔS > 0,
S0 +
⌊
M − 1
2
⌋
ΔS = Smax.
S0 denotes the value of the underlying asset at time t = 0.
A (N + 1) by M grid is deﬁned by the time and asset price points.
Applying explicit discretisation where
∂v
∂t
=
vi+1,j − vi,j
Δt
(4.2)
∂v
∂S
=
vi+1,j+1 − vi+1,j−1
2ΔS
(4.3)
∂v2
∂2S
=
vi+1,j+1 + vi+1,j−1 − 2vi+1,j
ΔS2
, (4.4)
we obtain the basic discretised form of the PDE:
vi,j = αvi+1,j−1 + βvi+1,j + γvi+1,j+1 (4.5)
α =
1
1 + rΔt
(−1
2
rjΔt+
1
2
σ2j2Δt) (4.6)
β =
1
1 + rΔt
(1− σ2j2Δt) (4.7)
γ =
1
1 + rΔt
(
1
2
rjΔt+
1
2
σ2j2Δt). (4.8)
However, the basic discretised form of the PDE is numerically less efﬁcient since the coefﬁcients α,
β and γ are dependent on the underlying asset price index j.
An efﬁcient approach to compute within this grid can be obtained by a change of variable tech-
70 Chapter 4. Explicit Finite Difference Method
?
?
?? ?????
???????
?????????
??????????
?
?
??? ??
???????????
???? ??????
??
??
?? ?? ????
?????
?????
?????
??
?
???
???
Figure 4.1: A 5 × 6 ﬁnite difference grid, the grey elements show an EFD stencil consists of vi,j and
the three values it depends on at time step i and asset price step j. Note that (x)+ ≡ max(x, 0)
nique [30]. By discretising over Z = lnS, Equation 4.9 can be obtained, where α, β and γ are
constants independent of j.
vi,j = αvi+1,j−1 + βvi+1,j + γvi+1,j+1 (4.9)
α =
1
1 + rΔt
(− Δt
2ΔZ
(r − σ2/2) + Δt
2ΔZ2
σ2)
β =
1
1 + rΔt
(1− Δt
ΔZ2
σ2)
γ =
1
1 + rΔt
(
Δt
2ΔZ
(r − σ2/2) + Δt
2ΔZ2
σ2)
Figure 4.1 illustrates how Equation 4.5 updates nodes in a 5 × 6 grid. As an initial condition (bound-
ary condition), the values for nodes in the rightmost column is calculated by max(K−SN+1,j, 0) and
the algorithm runs from right to left.
The European put option payoff can be obtained by
vEUi,j = vi,j. (4.10)
With everything else being the same, Equation 4.11 can be used to calculate American option payoff,
4.2. Option Payoff Evaluation 71
where K is the strike price of the American option, K − jΔS is the early exercise price.
vAMi,j = max(vi,j, K − Sj) (4.11)
Barrier option evaluations can be done in a similar way. Equation 4.12 shows the payoff function for
a up-and-out barrier, where B is the barrier level and B > S0.
vBRi,j =
⎧⎪⎪⎨⎪⎪⎩
vi,j, if Sj ≤ B
0, otherwise
(4.12)
Similarly Equation 4.13 is used for Bermudan options payoff evaluation, where D is a set of pre-
deﬁned exercise dates and Ti = iΔt.
vBMi,j =
⎧⎪⎪⎨⎪⎪⎩
max(vi,j, K − Sj), if Ti ∈ D
vi,j, otherwise
(4.13)
Equation 4.10 and Equation 4.11 are the payoff functions of the options under discussion in this
chapter. Though Barrier and Bermudan options have different payoff functions, they share the same
EFD scheme as European and American options. We now look into the Asian option PDE.
Asian PDE
Asian option is different from European or American options due to the its special payoff function.
Equation 4.14 shows the payoff function of an Asian call option, where S¯T is the arithmetic average
of S between time 0 and T under continuous sampling.
(S¯T −K)+ (4.14)
72 Chapter 4. Explicit Finite Difference Method
As the Black Scholes PDE does not capture the Asian type path dependence, the Asian PDE is re-
quired to valuate Asian options [112], as shown in Equation 4.15:
∂u
∂t
+
1
2
(z − e−tqqt)2σ2∂
2u
∂z2
= 0 (4.15)
u(T, z) = u(t, zN) = (z −K1)+ (4.16)
where Equation 4.16 is the initialisation function, u is a transformed variable representing the dynam-
ics of the option payoff, q represents a continuous dividend yield, and qt is an intermediate variable
deﬁned in Equation 4.21. The price v of the Asian Option can then be obtained by:
v = S0 · u(0, z0) (4.17)
where S0 is the price of the underlying asset at time zero and z is the variable under discretisation,
initialised by Equation 4.23.
Similar to the Black Scholes PDE, by applying:
∂u
∂t
=
ui+1,j − ui,j
Δt
(4.18)
∂2u
∂z2
=
ui+1,j+1 + ui+1,j−1 − 2ui+1,j
Δz2
(4.19)
to Equation 4.15, and assuming q = 0, we get the discretised form of the Asian PDE, as shown in
Equation 4.20, where a, b and ϕ are the coefﬁcients of the function. Each option pricing problem has
a difference update function.
ui,j = (1− A)ui+1,j + 1
2
· A(ui+1,j+1 + ui+1,j−1)
= A(i, j)(ui+1,j+1 + ui+1,j−1 − 2ui+1,j) + ui+1,j (4.20)
4.2. Option Payoff Evaluation 73
?
?
?? ?????
????? ??? ???
???? ??????
????????
????????
??
?????
???
???
??????
??? ?? ????????
??? ?????????
?
?
???????????
Figure 4.2: Explicit Finite Difference Grid for Asian Option Pricing.
where A(i, j) =
Δtσ2(zj − qt)2
Δz2
= a(zj − b+ e
rti
ϕ
)2
qti =
1
rti
(1− e−r(T−ti))
a =
Δtσ2
Δz2
, b =
1
rT
, ϕ = rTerT (4.21)
z0 =
1
(r − q)T (e
−tiq − e−rti)− e−rT K
S0
(4.22)
= −e−rT K
S0
, ti = iΔt (4.23)
since t0 = 0 and q = 0.
As shown in Figure 4.2, The Asian option PDE can be discretised by a grid similar to Figure 4.1. In
fact, such a grid is typical for many option pricing problems, as long as a PDE is provided. However,
the equations for discretised forms can vary between PDEs. In Section 4.3 we discuss a parallel
architecture for PDEs.
74 Chapter 4. Explicit Finite Difference Method
4.3 Parallel Architecture Exploration
The ﬁnite difference scheme discussed above can be implemented efﬁciently in software by two
nested for-loops iterating round a one dimensional array, with an outer loop stepping t backwards from
T to 0, and an inner loop calculating the price for each node at level t in the grid. The array needs
to be large enough to hold intermediate values at each time step, therefore the amount of memory
required in the implementation is dependent on ΔS but not on Δt. However, realising the ﬁnite
different model in hardware, allowing parallel scalability while retaining full-pipelining, generality
and ﬂexibility in the design, is less straightforward. This section discusses the categorisation and
hardware independent architecture of the EFD method.
4.3.1 Categorisation of EFD
It is proposed that, when designing FPGA architectures, the explicit mechanism can be classiﬁed into
four categories in terms of parameters change per node calculation in the EFD grid over space and
time, according to the discussion in Section 4.2.3.
• Constant problems are the most hardware resource efﬁcient, as the parameters can be pre-
calculated by CPU and transferred to the FPGA at startup. The same parameters can then be
used throughout the pricing procedure, allowing over 50% less hardware resource utilisation
compared to time-space variant problems. Simple European option pricing with change of
variable technique falls into this category (e.g. Equation 4.10), since the update coefﬁcients do
not change with time or asset price levels.
• Time variant problems can be optimised to allow pre-calculation of the parameters for the next
time step which update the current time step , so are more logic resource intensive than constant
problems, but can use area efﬁcient sequential operators for coefﬁcients calculation.
• Space variant problems can be optimised to construct a lookup table of parameters before-
hand, which can be re-used at each time step, and therefore it is more memory intensive but
requires less calculation than time variant problems. American option pricing is one of these
4.3. Parallel Architecture Exploration 75
?? ?
???????
?????
???????
?? ???????
???????
??????????
???????
?????????????
???????
????????
??????????????????????
??????????????????
??
??
??
????
????
???
Figure 4.3: Explicit Finite Difference Method: Four Categories.
problems, as parameter Sj varies at each asset price step, but it does not change with time (e.g.
Equation 4.11).
• Time-Space variant problems cannot be optimised like the other three, and all parameters cal-
culations need to be done in hardware, so it is the most logic resource intensive yet most general
amongst the four categories. Asian option pricing falls into this category (e.g. Equation 4.20).
The relationship between the four categories is shown in Figure 4.3, with the Time-Space variant
problems being the most generic and Constant problems being the least generic.
Figure 4.4 shows the workload distributions for each the four categories in a heterogeneous system
with a CPU and an FPGA. For constant problems, since the parameters only need to be calculated once
per option, the host processor pre-calculates these parameters and send them to FPGA at initialisation,
in order to minimise FPGA resource consumption per kernel.
For time variant problems, since the parameters only vary between time steps, it is possible to have the
host processor pre-calculating the parameters for the next time step and send them to the FPGA. The
host side computation and data communication can be overlapped with FPGA computation, which
hides the communication overhead while saving FPGA logic resource consumption.
For space variant problems, the host processor pre-calculates M sets of parameters, to cover all pos-
sible space variations. The parameters are sent to FPGA at initialisation together with the stencil
76 Chapter 4. Explicit Finite Difference Method
???
???????
??????
?????
????
???? ???????????
??????????????????
????????????????????
?????????????? ???????
?????????
???
???????
??????
?????
????
???? ??????? ??
?????????????? ?
????????????????
??????????????????? ????????
?????????
???
???????
??????
?????
????
???? ???????????
?????????????????????
????????????? ?????????????
???????????? ????????? ???????
???? ??
???
???????
??????
?????
????
????
??????????????????????
??????????????????????????
????????????
???????? ?????????????
?? ????????? ?? ???????????????
Figure 4.4: Workload distributions of the four explicit ﬁnite difference categories in a heterogeneous
system.
parameters, to avoid unnecessary FPGA logic resource consumption, at a cost of increasing on-chip
memory consumption.
For time-space variant problems, as the parameters are different for each node calculation in the EFD
grid, the parameters computation has to be included in the FPGA datapath, so as to avoid communi-
cation overhead and guarantee full-pipelining. In general, time-space variant problems are the most
hardware logic resource demanding for FPGAs.
4.3.2 Theoretical workﬂow
The general EFD procedure can be described mathematically. Each option can be represented by a
tuple κ ≡ (S,K, T, r, σ). Discretising in both the time space and asset price space, a grid Z can
be obtained. Two examples of the grid are shown in Figure 4.1 and Figure 4.2. In Figure 4.1, Zi,j
corresponds to the node value v at time ti and asset price level Sj , and in Figure 4.2, Zi,j corresponds
4.3. Parallel Architecture Exploration 77
to the node value u at time ti and asset price level zj . Since the grid is updated column by column
from right to left, at each particular time ti the grid can be viewed as a single column where each grid
node within the column is updated, based on the values of the grid nodes in the column at time ti+1.
For simplicity, the entire grid is viewed as a single column being updated along the time line from
now on. Figure 4.5 shows the proposed mechanism for intra-column spatial parallelism. We split
each column in the grid into NM equally sized sub-columns on the price dimension, with each sub-
column covering J elements, where J = M
NM
. For simplicity, it is assumed M can be divided exactly
by NM , where M is the number of rows (asset price steps) in the original grid. It is ideal to have all
the sub-columns being processed simultaneously at each time step, however, there is data dependency
at each splitting point. This can be overcame by having the entire column updated by updating each
sub-column independently at ti, with synchronisation occurring between adjacent sub-columns just
before stepping from time ti to ti−1. To allow synchronisation, data duplication is required within each
sub-column, in the case of a two dimensional grid, the actual size of each sub-column isMsub = J+2.
To allow intra-column parallelism, we deﬁne a new indexing scheme for the grid in Figure 4.2 as the
following. n is the index of the sub-column, 0 ≤ n < NM ; i is the index of time step, 0 ≤ i ≤ N ;
and j is the index of price step within a sub-column, 1 ≤ j ≤ J .
Un,i,j ≡ ui,n×J+j−1
Un,i,0 ≡ ui,(n−1)×J−1
Un,i,J+1 ≡ ui,n×J
Zn,i,j ≡ zi,n×J+j−1
Zn,i,0 ≡ zi,(n−1)×J−1
Zn,i,J+1 ≡ zi,n×J .
The initialisation procedure is deﬁned by init(ℵ,, Q) → , where ℵ is the set of nature numbers
used as index,  is the set of real numbers used as initial values, Q is a set of parameters dependent
78 Chapter 4. Explicit Finite Difference Method
??????
????????????
??????????
??? ???
??? ???
??
??
??
??
?
??
???
??
??
???
???
???
???
???
?
???
??????????
?? ????
???????
??
??
??
??
?
???
??
??
??
??
??
?
???
?????????
???
?????????
????????
??
??
??
??
?
???
??
?????????
????????
??
??
???
??
??
???
???
???
?
??
??
??
??
?
??
???
???
??
??
?
??
???
???
???
??
??
??
???
???
???
??
??
??????????????????
??????????????????????????
???????????????
Figure 4.5: One column view of the grid shown in Figure 4.2 at time step i, and the original column is
split into NM equally sized sub-columns each with size Msub = J + 2. This entire column is updated
along the time line, synchronisation occurs between adjacent sub-columns just before stepping from
time i to i− 1.
to a particular option pricing problem, and q ∈ Q:
U ′n,0,j ← init(n× J + j, Zn,0,j, q), where 0 ≤ j ≤ J + 1.
The rightmost column in the grid is initialised by the initialisation procedure, and then the stepping
procedure can begin. The stepping procedure is deﬁned based on the update function f(ℵ,ℵ,,,, Q) →
:
U ′n,i−1,j ← f(i, n× J + j, Un,i,j−1, Un,i,j, Un,i,j+1, q) , where 0 < j < J + 1 (4.24)
U ′n,i−1,0 ← U ′n−1,i,J (4.25)
U ′n,i−1,J+1 ← U ′n+1,i,1. (4.26)
For different types of option pricing problems, different initialisation functions and stepping functions
4.3. Parallel Architecture Exploration 79
Name init : (ℵ,, Q) →  f : (ℵ,ℵ,,,, Q) → 
European (K − Sj)+ u′EU = αu1 + βu2 + γu3
American (K − Sj)+ u′US = max(K − Sj , u′EU )
Asian (zj −K)+ u′AS = 12 ·A(i, j)(u1 + u3 − 2u2) + u2
Barrier (Sj > qj)?(K − Sj)+ : 0 u′BR = (Sj > qj)?(αu1 + βu2 + γu3) : 0
Bermudan (K − Sj)+ u′MB = (ti = qi)?u′US : u′EU
Table 4.3: Example Option Pricing problems
is deﬁned by the user. Table 4.3 lists such functions for various option pricing problems. It can also
be seen from the deﬁnition of the stepping functions that for Time-Space variant Asian option pricing
problems, the update function is sensitive to time i and space j; and for constant European option
pricing problems the update function is not sensitive to either i or j.
There are in fact two kinds of data dependencies exhibited in this procedure. The ﬁrst is the intra-
column spacial dependency indicated in Figure 4.5, which is introduced by allowing parallelism in
the same time step for an EFD problem, as shown Equation 4.25 and Equation 4.26. The second is
the intrinsic inter-column temporal data dependency shown in Equation 4.24, such that the values of
nodes in the entire column at time step i depends on the node values at time step i+ 1.
The intra-column data dependency greatly reduces the efﬁciency of deeply-pipelined reconﬁgurable
architectures when parallelism is introduced within a time step. This is because the second to last
element of any sub-column, which is required at the beginning of the next time step, is usually the
last element to enter the pipeline, causing pipeline draining at every time step. Section 4.3.3 proposes
a way to overcome this problem, with careful rearrangement of execution order.
4.3.3 Hardware Independent Architecture
Based on the theoretical workﬂow, a method is proposed to map the workﬂow to hardware, and to
suit different performance requirements and hardware resource constraints. Two levels of parallelism
can be exploited for the EFD procedure:
• Coarse-grained parallelism: the valuation of multiple options in parallel; as more coarse-
grained parallelism is introduced, more options can be priced at the same time.
80 Chapter 4. Explicit Finite Difference Method
?????
????
???
?????
????
?????
????
???????????
?????
????
???
????
????
?????
????
??????????????????????
????
??????????
?????????
???????????
???
??????????
??????
????????
??????
??????
??????
???
???
???
??? ?? ????
??????????
?????? ?
??????? ???
?????????
????????
??????????????
? ???????
????????
????????????? ????????? ???????????
Figure 4.6: Generic architecture for computing the ﬁnite difference model.
• Fine-grained parallelism: the valuation of multiple grid nodes for a single option in parallel;
as more ﬁne-grained parallelism is introduced, the valuation speed per option becomes faster.
As the amount of computational resources are limited, trade-offs between the coarse and ﬁne gran-
ularities need to be made dependent on the user requirements and the target device. The proposed
method allows the user to predict the performance across a spectrum of potential parametrisations
and choose the most appropriate one according to the user’s application requirements to latency and
parallelism; a corresponding concrete architecture is then generated. The method discussed in this
subsection is applicable to all EFD models for ﬁnancial options with one underlying asset. For op-
tions with more than one underlying assets, the method needs to be extended to cope with on-chip
memory size restrictions and off-chip memory bandwidth restrictions, as both the amount of memory
required to store intermediate data and the amount of data exchanged between ﬁne-grained cores in
the same coarse core per time step grows exponentially with the number of underlying assets.
General Architecture
According to Table 4.3, it can be seen that a regular stencil, which convolves three previous calcu-
lated values to update the next node in the grid, can be found in all explicit ﬁnite difference solvers.
The general architecture shown in Figure 4.6 is designed based on this analysis and a number of
prototypes. The architecture is comprised of the following components:
4.3. Parallel Architecture Exploration 81
• Main Controller: the Main Controller co-ordinates the overall process and communicates with
software.
• Coarse Core: the Coarse Core is the main processing unit. Each Coarse Core processes one
option at a time, so duplicating Coarse Cores will increase coarse-grained parallelism. Each
Coarse Core consists of one or more Fine Cores, and is assembled from multiple interconnected
Fine Cores.
• Fine Core: the Fine Core is the basic computational unit. As the overall throughput is pro-
portional to the throughput of the Fine Core, it is designed to be a fully pipelined block which
takes a set of inputs and produces a set of outputs at each clock cycle. If the architecture is used
to solve a different problem, only the Fine Core is changed. In the proposed method the Fine
Core is generated from a high level description. Many Fine Cores are wired up to assemble a
Coarse Core for higher throughput, yielding more ﬁne-grained parallelism. However, one Fine
Core alone can still form a single-core Coarse Core.
• Data Module: the Data Module adopts double buffering to fully utilise the pipeline. FPGA
embedded memory is used to implement the Memory Module.
• Memory Controller: the Memory Controller couples the Memory Modules and Coarse Cores,
and makes sure that data are retrieved and stored according to the correct schedule. It is capable
of providing one set of parameters to a Fine Core per clock cycle.
• Initialiser and Finaliser: the Initialiser initialises the memory module by setting up the ini-
tial option prices, and the Finaliser ﬁnishes the process by preparing data to be sent back to
software.
In order to use high-latency and deeply-pipelined functional units to achieve high clock rates while
still achieving high throughput, we continuously stream parameters into the Fine Core pipeline to
evaluate nodes in the same column. In this way, the design is fully pipelined and one grid node is
evaluated per Fine Core per clock cycle.
Apart from parallelism in time (pipelining), we exploit spatial parallelism (replication) for the EFD
procedure. One possibility is to exploit ﬁne-grained parallelism. As discussed in Section 4.3.2, there
82 Chapter 4. Explicit Finite Difference Method
??? ???
??? ?????
???
??
???
???
??
?
??
??
??
??
??
??
??
??
??
???
??
??
??
?
?????
??
??
??
??
??
??
???
??
??
??
???
??
??
??
??
??
??
???
??
??
??
???
?????????
?????????
Figure 4.7: Proposed realisation of ﬁne-grained parallelism, based on Figure 4.5.
are data dependencies between any two consecutive columns in the grid to be processed, therefore
column-wise parallelism is not possible. On the other hand, the nodes in the same column are in-
dependent of each other and so can update in parallel. Similar to the theoretical demonstration in
Figure 4.5, in the simplest case shown in Figure 4.7, the original grid is split horizontally into two
halves of the same size. The upper half and lower half of the grid are updated concurrently by two
ﬁne cores, in order to double the processing speed per coarse core. If the column being iterated
over is divided into NM separate arrays and processed simultaneously by NM Fine Cores, the spatial
parallelism is increased by a factor of NM .
However, the deeply-pipelined functional units can impose problem to ﬁne grained parallelism: for
naive implementations where the elements in the sub-column are processed one by one in the order
of 1 . . . (J + 1), the intra-column data dependency greatly reduces the efﬁciency of deeply-pipelined
reconﬁgurable architectures. Since the (J + 1)th element of any sub-column, which is needed in
the swap at the end of any time step, is the last element to enter the pipeline, pipeline draining will
occur at each time step. To overcome this, a inwards interleaving accessing pattern is used to avoid
pipeline draining. A simple example is shown in Figure 4.8, where the ﬁrst and last elements in the
sub-column are streamed into the pipeline ﬁrst, and the entire sub-column is then updated from the
two ends inwards to the middle. In particular, assuming we have two ﬁne cores in a coarse core, and
imagine the two cores processing two identical sub-columns each containing 6 elements. The bound-
ary elements for each sub-column are omitted for simplicity. To begin with, element 0 is calculated
4.3. Parallel Architecture Exploration 83
?
?
?
? ? ?
?
?
?
?
?
????
??????
????????
???????
?
?
?
?
? ? ?
???????
?
?
?
?
? ? ?
???????
?
?
?
?
? ? ?
???????
?
?
?
?
? ? ?
???????
?
?
?
?
? ? ?
???????
?
?
?
?
? ? ?
???????
????????
?????
?????
???? ???????? ??????????????????????????????
???
Figure 4.8: Simple inwards interleaving accessing pattern working with a pipeline.
with its associated values in both of ﬁne cores. This is followed by the calculation of element 3, ele-
ment 1 and element 2. Based on Figure 4.7 we know data swap of the boundary elements between two
ﬁne cores happens after all elements in the sub-column are processed. If the normal execution order
is used, element 3 will still be in the pipeline when the data swap occurs, causing pipeline stalling
and affecting performance. With the proposed access pattern and with enough elements in the sub-
column, the ﬁrst and last element are always the ﬁrst to leave the pipeline, allowing full pipelining
in the implementation. In this way, the (J + 1)th element is processed at the beginning, therefore it
will be ready for swap as long as the pipeline depth is smaller than Msub. A traded American option
may have a time to maturity of 20 years, if the underlying asset price is sampled on a daily basis, the
EFD grid will have N = 7300 time steps. The number of asset price steps M depends on the actual
problem. A typical EFD grid setting is ΔZ = σ
√
3Δt [30], if σ = 0.5 and Z has a range of 100,
we require M = 2206, which means 2206 EFD grid nodes can be evaluated in parallel per time step.
As the total pipeline latency is typically of the order of tens of clock cycles (depending on the actual
device), the ﬁne parallelism factor NM can therefore be in the order of tens.
Figure 4.9 shows a more detailed architecture based on the generic architecture shown in Figure 4.6,
which exploits both ﬁne-grained parallelism and coarse-grained parallelism. To allow data exchange
between adjacent sub-columns at the end of each time step, adjacent pairs of data modules have ded-
icated communication channels between them, in order to handle the intra-column data dependency.
Coarse-grained parallelism allows more than one option to be priced at the same time, by duplicating
the architecture shown in Figure 4.9. However, a straight forward implementation can lead to a less
84 Chapter 4. Explicit Finite Difference Method
???
?????
??????
?????????
????
?????
? ??????
????
??????
??????
???
?????????
????
???
???????
??????
???????
??????
???????????????????????
?????
??????
?????????
????
?????
? ??????
????
??????
? ???????
???
?????????
????
???
???????
??????
???????
??????
?????????????????????????
?????
??????
?????????
????
?????
? ??????
????
??????
? ???????
???
?????????
????
???
???????
??????
???????
??????
?????????????????????????
???
??
???
???
??
???
???
?
??
???
???
??
???
???
??
???
???
?
??
???
???
?????????
???????
?????????
???????
?????????
???????
??
???
???
??
???
??
??
???
???
???????????
???????????
???????????
???????????
???
???
Figure 4.9: A detailed architecture showing how intra-column data dependency is handled in the EFD
option solver.
efﬁcient design: if there are more options to be priced than the number of Coarse Cores, the options
in the queue need to wait until the pipeline is drained and the last element in the grid is calculated for
the current option. Option level pipelining is used to overcome this problem, by duplicating the data
module within the processing module. While one data module is providing data to the datapath, the
other module is initialised. In this way the processing module can switch to process other option and
avoid pipeline draining, as well as hiding the communication cost between the host processor and the
FPGA.
To one extreme, we can have a fully temporal architecture where there might be only one ﬁne core
in a coarse core, in such case no data exchange occurs at any time step, but each time step takes M
iterations to ﬁnish. On the other hand, we can also have a fully spacial architecture (in which case we
4.3. Parallel Architecture Exploration 85
????????
??????????????????
??????????????????????
??????????????????
??????????????????????
???????????????????????
??????????
???????????
?? ??????? ??????????????????
????????????????? ?????????????????
?????????????????????
????????????????
??????????? ?????????????????
???????????
??
?????????????????????
???
Figure 4.10: Generating Coarse Cores by customising the number of Fine Cores targeting a given
ﬁnite difference model. DAG stands for Directed Acyclic Graph.
have a systolic array approach) , where the number of ﬁne cores exist in a coarse core equals toM . In
this way one time step will only take one iteration, but there will be M data exchanges at each time
step.
To achieve pipelining between major components in a fully temporal architecture, a low speed storage
buffer is needed to store the updated values in the sub-column and a high speed buffer is needed for
data exchange; in the fully spacial architecture, a high speed buffer is required to store updated values
in the sub-column, and a low speed storage buffer is required to store exchanged data. Since the
pipeline depth is high, in the fully temporal case each sub-column needs to have enough data to
keep the pipeline busy; in the fully spacial case enough number of option is a prerequisite for full
pipelining. However due to hardware resource limitation it is not feasible to use the fully spacial
architecture unless the number of rows in the EFD grid is small, in which case the result becomes
numerically inaccurate, therefore the former is considered in this chapter.
86 Chapter 4. Explicit Finite Difference Method
?????????
????????
????
?????? ????????? ? ?
Figure 4.11: Example architectural design for the block Fine Core in Figure 4.9 and Figure 4.12. The
solid black boxes denote inputs and output.
4.3.4 Design Automation
To automate the procedure of customising the amount of ﬁne-grained spatial parallelism, we propose
a method for developing a coarse core generator for different applications which is shown in Fig-
ure 4.10. The generator takes in three parameters: the expression of the equation for a given ﬁnite
difference model (for example, f : (ℵ,ℵ,,,, Q) →  shown in Table 4.3), the precision (single
or double precision, and can be extended to cover any number representation) and the parallelism in
terms of the number of Fine Cores.
The expression is parsed into a Directed Acyclic Graph (DAG), with each node in the DAG mapping
to a hardware arithmetic operator. The DAG is used to generate the Fine Core description in the target
platform; an example of a European Fine Core can be found in Figure 4.11. For each vi,j evaluation, it
consumes a tuple of inputs provided by the Memory Controller: three previously evaluated grid node
values (vi+1,j−1, vi+1,j and vi+1,j+1) and a set of parameters. In this case the parameter set contains
three constants (α, β and γ).
The control logic can be derived from the desirable data movement between Fine Cores. The Fine
Cores are replicated as speciﬁed by the user and connected to each other by appropriate interconnec-
tion logic to form the main part of a Coarse Core; a controller is also included in the Coarse Core
to control data communication between Fine Cores. Figure 4.12 shows a detailed architecture of a
proposed data module and a Fine Core. To allow inwards interleaving memory access (Figure 4.8),
two sets of FIFO registers are used so that only one memory access occur at each clock cycle. The
4.3. Parallel Architecture Exploration 87
??
??
??
???
???
???
???
???
???
???
??
???
??
??
??
?
???
???
???
???
???
???
???
???
???????????
?????
????
?????
???
???
??
?????
????
???????
????
?????
??
?????
????
???????
??
??
??
??
???
?
????? ???????????
?????????
?
?
?
??
?
?
??? ???? ?????
??? ???? ?????
?
??
??????
????????????
Figure 4.12: Detailed architecture for the connection between Data Module and a Fine Core. The
Fine Core is automatically generated from an update function shown in Table 4.3.
upper set of registers store values from a window sliding downwards in the array and the lower set of
registers store those from a windows sliding upwards. The two sets of registers feed the pipeline in an
interleaving manner so that the pipeline is always busy. Once the results came out from the pipeline,
the 1st and the (J+1)th element which are stored in registers are updated ﬁrst, to allow instantaneous
data swap at the end of each time step. The generated Fine Core is connected to the Data Module and
the control to form a complete processing module. The Fine Core is then duplicated to form a Coarse
Core.
The Coarse Core description can then be compiled for simulation or for hardware generation. The
precision and parallelism parameters are adjusted iteratively to suit user speciﬁcation based on result
accuracy, throughput and power consumption, from the simulation and hardware generation result.
Currently our designs target both the Handel-C language and the MaxJava language, as explained in
Section 4.41.
In general, Fine core can be created to price any option with one underlying asset, not limited to
the ones shown in Table 4.3. In addition it can also be used to solve other PDEs such as the one-
dimensional Heat Equation: the array needs to be initialised with temperature values, and parameters
α, β and γ will be set according to the Heat PDE.
The Fine Core is the most resource intensive basic component in the architecture. In the asymptotic
1The Handel-C ﬂoating point library in use is discontinued by the hardware vendor; therefore the designs are also
targeted to the Maxeler hardware
88 Chapter 4. Explicit Finite Difference Method
case we would expect the overall performance to be dominated by the size and the speed of this block,
as other components consist of memories and a small amount of selection logic.
The architecture described in this section also allows the application of the dynamic constant recon-
ﬁguration technique, since parameters α, β and γ change slower than the other input parameters [28].
Such a technique can allow ﬁnancial institutions to run real time risk calculations in a faster and more
energy efﬁcient matter in managing large portfolios. This is discussed in Chapter 5.
Performance Analysis
Since the proposed architecture is parameterised and iterative, we develop an analytical model to
enable fast design-space exploration. There are existing simulation-based analysis frameworks such
as [113] which can be used for rapid virtual prototyping and performance prediction. However, in
many cases performance prediction needs to be performed very early in the design-process, before
any concrete implementations are available. Our analytical model has two signiﬁcant applications:
a) given the available FPGA size, estimate the execution time for a given problem, by ﬁnding the
maximum possible number of cores replicatable on the FPGA; b) given execution time (performance)
requirements, determine the number of cores needed, and determine the most appropriate FPGA.
The model works under the following assumptions:
• The design is fully pipelined. This is true in the design for option payoff evaluation calcula-
tion discussed in this chapter, but the same is not applicable to the input/output logic. This
assumption is made to simplify the problem.
• The data transfer overhead is negligible. It is possible to overlap data transfer with calculation
when a larger number of option payoff evaluation requests exist. As a result only the data
transfer for the ﬁrst option is signiﬁcant; if the number of requests is large enough the data
transfer overhead becomes negligible.
• The on chip BRAM is used so that enough bandwidth is guaranteed to allow large number
of option payoff evaluation requests to be handled simultaneously. This is true in the design
4.3. Parallel Architecture Exploration 89
considered in this chapter, however if the problem data is too large to be stored on chip, the
architecture proposed in this chapter is no longer applicable and a stream based architecture
should be considered [117] to handle multiple time steps in the pipeline.
• The hardware resource available on the FPGA chip is utilised fully. In reality this can not be
true due to congestion, in the interest of problem simpliﬁcation we consider this to be the case
in this chapter.
Under the aforementioned assumptions, it is possible to use the proposed model to obtain a quick
performance estimation for a design. We now discuss the model in detail.
There are two types of hardware resources required to implement the proposed architecture: memory
resource and logic resource. Memory is used as storage for the grid, while logic is used in the control
and datapath. The amount of memory r required per Fine Core can be calculated by Equation 4.27,
where M is the number of elements per column in the grid, RFPGA is the number of rows per unit
memory on chip, WFPGA is the datawidth of the on-chip memory and Wdata is the number of bits
used in the datapath.
r =
⌈
M
RFPGA
⌉
×
⌈
Wdata
WFPGA
⌉
(4.27)
The amount of logic ulogic consumed by the datapath can be represented by Equation 4.28, given a set
of numerical operators ϕ.
ulogic =
∑
i∈ϕ
(num(i)× util(i)) (4.28)
where num(i) → N, i ∈ ϕ is a function to return the number of instances of operator i; util(i) →
N, i ∈ ϕ is a function to return the resource consumption of operator i. We consider the total logic
resource consumption approximately equal to u, since the datapath logic involves single precision or
double precision ﬂoating point calculations which are much more resource intensive than the control
logic.
90 Chapter 4. Explicit Finite Difference Method
The number of Fine Core duplications c within a given FPGA can be calculated by Equation 4.29:
c = min
(⌊
LFPGA
u
⌋
,
⌊
MFPGA
r
⌋)
(4.29)
where LFPGA is the total amount of logic resource available on the FPGA and MFPGA is the amount
of memory resource available on the FPGA chip2.
Given a clock frequency clk, the throughput of the whole FPGA in terms of nodes/second is calcu-
lated by
throughput = c× clk
since the proposed architecture is fully pipelined. The clock frequency can be predicted since it is
bounded by the ﬂoating point operators; the rest of the system is highly pipelined.
If there are a total of s options waiting to be priced on the FPGA, each with problem size d, deﬁned
as the number of intermediate node calculations in the grid, the total execution time can be estimated
by
texe =
s× d
throughput
(4.30)
For ﬁnance related EFD problems, d can be estimated by M3 × σ2 × T , where M is the number of
rows in the grid, σ is the stock volatility and T is the time to maturity [114].
If we reverse Equation 4.30 to calculate throughput given execution time requirement, given an ex-
pected clock frequency we are able to calculate the required number of cores c, as in Equation 4.31.
c =
s× d
texe × clk (4.31)
2The model can be extended to support off-chip memories available to the FPGA, but needs to be constrained by the
available off-chip memory bandwidth, which is subject to future work
4.4. Implementation of the Method 91
The resource requirement for the duplicated design is then calculated by Equation 4.32.
udup = c×
∑
i∈ϕ
(num(i)× util(i)) (4.32)
These equations are used in Section 4.5.4 to compare theoretical result with experimental result.
4.4 Implementation of the Method
In this section we discuss the implementation of the proposed architecture on FPGAs. The FPGA
implementations are based on the architecture proposed in Section 4.3, which fully utilise hardware
resources through pipelining and spacial parallelism.
4.4.1 FPGA Implementation
The FPGA implementations of all the modules described in Section 4.3.3 are based on the Hyper-
Streams ﬂoating point library, the Handel-C programming language, and the MaxJava programming
language [96]. HyperStreams is a high-level abstraction based on the Handel-C language [56], which
can be used to implement fully pipelined datapath for ﬂoating point arithmetic. Similarly, MaxJava is
a Java based stream programming language. The Fine Core generator is based on a C-based domain
speciﬁc language deﬁned in Chapter 3. The domain speciﬁc language is parsed by a parser generated
by ANTLR parser generator [104] and the result is targeted to either HyperStreams or MaxJava.
Figure 4.13 shows the implementation details of the Fine Core logic generated to support American
option payoff evaluation, which is connected to part of the data module shown in Figure 4.12, to
avoid overcomplexity in the illustration. A comparator is used to handle early exercise for American
options. Compared to Figure 4.11, the early exercise prices are stored in a lookup table by a template
conﬁguration in the control logic. The lookup table is located in a separate piece of on-chip memory
to allow simultaneous memory reads and ensure one set of parameters can be fed into the arithmetic
core in each clock cycle. For American options the early exercise prices are the same as the initial
92 Chapter 4. Explicit Finite Difference Method
?????????
?? ?
?
?
???
?????
??????? ??? ???
??? ??? ??? ?????????????
??????
??? ???
??
???
???
????????
?
???????
???????
?
??????
????????
Figure 4.13: The data ﬂow of the hardware implementation on FPGA for evaluating American option
payoffs. Note that for clarity the data module in this ﬁgure is represented in partial of what is shown
in Figure 4.12, although the actual implementation follows Figure 4.12. The registers in grey hold
constant parameters for an option; the bold squares are ports connecting to peer Fine Cores to swap
overlapped elements.
option prices in the grid, therefore they can be initialised simultaneously with the initialisation of the
grid, without introducing any further computational overhead.
To fully utilise the pipeline, double buffering is used to get around the FPGA memory port limitation:
data are read from one memory bank and the results are simultaneously written to another. Figure 4.14
captures the register states of a ﬁne core shown in Figure 4.13 in an arbitrary node evaluation during
an EFD run. Since memory accesses are in order, memory reads are pipelined so that vi+1,j and
vi+1,j+1 are reused to compute vi,j+1, and we only need to retrieve vi+1,j+2 from the memory. This
reduces the memory latency by a factor of three. Once the current set of parameters are in the pipeline,
the Data Module sends another set of parameters from the other end of the array based on the inwards
interleaving scheme, until the current time step is ﬁnished. The data at the splitting points are then
swapped via ports s1 and s2 between adjacent Processing Modules (PMs).
Fixed point arithmetic is very efﬁcient on FPGAs, with well-established optimisations such as word-
length optimisation of ﬁxed point arithmetic [72]. However, using ﬁxed point arithmetic requires
careful numerical range analysis to eliminate possible underﬂow/overﬂow, which is not always pos-
sible to do. We discuss the use of ﬂoating point arithmetic in this chapter and discuss utilising ﬁxed
point arithmetic in Chapter 5 for further optimisations.
4.4. Implementation of the Method 93
?? ?
?
?
?????
???????????? ?????? ????????
??? ? ?
??????????
??????
????? ?? ??????
?? ??
???
???
????????
?
???????
???????
?
??????
????????
?????????? ??????????
??
??
??
??
???
?
????
??
??????
?
????? ???????????????????
???????????
?????????????????????????
????????
??????
????????
?????????????
????????????
??????
??? ???
Figure 4.14: The register states of Figure 4.13 in an arbitrary node evaluation during an EFD run.
4.4.2 GPU Implementation
We now consider the implementation on Graphics Processing Units (GPUs). GPUs are another alter-
native to CPUs for computational intensive tasks, and have also been used for ﬁnancial computation,
so they are the key competitor to both the CPUs and the FPGAs [115]. The GPU implementations in
this chapter use CUDA version 2.0 from NVIDIA [116].
In our implementation we assign one option per thread block, and the grid points in the same time
step are updated simultaneously by all threads within a thread block. Multiple options are priced in
parallel, as many concurrently executing thread blocks are supported by the GPU. We use the CPU
to initialise parameters and allocate memory on the GPU, then the GPU initialises multiple grids in
parallel, followed by the option payoff evaluation phase. To exploit the CUDA single instruction mul-
tiple thread (SIMT) model [116], we use double buffering with low latency on-chip shared memory, to
avoid frequent accessing of the GPU’s high latency global memory and enforce memory coalescing.
The original grid is partitioned before copying to shared memory, in order to better utilise the limited
shared memory space. The elements in the same partition are then processed simultaneously by the
threads. The pseudo code of the CUDA kernel for European option pricing is shown in Listing 4.1.
The result is copied back to the CPU memory after the payoff evaluation procedure is complete.
94 Chapter 4. Explicit Finite Difference Method
Listing 4.1: CUDA kernel implementation pseudo code.
vo id PDEgpu ( r e a l ∗ i n pu t , r e a l ∗ ou t p u t ){
s h a r e d r e a l sharedA [NUM THREADS ] .
s h a r e d r e a l sha redB [NUM THREADS ] .
/ / NUM THREADS i s t h e number o f t h r e a d s pe r b lock .
t i d = t h r e a d I d x . x .
f o r ( j = N−1; j >=0; j−−)
f o r ( i = M−1; i >0; i−=NUM THREADS−2){
1 . Copy v a l u e s from i n p u t
t o sharedA .
2 . Synch r on i s e wi th o t h e r
t h r e a d s t o make s u r e
t h e copy i s done b e f o r e
we s t a r t c a l c u l a t i o n .
3 . sharedB [ t i d ] = upda t e f u n c t i o n f o r t h e o p t i o n .
4 . Synch r on i s e wi th o t h e r
t h r e a d s t o make s u r e t h e
c a l c u l a t i o n i s f i n i s h e d .
5 . Update t h e c o r r e s p o nd i n g
s e c t i o n i n i n p u t w i th t h e
v a l u e s i n sharedB .
} }
To support American option payoff evaluation we utilise texture memory on the GPU. The texture
memory is cached, and so provides similar performance to the shared memory, but any changes made
to the texture memory contents can only be seen in the next kernel launch [116]. In our implementa-
tion the texture memory is used as an alternative to shared memory to store early exercise prices for
American option, since the values of early exercise prices stay the same for the entire option payoff
evaluation procedure.
4.5. Results 95
4.5 Results
In this section we study the performance and measured energy consumption of our implementations
on FPGAs, CPUs and GPUs. The testing case is an option payoff evaluation problem based on
6K (price steps)×N (time steps) grids for both European options and American options, where N is
a large enough number to allow the problem run long enough so that execution time and power con-
sumption could be measured for each hardware platform. We also make sure that each test platform
has sufﬁcient tasks to fully utilise its capacity. For example, we price 720 options simultaneously on
Tesla C1060 to make full use of its multiprocessors; 4 options at the same time to utilise all physical
cores in the CPU; and replicate cores to utilise as many resources as possible on FPGAs. The perfor-
mance of both 32-bit single-precision and 64-bit double-precision ﬂoating point implementations are
considered.
All the FPGA and GPU implementations are compared to the software implementations on two refer-
ence Intel PCs, one of which is a 2.53GHz Xeon W3505 dual-core processor with 4GB of RAM, the
other one is a 2.93GHz quad-core i7 870 CPU with 16GB of RAM; both PCs run the Linux operating
system. The software implementations are written in C++, compiled with maximum speed optimisa-
tion options using Intel version 11.1 compiler. One of the target FPGA devices is a Virtex-4 VLX160
on an RCHTX board. The Virtex-4 FPGA designs are compiled using DK5.1 and Xilinx ISE 9.2. The
second target device is a Virtex-6 SX475T FPGA on a MAX3424 card, this is to demonstrate the pro-
posed method on a larger FPGA; the designs are compiled with MaxCompiler and Xilinx ISE 13.3.
The target GPU devices are an NVIDIA Tesla C1060 and an NVIDIA Tesla C2070 both with 4GB
of on-board RAM. Table 4.5 shows summaries of our performance comparison results for European,
American and Asian options implementations.
The power consumption is measured in terms of Additional Power Consumption for Computation [47],
using a FLUKE i30 current clamp together with a Maplin N56FU digital multimeter. The additional
current in the live wire of the main power cord is measured during the computation. The output of
the clamp is fed back through a USB connection via the multimeter and collected by the open source
software QtDMM. The sensitivity of the current clamp is 100mV/A in ±1mA resolution, and the
output is measured in terms of mV . The power consumption is deﬁned as the difference between the
96 Chapter 4. Explicit Finite Difference Method
Table 4.4: Resource consumption of Solutions with a single Fine Core.
European American Asian Device1
Word Length 322 643 32 64 32 64 -
Virtex-4 VLX160
LUT (%) 4 7 4 7 – – 135168
FF (%) 3 4 3 5 – – 135168
BRAM (%) 12 23 18 35 – – 288
DSP (%) 12 50 12 50 – – 96
Clock Freq. (MHz) 120 94.9 116 90.3 – – -
Virtex-6 SX475T
LUT (%) 0.87 1.2 0.96 1.3 4.55 7.29 297600
FF (%) 0.56 0.84 0.61 0.89 2.88 4.89 595200
BRAM (%) 0.7 1.36 1.41 2.73 2.26 3.2 1064
DSP (%) 0.3 1.49 0.3 1.49 1.19 4.51 2016
Clock Freq. (MHz) 280 260 280 260 260 235 -
1 Total resource available on device
2 32-bit single precision ﬂoating point
3 64-bit double precision ﬂoating point
power consumption at computation and the power consumption at idle of the device. This can be con-
sidered as the dynamic power consumption for a particular computation on a particular device. This
is a more useful metric than the power consumption for the entire system as it measures the absolute
power used for computation, and so is directly related to the computational efﬁciency of a particular
device.
4.5.1 Computational Performance Analysis
Table 4.4 shows the FPGA device utilisation for option pricer implementations with a single Fine
Core. The European option solvers belong to the constant category, they are the most hardware
resource efﬁcient. It can be seen that on the Virtex-4 FPGA less than half of the FPGA device is
utilised for both double-precision and single-precision arithmetic, while for the Virtex-6 FPGA, any
single core implementation utilises under 2% of the total on-chip resources.
The American option solvers belong to the space variant category, they are more memory resource
intensive. As a result the logic utilisation of the American option pricers stays almost the same, while
memory usage is about 50% higher than the corresponding European option implementations, since
a lookup table is used to allow the early exercise feature. The American option implementations are
more memory bounded than the European option implementations.
4.5. Results 97
The Asian option solvers belong to the time-space variant category, which is the most logic resource
intensive. It can be seen that the logic resource consumption of the Asian option solvers are almost
four times greater than those of American option solvers, due to more arithmetic operations involved
in the Fine Core. As a result, the Asian option implementations are logic resource bounded.
It is possible to signiﬁcantly improve performance by replicating valuation cores on FPGAs. Exper-
iments are done to replicate kernels over both coarse and ﬁne dimensions. A detailed discussion of
performance impact over ﬁne and coarse grained parallelism is shown in Section 4.5.3, in this section
we compare the best performance of each device.
The acceleration results for European, American and Asian options are shown in Table 4.5, along with
a more convenient comparison matrix in Table 4.6 for American option solvers. The single precision
Virtex-4 FPGA implementation for American option is able to achieve a maximum speedup of 4.1
times over the software on the Xeon CPU. If double precision arithmetic is used, the FPGA is 1.3
times faster than the software. The maximum accelerations the FPGA achieved for American op-
tions are generally less than those achieved by the corresponding European option implementations.
The reason is that additional memory space is needed to store the early exercise prices, making the
American option implementations more memory resource bounded than the European option imple-
mentations. On the other hand, a single American option core is able to achieve a higher speedup:
the FPGA design is fully pipelined, and the performance of the American option core is close to the
European option core; it is not the case in software implementations since more instructions imply
longer execution time. Similarly, the Virtex-6 FPGA implementation is also memory bound.
The American option speedup results for the two GPUs are also shown in the lower part of Table 4.5.
It can be seen that the absolute processing speeds achieved by the American option implementations
on the two GPUs are less than the corresponding GPU European option implementations. However
the speedups achieved by the GPU American option implementations are greater than the European
option implementations. The reason is that for GPUs the memory overhead can be hidden by massive
parallelism, as the GPUs can continue to process other data elements while waiting for the result from
the memory.
It is also interesting to see that for Asian options, the FPGA implementations are over two times
98 Chapter 4. Explicit Finite Difference Method
faster than the GPU implementations, while they are of the similar speed in the American option case
and slightly faster in the European option case. This is due to the transcendental functions involved
in the Asian Fine Core, which are much less parallelised (1/8 of normal operations such as multi-
plication in the Fermi Architecture) and takes more clock cycles to compute (8 clock cycles for a
single precision transcendental operation) in the NVIDIA GPU architecture. On the other hand, for
FPGA implementations the performance mainly depends on the possible number of kernel duplica-
tions on the FPGA chip and the clock frequency. In the Asian option case the clock frequency of the
FPGA implementations are similar to European and American options, and the number of possible
kernel duplications are about 30% less than those of American options, due to higher logic resource
consumption by complicated arithmetic operations such as transcendental functions. As a result, the
performance reduction is much less for FPGA implementations compared to GPU implementations.
4.5.2 Energy Consumption Analysis
If energy consumption is taken into consideration, we ﬁnd that the FPGA implementations are much
more energy efﬁcient than the GPU implementations in general, regardless of device technology
differences.
In terms of energy efﬁciency shown in Table 4.5, an 8-core design for American options on the
Virtex-4 FPGA at 106MHz is 4.5 times more energy efﬁcient than the Tesla C1060 GPU, based on
single-precision arithmetic. If the energy efﬁciency of double-precision arithmetic is considered, the
Virtex-4 implementation outperforms the Tesla C1060 GPU and the Xeon CPU by 2 times and 10.5
times respectively. The FPGA implementations on the Virtex-6 FPGA offer better energy efﬁciency:
the single precision design is over 230 times more energy efﬁcient than the Xeon CPU and 30 times
more energy efﬁcient than the Tesla C1060 GPU; the double precision design is over 108 times and
over 15 times more energy efﬁcient than the i7 CPU and Tesla C2070 GPU respectively. The European
option solution shows a similar result.
Figure 4.15 shows a plot of energy consumption against computational time to compute 6K×30K
single precision American option grids on different devices. It is desirable that a device should appear
4.5. Results 99
????????
?????????
?????
?????
??????
???
?
??
???
???? ??? ? ??
??
??
??
??
??
??
?
??
??
???
??
??
??
?
?? ???????
?????????????????
????????????????
????????????
??
??
??
?
??
??
??
???
???
???
??
??????????? ?????????
???????????????
??????????
???????????????
Figure 4.15: Energy consumption against computational time to compute 6K×30K single precision
American option grids on different devices.
as close to the origin as possible, since the closer to the origin, the more time/energy efﬁcient the
device is. We divide our target devices into two groups in terms of year of introduction, namely a
group of devices manufactured before year 2010 and another group produced since year 2010. It
can be seen that, the Virtex-6 device is 14.6 times faster and 6.6 times more energy efﬁcient than the
Virtex-4 device; the Tesla C2070 GPU is 2 times faster and 1.5 times more energy efﬁcient than the
C1060 GPU; and the i7 CPU is 2.5 times faster but energy efﬁciency is not clearly improved over
the Xeon CPU. In general, the Pareto frontiers are clearly separated without any overlap between
the two generations of devices, indicating a general improvement in these computational platforms
in terms of both energy efﬁciency and computational throughput. In the Pre-2010 group there is a
trade-off between energy efﬁciency and computational throughput: the Virtex-4 LX160 FPGA is the
most energy efﬁcient and the Tesla C1060 GPU is the fastest. If energy consumption is the priority
for the user, FPGA should be used for evaluating American option payoffs, otherwise the GPU will
deliver the highest throughput. On the other hand, in the Post-2010 group, the Virtex-6 SX475T
FPGA dominates both speed and energy efﬁciency over both GPUs and CPUs, which means that it
is the optimal device in terms of both performance and energy consumption for the application of
American option payoff evaluation.
100 Chapter 4. Explicit Finite Difference Method
4.5.3 Parallel Scalability
We now discuss the scalability of the parallel architecture described in Section 4.3.3. The 24 exper-
imental designs are generated by the proposed framework with a total number of up to 32 double
precision ﬁne cores for American option payoff evaluation. These ﬁne cores are then conﬁgured to
form different designs with up to 32 coarse cores. The designs are then targeted to a Virtex-6 SX475T
FPGA on a MAX3424 card, with MaxCompiler and Xilinx ISE 13.3.
Table 4.7 shows the performance comparison between different conﬁgurations of up to 32 ﬁne cores.
The top part of Table 4.7 indicates that, in general, adding more ﬁne cores in the design has negative
effect to the clock frequency. A one core design can run at 260MHz while a 32 core design can only
run at around 110MHz. However, the clock frequency decay does not outweigh the kernel duplication
effect, a 32 Fine Core design is in general about 14 times faster than a single Fine Core design despite
the fact that each core is running at a lower clock frequency.
The lower part of Table 4.7 shows the per-option processing speed of a particular conﬁguration.
Having more Fine Cores in a Coarse Core can increase the per-option processing speed. To one
extreme, the 32 Fine Cores are conﬁgured into one Coarse Core, boosting its processing speed to
3538 million nodes per second; however, only one option can be processed at a particular time. To
the other extreme, the available Fine Cores form 32 Coarse Cores, each containing just one Fine Core;
in this case the per-option processing speed is 110 million nodes per second; however, 32 different
options can be processed in parallel.
In addition, conﬁgurations with the same number of ﬁne cores usually have a similar total processing
speed. For example, with 32 Fine Cores, a conﬁguration of just one Coarse Core and a conﬁguration
of 32 Coarse Cores both have a total processing speed of around 3500 million nodes per second; the
processing speed variations are caused by random place and route results hence negligible.
The trade-off between ﬁne and coarse grained parallelism mainly affects the following main parame-
ters which concerns the user: processing latency and data transfer latency. Processing latency refers to
the time it takes to evaluate the payoff of a particular option. Data transfer latency refers to the time it
takes to transfer data from host to the reconﬁgurable hardware; for example, a design with one Coarse
4.5. Results 101
Core usually suffers less from data transfer latency compared to a design with 32 Coarse Cores, since
the design with 32 Coarse Cores requires 32 sets of parameters to fully utilise the hardware, while
the single coarse core design only requires one set. Data transfer latency is part of processing latency.
As discussed before, with enough payoff evaluation query, data transfer latency can be hidden by
overlapping the data transfer phase with the computation phase.
Figure 4.16 illustrates the execution times of portfolios with up to 64 American Options, solved by
designs with different parallelism conﬁgurations. Each design contains 32 Fine Cores, which are
conﬁgured into different number of coarse cores. These designs are then used to evaluate the payoff
of ﬁve different portfolios with 1, 8, 16, 32 and 64 American options respectively. The execution time
of each scenario includes the data transfer latency, and the total processing speeds of each design are
provided as reference.
It can be seen that, if there are less options in the portfolio than the number of Coarse Cores in the
designs, the execution times of the design tend to be signiﬁcantly higher than those with less Coarse
Cores, since not all cores are utilised, this can be seen from the 1, 8 and 16 options cases. On the
other hand, if the number of options in the portfolio are greater than the number of Coarse Cores in the
designs, all the execution times are similar, as seem from the 32 and 64 options cases. From this result
we can also ﬁnd that the data transfer latency does not affect the total execution time signiﬁcantly.
This is because the transfer rate between the host and the FPGA is high enough (2GBytes/Sec) to
make the initial data transfer latency negligible compared to the computation time.
Based on these results it can be concluded that, if the user requirement is latency centric, the de-
sign should be conﬁgured with less Coarse Cores and more Fine Cores within each Coarse Core;
since more Fine cores imply higher per-option processing speed. On the other hand, if the overall
performance is more important, more Coarse Cores should be duplicated with less Fine Cores within
them; since ﬁner granularity in hardware resource consumption implies a better chance to utilise more
hardware resources available on chip.
102 Chapter 4. Explicit Finite Difference Method
????
????
????
????
????
????
????
????
????
????
????
????
????
????
????
????
????
????
? ? ? ? ?? ??
??
??
???
??
???
??
??
???
??
???
??
??
??
??
?
??
??
??
??
???
??
???
??
??
?
?????????????????????????????????????????????????????
????????????????
????????
?????????
??? ??????
??? ??????
??? ??????
Figure 4.16: Measured execution times of portfolios with up to 64 American Options, solved by
designs under different conﬁgurations (each containing 32 Fine Cores).
4.5.4 Performance Trend Analysis
We now use the model described in Section 4.3.4 and compare the predicted result to the real result
for the Virtex-4 VLX160 device; same procedure can be applied to the Virtex-6 device but is omitted
to avoid repetitiveness. The VLX160 FPGA device has 288 Block memory units, with WFPGA ×
RFPGA = 18bit × 1024 row. Each single precision Fine Core is connected to twoWdata×N = 32bit
× 8192 row memories. Using Equation 4.27 we expect 32 Block memory units to be used. Compared
to our Virtex-4 result in Table 4.4 which uses 34 Block memory, the 2 extra Block memories are used
in peripheral logic to handle communication between FPGA and CPU.
Table 4.9 shows the resource consumption of the ﬂoating point operators involved in our implementa-
tions, obtained from Xilinx CoreGen and ISE 9.2 design suite. For other devices a similar table can be
obtained from tools provided by the FPGA manufacturer. For the single precision European Option
pricer with one Fine Core, which involves two adders and three multipliers, using Equation 4.28, we
estimate that 1528 slices are used by the arithmetic part. If we assume that 3% of the total LUT are
used by control logic and peripheral logic, the total number LUT used will be 5583. Table 4.8 shows
4.6. Summary 103
a comparison between the real and predicted results.
The difference between the predicted numbers in Table 4.8 and the result shown in Table 4.4 has two
causes: a) the actual results are based on the HyperStreams ﬂoating point library, while the predictions
use the Xilinx CoreGen library; b) peripheral logic such as communication logic between FPGA and
CPU is generated by the Handel-C compiler, so its resource consumption can only be estimated. If
the same ﬂoating point library were used, and the peripheral logic is excluded, the predicted value
will be much closer to the actual value.
Figure 4.17 shows the projected trend of resource consumption of the single precision European
Option solver implementation against Fine Core duplications, based on Equation 4.32. The dashed
line represents the amount of a particular resource available on our FPGA, and the solid line represents
the resource consumption of the solver with a certain number of Fine Core duplications. It can be seen
that the design is DSP-bound, but this can be overcome by using LUT-based multipliers instead of
DSP based Multipliers; the design is then Block memory bound which allows us to duplicate just over
10 Fine Cores on the Virtex-4 device. Similar analysis can be carried out for different implementations
on different FPGA devices to reveal the best design before the actual implementation.
4.6 Summary
In this chapter we have described a new methodology for accelerating option payoff evaluation au-
tomatically based on ﬁnite difference method. The parallelism of the proposed architecture can be
customised and generated from a high level description of the ﬁnite difference models, and the archi-
tecture is highly pipelined and capable of evaluate the payoff of multiple options concurrently. We
illustrate the approach by showing three case studies involving the payoff evaluations of European
option, American option and Asian option.
The performance of the generated implementations are the following: the single precision American
option pricer generated by the proposed approach running on a Virtex-4 VLX160 device at 102MHz
demonstrates a speedup of 4 times over a 2.53GHz Xeon W3505 dual-core CPU, while a Virtex-6
SX475T FPGA at 133MHz is 22 times faster than a more recent quad-core i7 CPU. The measurement
104 Chapter 4. Explicit Finite Difference Method
??
???
????
?????
??????
??
?
??
???
???
??
??
??
???
??
??
?
??????
????
?????????
???????????????????
????????? ???
?
? ? ?? ?? ??
??
??
??
?
??
???
???
??
??
??
???
??
??
?
????? ?????????????
?
????????????????
Figure 4.17: Estimated trend of resource consumption of the single precision European Option solver
implementation in Log scale, based on result in Table 4.4. Note that the dotted lines are the amount
of each resource available on a Virtex-4 VLX160 FPGA.
result indicate that the Virtex-4 FPGA is 4.5 times more energy efﬁcient than a Tesla C1060 240-Core
GPU at 1.3GHz, although it is 5 times slower in evaluation speed. The Virtex-6 FPGA is 1.4 times
faster and 19.9 times more energy efﬁcient than a Tesla C2070 GPU. The FPGA implementations
has a greater speedup compared to GPU implementations for Asian options; the single precision
implementation is 2.8 times faster and the double precision implementation is 2 times faster. They
are also 35 times and 25 times more energy efﬁcient than the corresponding GPU implementations.
In terms of energy consumption, for pre-2010 devices there is a trade-off between computational
throughput and energy efﬁciency: the Virtex-4 LX160 FPGA is the most energy efﬁcient and the Tesla
C1060 GPU is the fastest. For post-2010 devices, the FPGA device dominates both computational
throughput and energy efﬁciency: the Virtex-6 SX475T FPGA dominates both speed and energy
efﬁciency over both GPUs and CPUs.
For parallel scalability, we have shown that if the user requirement is latency centric, the design
should be conﬁgured with less Coarse Cores and more Fine Cores within each Coarse Core. If the
4.6. Summary 105
user requires higher overall performance, more Coarse Cores should be duplicated with less Fine
Cores within them.
In the next chapter we shall use EFD method as an example to discuss an optimisation process based
on dynamic reconﬁguration for stencil computation.
106 Chapter 4. Explicit Finite Difference Method
Table 4.5: European, American and Asian option performance comparison between various hardware
platforms.
FPGA GPU CPU
Device 1 V4 V6 C1060 C2070 Xeon i7
Year of Intro. 2005 2010 2008 2010 2009 2010
Word Length2 32 64 32 64 32 64 32 64 64 64
European
Clock Freq. (MHz) 106 82.9 121 120 3 1300 1150 2530 2930
Num. of Cores 8 3 110 60 240 448 2 4
Processing Speed 4 848 248.7 13310 7200 3057 1851 7682 5592 139.4 368
Power (W)6 4.2 4.8 12.1 10.2 118 119 155 157 41.5 103
Energy Efﬁciency 5 201.9 51.8 1100.0 705.9 25.9 15.6 49.6 35.6 3.4 3.6
Normalised Speed 6.1x 1.8x 95.5x 51.6x 21.9x 13.3x 55.1x 40.1x 1.0x 2.6x
Normalised Energy 60.1x 15.4x 327.5x 210.1x 7.7x 4.6x 14.8x 10.6x 1.0x 1.1x
American
Clock Freq. (MHz) 102 80.3 133 110 1300 1150 2530 2930
Num. of Cores 5 2 56 33 240 448 2 4
Processing Speed 510 160.6 7448 3630 2751 1647 5321 3430 124.6 332
Power (W) 5.0 5.2 11.0 11.0 122 121 156 157 42.3 104
Energy Efﬁciency 102.0 30.9 677.1 320.0 22.5 13.6 34.1 21.8 2.9 3.2
Normalised Speed 4.1x 1.3x 59.8x 28.3x 22.1x 13.2x 42.7x 27.5x 1.0x 2.7x
Normalised Energy 34.6x 10.5x 229.9x 108.6x 7.7x 4.6x 11.6x 7.4x 1.0x 1.1x
Asian
Clock Freq. (MHz) - - 123 100 1300 1150 2530 2930
Num. of Cores - - 40 24 240 448 2 4
Processing Speed - - 4920 2400 797 412 1756 1201 46.0 128
Power (W) - - 12.5 12.3 120 121 157 157 42.0 103
Energy Efﬁciency - - 393.6 195.1 6.64 3.40 11.2 7.65 1.10 1.26
Normalised Speed - - 107x 52.2x 17.3x 8.96x 38.2x 26.1 1.0x 2.8x
Normalised Energy - - 357.8 177.4 6.0 3.1x 10.2x 7.0x 1.0x 1.2x
1 We use Virtex-4 VLX160 and Virtex-6 SX475T FPGAs, NVIDIA TESLAC1060 and C2070 GPUs,
and Intel Xeon W3505 and i7 870 CPUs.
2 32-bit single precision ﬂoating point and 64-bit double precision ﬂoating point
3 Clock frequency is different from those reported in Table 4.4 due to congestion on chip
4 In million node evaluations/second. Note that the FPGA performance reported here is based on
the place and route result, the CPU and GPU performance are measured performance ignoring any
data transfer time. In reality the data transfer time can be a bottleneck to the calculation, however if
there are a large amount of option payoff evaluation requests, the data transfer process can overlap
with the calculation of the previous option, therefore the total performance has a small overhead of
the data transfer of one option.
5 In number of million node evaluations/second/Joule
6 Power reported in this table is the Additional Power Consumption for Computation [47]
4.6. Summary 107
Table 4.6: Speed and energy comparison grid for single precision American option implementations.
V4 V6 C1060 C2070 Xeon i7
Speed Comparison
V4 1 0.1 0.2 0.1 4.1 1.5
V6 14.6 1 2.7 1.4 59.8 22.4
C1060 5.4 0.4 1 0.6 22.1 8.3
C2070 10.4 0.7 1.6 1 42.7 16.0
Xeon 0.2 0.0 0.0 0.0 1 0.4
i7 0.7 0.0 0.1 0.1 2.7 1
Energy Comparison
V4 1 0.2 4.5 3.0 34.6 32.0
V6 6.6 1 30.0 19.9 229.9 212.1
C1060 0.2 0.0 1 0.7 7.7 7.1
C2070 0.3 0.1 1.5 1 11.6 10.7
Xeon 0.0 0.0 0.1 0.1 1 0.9
i7 0.0 0.0 0.1 0.1 1.1 1
Row against column comparison in terms of [speed
Comparison] or [Energy Comparison]. For example,
cell [V6,V4] in the [speed Comparison] section of the
table means the V6 device is 14.6 times faster than
the V4 device for American option implementations.
Note that the FPGA performance reported here is based
on the place and route result, the CPU and GPU per-
formance are measured performance ignoring any data
transfer time.
108 Chapter 4. Explicit Finite Difference Method
Table 4.7: Performance Comparison Between Differ-
ent Conﬁgurations of up to 32 Fine Cores. Note that
the FPGA performance reported here is based on the
place and route result.
Num. Fine Cores Per Course Core
Num. Coarse Cores 1 2 4 8 16 32
Clock Freq. (MHz)
1 260 1 195 178 156 127 111
2 198 173 177 127 108 -
4 182 161 139 107 - -
8 167 137 110 - - -
16 137 110 - - - -
32 110 - - - - -
Per Option Processing Speed (Million Nodes/Sec)
1 260 389 711 1246 2026 3538
2 198 346 707 1017 1729 -
4 182 321 557 857 - -
8 167 275 438 - - -
16 137 220 - - - -
32 110 - - - - -
1 This particular cell means a design with 1 Coarse Core,
each coarse core contains 1 Fine Core, is running at a
clock frequency of 260MHz.
2 Per Option Processing Speed is calculated as clock fre-
quency times number of Fine Cores per Course Core.
Table 4.8: Comparing predicted result with real result from Table 4.4
European American
Estimated Actual Estimated Actual
LUT 5583 5406 5583 5406
DSP 12 12 12 12
BRAM 32 34 48 51
Table 4.9: Resource Consumption for Each Floating Point Operators
Adder Multiplier
32 Bit 64 Bit 32 Bit 64 Bit
Slices 473 959 194 562
FFs 579 1150 252 749
LUTs 589 1226 136 558
DSPs 0 0 4 16
Chapter 5
Dynamic Constant Optimisation
5.1 Motivation
The previous chapter discusses a general architecture and an automated procedure to generate Explicit
Finite Difference (EFD) option payoff evaluators. Although the EFD method is widely used, its
computational complexity grows quadratically with accuracy.
Reconﬁgurable hardware such as FPGAs can be used to accelerate this process effectively in an
energy efﬁcient manner [26]. However, the work described in the previous chapter has yet to address
two important advantages of FPGAs over conventional processors such as CPUs: customisability and
reconﬁgurability.
In this chapter we discuss a new method based on dynamic reconﬁguration to accelerate stencil com-
putation by adjusting the underlying numerical procedure. Using EFD option payoff evaluation as
an example, the method is able to identify coefﬁcients for constant multipliers, such that both the
amount of required hardware resources per kernel, as well as the amount of computation required
per problem are reduced signiﬁcantly. The method presented in this chapter is general enough to be
applied to other stencil computations, such as Reverse Time Migration [117]. The main contributions
of this chapter are:
• An optimisation approach for stencil computation by exploiting run-time reconﬁguration of
109
110 Chapter 5. Dynamic Constant Optimisation
constants; this is supported by an analytical approach to estimate the performance gain using
this approach (Section 5.3).
• On top of the optimisation approach, a novel methodology to reduce the reconﬁgurable area up-
per bound and the amount of computation per problem, based on option parameters normalising
and statistical moments matching; while preserving result accuracy (Section 5.4).
• A workﬂow realising the methodology with two approaches: one targets to minimise hardware
resource utilisation, while the other targets to minimise the amount of computation required in
the algorithm (Section 5.4).
• An experimental evaluation of the methodology based on the workﬂow, measuring the amount
of hardware resource consumed, the amount of computation required and relative result error in
a ﬁnancial option pricing application, showing that the execution time of an optimised dynamic
design is 50% faster than the unoptimised dynamic design for a given accuracy (Section 5.5).
5.2 Explicit Finite Difference Method
The ﬁnite difference method is one type of stencil computation, and it has been discussed extensively
in Section 4.2.3. In this section we recap the aspects of the EFD method relevant to this chapter.
The Black Scholes PDE with one variable (asset) following geometric Brownian motion has the fol-
lowing form:
∂v
∂t
+ rS
∂v
∂S
+
1
2
σ2S2
∂2v
∂S2
= rv (5.1)
where v denotes the price of the option, S denotes the value of the underlying asset, t denotes a par-
ticular time, r is the risk-free interest rate, σ is the volatility of the underlying asset. For convenience
we deﬁne the option describer κ ≡ (S,K, r, T, σ), where K is the strike price of the option.
Suppose the time to maturity for the option is T . We discretise T by dividing it intoN equally spaced
intervals: Δt = T/N . We then determine an asset prices Smax and Smin for the option and discretise
the asset price space between Smax and Smin into M equally spaced intervals ΔS. Smax and Smin
5.2. Explicit Finite Difference Method 111
?
?
??
???
???
????? ??? ???
????? ???????
?????????
??????????
?
?
??? ??
???????????
???? ??????
?
?
Figure 5.1: A 5 × 6 ﬁnite difference grid, the grey elements illustrate a three-element stencil, with
the value of fi,j depending on three other values in time step i+ 1. Note that (x)+ ≡ max(x, 0)
represents the maximum and minimum asset prices under consideration in the EFD grid. We call N
the number of time steps andM the number of asset price steps,N andM together deﬁne a particular
grid setting for the EFD model. In addition, we deﬁne the number of computational steps (problem
size) to be M ×N .
An efﬁcient approach to compute within a ﬁnite difference grid can be obtained by change of vari-
able [118]. By discretising over Z = lnS instead of S, Equations 5.2 can be obtained, where stencil
coefﬁcients α, β and γ are constants throughout one grid.
α =
1
1 + rΔt
(− Δt
2ΔZ
(r − σ2/2) + Δt
2ΔZ2
σ2)
β =
1
1 + rΔt
(1− Δt
ΔZ2
σ2)
γ =
1
1 + rΔt
(
Δt
2ΔZ
(r − σ2/2) + Δt
2ΔZ2
σ2) (5.2)
The payoff function fi,j can have different forms, for example, Equation 5.3 and Equation 5.4 are
112 Chapter 5. Dynamic Constant Optimisation
the payoff functions for European and American options respectively, where K − jΔS is the early
exercise price of the American option.
vEUi,j = αvi+1,j−1 + βvi+1,j + γvi+1,j+1 (5.3)
vAMi,j = max(K − jΔS, vEUi,j ) (5.4)
Figure 5.1 shows how the EFD method updates nodes in a 5 × 6 grid. The grey elements illustrate
a three-element stencil, with the value of fi,j depending on three other values in time step i + 1.
The actual stencil computations are deﬁned by the payoff functions as shown in Equation 5.3 and
Equation 5.4.
As an initial condition, the values for nodes in the rightmost column of Figure 5.1 is calculated by
max(K − SN+1,j, 0) and the algorithm runs from right to left. A parallel architecture has been devel-
oped which harnesses two types of parallelism in the EFD procedure: First, coarse-grain parallelism,
which refers to the concurrent pricing of multiple options. Second, ﬁne-grain parallelism, which
refers to the simultaneous calculation of values on the grid for one option. The hardware architecture
is illustrated in Figure 5.2. The Main Controller performs overall control and facilitates communica-
tion with a host software application. The Coarse Core is the main processor for pricing one option,
and there can be several of these cores pricing multiple options simultaneously. Each Coarse Core
consists of one or more Fine Cores. Fine Cores are fully pipelined and calculate the value of the
present node based on three previously calculated options values as illustrated in Figure 5.1. Since
many of these calculations can be performed in parallel, we can assemble multiple Fine Cores into a
more powerful Coarse Core. The overall number of deployed cores depends on the available hardware
resources on the target device. The Memory Controller provides access to double-buffered memory
in order to fully utilise the pipeline in the Fine Cores. The Initialiser initialises the memory mod-
ule by setting up the initial option prices, and the Finaliser provides the results to the software host
application.
Figure 5.3 illustrates the structure of a Fine Core for calculating American options according to Equa-
tion 5.4. For each vi,j evaluation, it calculates one grid value based on three previous grid values and
5.2. Explicit Finite Difference Method 113
?????
????
???
?????
????
?????
????
???????????
?????
????
???
????
????
?????
????
??????????????????????
????
??????????
?????????
???????????
???
??????????
??????
????????
??????
??????
??????
???
???
???
??? ?? ????
??????????
?????? ?
??????? ???
?????????
????????
??????????????
? ???????
????????
????????????? ????????? ???????????
Figure 5.2: Generic architecture for computing the ﬁnite difference model, this Figure also appears in
Figure 4.6.
the three coefﬁcients. Note that the coefﬁcients remain the same throughout the option valuation pro-
cess. The Fine Cores are generated by a C-based domain speciﬁc language and the architecture is
generic enough to handle any option pricing problem with one underlying asset.
In order to organise the computational efﬁciently we note the following:
• The coefﬁcients α, β and γ are constant throughout the computation of one option.
• We can develop a specialised circuit for the option where these coefﬁcients are hard-coded,
resulting in higher performance and lower area requirements.
• Some coefﬁcient combinations result in smaller and faster circuits, further enhancing perfor-
mance and area beneﬁts gain from a specialised circuit.
• Algorithm level tuning can yield better result convergence, reducing the amount of computa-
tional steps required for a given accuracy requirement, hence improving computational efﬁ-
ciency.
To improve the performance of the EFD architecture described in the previous chapter, dynamic
constant reconﬁguration technique is used. The technique is based on constant specialisation, also
known as partial evaluation, which improves performance and reduces size of hardware designs [119].
114 Chapter 5. Dynamic Constant Optimisation
?????????
???????? ?????? ????????
????
????? ? ?
Figure 5.3: Hardware architecture of the block Fine Core in Figure 5.2 for pricing American Options.
The coefﬁcients α, β and γ are constant for a particular option to be priced. Solid black boxes denote
registers.
A slowly changing input to the design is assumed to be constant and the hardware design is optimised
for this particular constant, often resulting in a faster and smaller circuit. When the input changes,
the circuit is reconﬁgured. This approach is used in various applications such as encryption [120] or
FIR ﬁltering [121]. However, the reconﬁguration time itself imposes an overhead on performance and
needs to be balanced against the speedup obtained through constant specialisation. We shall address
this issue in Section 5.3.2.
The efﬁciency of the computation also depends on the convergence of the algorithm. The convergence
of the EFD method is deﬁned as the rate at which the difference between the true solution of the PDE
and the discretised solution decreases with the number of computational steps [61]. The discretisation
error is caused by transforming the model from a continuous mathematical space to a discretised com-
putational space; and ﬁnite precision error is caused by using number representations of insufﬁcient
accuracy. For convenience we deﬁne an elastic factor λ = ΔZ/Δt. If the number of time steps N is
given, the convergence of the EFD method can only increase by decreasing λ (increasing the number
of asset price steps M ) and vice versa.
5.3. Dynamic Constant Specialisation 115
5.3 Dynamic Constant Specialisation
It is proposed that three key steps should be followed in optimising high performance designs in
reconﬁgurable hardware from software implementations:
• utilising custom data formats to reduce area while preserving sufﬁcient result accuracy;
• using constant specialisation and reconﬁguration to further reduce area and energy consump-
tion;
• adopting carefully selected constants to reduce the upper bound of reconﬁgurable area.
Each step yields a higher performance than the previous step mainly due to higher level of parallelism
achievable given a ﬁxed amount of hardware resource. Recent work focuses on applying the ﬁrst step
and trying to reduce the ﬂoating point data-width aggressively, in order to trade result accuracy for
lower hardware resource consumption. For example, using a low precision datapath, higher perfor-
mance can be achieved by increasing sampling points in a numerical integration routine [48]; based
on statistical properties of the Monte Carlo method, low precision datapaths are adopted without af-
fecting result accuracy [25]. In this section we focus on the second step; and in Section 5.4 we discuss
the third step.
The second step mentioned above has been addressed previously by [92], in which the constant spe-
cialisation is applied to software-deﬁned radio. Without loss of generality, in the case of the EFD
stencil computation, the key concept of the optimisation approach is to specialise the circuit for
the coefﬁcients α, β and γ in Equation 5.4. In the previous chapter these coefﬁcients were loaded
into registers as illustrated in Figure 5.2. We can exploit the fact that these coefﬁcients stay con-
stant throughout the problem to price a particular option; therefore a specialised circuit with ﬁxed-
coefﬁcient multipliers can be created for that particular instance of the problem. This is beneﬁcial
since such ﬁxed-coefﬁcient multipliers can provide higher performance while requiring less area and
power. When the pricing of one instance of an option is completed, the circuit is reconﬁgured with a
specialised version for the next instance.
116 Chapter 5. Dynamic Constant Optimisation
??????????????
?????????
???????? ?????? ????????
????
????
???????????????
????
? ? ?
Figure 5.4: Dynamic EFD kernel for American option pricing. The reconﬁgurable area is marked in
grey.
5.3.1 Specialisation Approach
A dynamic EFD kernel is shown in Figure 5.4, with the replacement of the generic multipliers to the
constant specialised multipliers.
Specialising a multiplier for a constant input can lead to higher performance because the critical path
is shortened and higher clock frequency is achieved. In addition, area is reduced because redundant
logic can be removed. In parallel applications such as the ones outlined in this thesis, we can turn the
area reduction into an additional performance increase:
• Smaller cores imply higher parallelism on the same device.
• More cores result in higher overall processing speed.
We can explore the beneﬁt of constant reconﬁguration by implementing a range of designs and mea-
suring their performance. However, in many cases it is desirable to make performance predictions
without completing the lengthy implementation process for the entire design. In order to compare the
5.3. Dynamic Constant Specialisation 117
performance of static and reconﬁgurable designs, we propose a simple analytical approach. We begin
by calculating the execution time Ts for pricing one option in a static design:
Ts =
ns · ts
ps
(5.5)
where ns is the number of data elements (computational steps) to be processed in a problem for a
static design, ts is the cycle time, and ps is the number of processing elements (kernels) in the static
design. Equation 5.5 do not consider the delay through pipeline stages, as it is negligible since ns is
signiﬁcantly larger than the number of pipeline stages, this is discussed in more detail in Section 4.3.3.
Likewise, we can calculate the execution time Td for the dynamic reconﬁgurable design:
Td =
nd · td
pd
+ tr (5.6)
where nd, td and pd denote the same parameters as above for the dynamic design, and tr denotes the
reconﬁguration time. For this section only we consider the number of data elements to be processed
is the same for static and dynamic designs since no algorithm level optimisation is applied; hence
ns = nd = n. In order to achieve an overall performance beneﬁt, the execution time of the dynamic
design must be lower than the static design:
Td < Ts
n · td
pd
+ tr <
n · ts
ps
tr < n ·
(
ts
ps
− td
pd
)
(5.7)
Equation 5.7 shows that the decision of whether to adopt the dynamic design hinges on the reconﬁg-
uration time, tr. There are two choices for the dynamic design with different trade-offs:
• Full reconﬁguration, involving reconﬁguration of the entire device. This is relatively simple to
118 Chapter 5. Dynamic Constant Optimisation
implement since successive conﬁgurations can be completely independent of each other. How-
ever, the reconﬁguration time can be long for large devices since the entire chip is reconﬁgured.
• Partial reconﬁguration, involving reconﬁguration of only the parts that need to be changed. The
reconﬁguration time is proportional to the amount of FPGA area to be conﬁgured. However,
the design is more complex since the FPGA has to be specially partitioned and ﬂoorplanned for
reconﬁgurable areas.
In both cases above, since specialised operators are smaller and faster than the corresponding general-
purpose ones, the dynamic design is faster overall because it has a shorter cycle time and can support
more processing elements than the static design. Given the area Adesign that is available for the design
under optimisation (i.e. the full device for full reconﬁguration or a partially reconﬁgurable area for
partial reconﬁguration), the reconﬁguration time is given by:
tr = φ · θ · Adesign (5.8)
where φ is the throughput of the conﬁguration interface and θ is the conﬁguration size per unit of
area. Both values can be obtained from the device data sheet.
The number of processing cores p that can be implemented in the static and dynamic design versions
are:
ps = 
Adesign/As
pd = 
Adesign/Ad
where As and Ad are the area requirements of a static and a dynamic core respectively. With equa-
tions 5.7 and 5.8 we obtain:
1 <
n
tr
·
(
ts

Adesign/As −
td

Adesign/Ad
)
(5.9)
5.3. Dynamic Constant Specialisation 119
If the condition in the above equation is true, then the dynamic design is faster than the static one. The
following section explains how these parameters can be derived for European and American option
pricing applications.
5.3.2 Performance Analysis
To estimate the performance and area of a static or reconﬁgurable design, the following model is
adopted, given that the arithmetic operators within a core dominate the cycle time and the area. We
estimate the cycle time t and area A in the following way:
• The core cycle time is the maximum of the cycle times of all n arithmetic operators in the core,
i.e. t = max(top,1, . . . top,n)
• The core area is the sum of the area of all n arithmetic operators in the core, i.e. A =∑ni=1Aop,i
In the case of a European option pricing core there are three multiplications and two additions. The
cycle time and area for the static version can be estimated as follows:
ts = max(tmult,s, tadd,s)
As = 3 · Amult,s + 2 · Aadd,s
To estimate the performance and area of the dynamic version, specialised versions for all operators
that have constant inputs are built. In this case, only the multipliers can be specialised. Hence, the
cycle time and area are:
td = max(tmult,d, tadd,s)
Ad = 3 · Amult,d + 2 · Aadd,s
120 Chapter 5. Dynamic Constant Optimisation
We can perform the same optimisations for our American option pricing core (Figure 5.3):
ts = max(tmult,s, tadd,s, tmax,s)
As = 3 · Amult,s + 2 · Aadd,s + Amax,s
td = max(tmult,d, tadd,s, tmax,s)
Ad = 3 · Amult,d + 2 · Aadd,s + Amax,s
To evaluate a design according to Equation 5.9, the following steps need to be carried out:
1. Build static versions of all arithmetic operators found in the core, estimate ts and As.
2. Build dynamic versions of operators with ﬁxed inputs, estimate td and Ad.
3. Determine n from application speciﬁcation.
4. Determine available design area Adesign . For full reconﬁguration, Adesign is the area of the
entire device minus the area required by other control logic. For partial reconﬁguration, Adesign
is simply the area of a reconﬁgurable region that is created by the designer.
5. Determine reconﬁguration time tr for the entire device (full reconﬁguration), or for a recon-
ﬁgurable area (partial reconﬁguration). Reconﬁguration time can be measured or calculated
according to Equation 5.8.
6. Evaluate design according to Equation 5.9. If the condition is true, reconﬁguration is beneﬁcial.
If the above procedure indicates that the dynamic design is beneﬁcial, then we can specialise the
Fine Core with the given coefﬁcient and implement a Coarse Core with pd Fine Cores. Based on this
Coarse Core we can implement the entire design with all the necessary infrastructure as illustrated in
Figure 5.2.
5.4. Optimising Dynamic Constant Specialisation 121
5.4 Optimising Dynamic Constant Specialisation
The previous section discusses how to use constant specialisation and reconﬁguration to reduce area
and energy consumption. In this section we discuss how to further optimise the EFD option pricing
model, based on the result of previous section. One of the main concepts of this optimisation is to
minimise the upper bound hardware resource consumption of the reconﬁgurable area (the grey area in
Figure 5.4), so that higher parallelism is achieved by placing even more dynamic kernels in an FPGA.
5.4.1 General Idea
The idea of this optimisation process is to balance the following two goals: (a) to reduce hardware
resource consumption by carefully setting the EFD grid, so that the constant multipliers used in each
dynamic kernel consume minimal amount of hardware resource; (b) to reduce the number of com-
putational steps required in the EFD grid, so that the result meets the precision requirement without
wasting unnecessary computation.
There are three key steps in the process:
1. normalising the option coefﬁcients so that ﬁxed point datapaths, which consume less hardware
resources, are used instead of ﬂoating point datapaths;
2. identifying the nice constants in ﬁxed point number representation which yield constant mul-
tipliers that consume less hardware resource than the biggest constant multiplier of the same
number format; and adjust the EFD algorithm to make use of the nice constants;
3. ensuring the number of computational steps required in the optimised EFD scheme is not larger
than the original scheme.
The ﬁrst step allows the use of ﬁxed point number representation, by ensuring that the ﬁnite precision
error does not affect the quality of the result. Our previous work concludes that ﬁxed point numbers
should not be used for option pricing, because the range of the inputs is not predictable, and unlike
Monte Carlo simulations, ﬁnite precision error accumulates in the EFD scheme [26]. A normalisation
122 Chapter 5. Dynamic Constant Optimisation
procedure can be used to overcome this. Equation 5.3 and Equation 5.4 have type κ → R, where
κ ≡ (S,K, r, t, σ) describes an option. Although it is possible to deﬁne a range for all ﬁve inputs,
the intermediate variables fi+1,j−1, fi+1,j and fi+1,j+1 can range from zero to strike price K for an
American put option. Since 0 < K < ∞, we need to reserve enough bits for the integer part of the
ﬁxed point representation to avoid integer overﬂow in K. The normalisation procedure is shown in
Equations 5.10 to narrow down the numerical range of the intermediate variables. The option price is
then calculated by f = f ′ ×K, where f ′ is the result from the normalised space.
K ′ = t′ = 1, S ′ =
S
K
, r′ = r × t, σ′ = σ ×√t (5.10)
As a result, the previously unbounded variable f is now bound from zero to one, allowing the use of
ﬁxed point numbers in our design.
As the second step, we try to identify nice constants which yields small ﬁxed point constant multipli-
ers (measured in number of consumed LUTs, compared to the largest constant multiplier in the same
number format). To make the constant multipliers used in the EFD kernel consume fewer LUT re-
source, to one extreme, we would like the stencil coefﬁcients α, β and γ to be in the form 2E (E ∈ Z)
in binary. As a result, the constant multipliers can be implemented by shift operators in ﬁxed point
arithmetic, which consumes less LUTs than those implemented by adders.
Although we would like to make the constant multipliers consume as less LUTs as possible, the
stencil coefﬁcients must satisfy Equations 5.11, which are the ﬁrst two requirements to guarantee the
result convergence [118].
α + β + γ = 1, α > 0, β > 0, γ > 0. (5.11)
In addition, all lattice based numerical procedures must match statistical moments of the random
process which the underlying asset price follows. For stock option pricing, the Black Scholes PDE
is constructed by creating a delta-hedged riskless portfolio, based on a log-normal process which
describes the underlying asset price movement, as shown in Equation 5.12, where dW is a Wiener
5.4. Optimising Dynamic Constant Specialisation 123
process.
d lnS = dZ = (r − σ
2
2
)dt+ σdW. (5.12)
Such process has the following mean and variance over a discretised time period Δt [30]:
μ = (r − 1
2
σ2)Δt, Var = σ2Δt (5.13)
The EFD procedure must match mean μ and variance Var to ensure result convergence [30]. The
speed of convergence is determined by how well the procedures matches higher moments such as
skewness and kurtosis, the higher the convergence speed, the fewer the number of computational
steps is needed to obtain a result of desired accuracy. The trinomial stencil used in the EFD method
has enough degrees of freedom to match one higher moment (skewness or kurtosis) [42], although as
long as the ﬁrst two moments match and Equations 5.11 are satisﬁed, the stencil coefﬁcients α, β and
γ can vary while the EFD result is guaranteed to converge to the true result.
Figure 5.5 illustrates the relationship between Δt, ΔZ and the stencil coefﬁcients in an EFD grid. To
ensure that the statistical moments described in the stencil match the theoretical moments described
in Equations 5.13, the following equality must hold:
μ = μ′, Var = Var ′, (5.14)
μ′ = α× dZ − γ × dZ, (5.15)
Var ′ = α× (dZ − μ′)2 + β × (μ′)2 + γ × (dZ + μ′)2. (5.16)
Variables μ and Var are deﬁned in Equations 5.13. Solving Equations 5.11 for the stencil coefﬁcients
α, β and γ gives us the coefﬁcients in the form of (Δt,ΔZ, κ) → R, similar to those in Equa-
tion 5.2 1. This relationship can be used to explore the stencil coefﬁcients space and ﬁnd smaller
constant multipliers to be used in the dynamic EFD kernel. Figure 5.7 shows one example of the
change in the values of the coefﬁcients α, β and γ against λ = ΔZ/Δt, while Δt stays constant.
It can be seen that these coefﬁcients are all continuous convex functions against λ which increase
1Any mathematical tool with a symbolic solver, such as Mathematica, can be used for this purpose
124 Chapter 5. Dynamic Constant Optimisation
or decrease monotonically within the range [0,1]. The relative error, deﬁned as the absolute dif-
ference between the EFD result and the Black Scholes formula result using double precision arith-
metic (r .e. = |resultEFD − resultBS |), is also shown in Figure 5.7 for each λ. It can be observed that
the relative error increases with the asset price step ΔZ, with exceptions at the left boundary when
one of the coefﬁcient becomes very close to zero and conditions speciﬁed in Equations 5.11 are no
longer valid.
We now discuss the number of computational steps required in the EFD algorithm for the third step.
The EFD method requires the smallest number of computational steps to converge to the true result,
when both the model skewness and kurtosis are close to their theoretical values, one common practice
is to set the EFD grid equivalent to a trinomial tree [30]. However, with a given error tolerance, the
tree-equivalent EFD grid might not be the optimal setting.
Figure 5.6 shows a typical graph of relative error between the Black Scholes formula result and
the EFD result for the same at-the-money European option as in Figure 5.7 using double precision
arithmetic. It can been seen that, the absolute error decreases monotonically as Δt and ΔZ = Δt×λ
becomes smaller. This means, as long as the result accuracy meets the accuracy requirement, it is
possible to reduce the number of computational steps by increasing bothΔt and λ. In addition, since it
has been shown that the continuous (Δt, λ) space has an analytical mapping to the stencil coefﬁcients
space in Figure 5.7, we have deﬁned a valid search space for step two without affecting result accuracy.
It is worth mentioning that, similar to Figure 5.7, the boundary error becomes signiﬁcant when ΔZ
becomes small enough and Equations 5.11 no longer hold.
5.4.2 Optimisation approaches
In this section we discuss two approaches for the optimisation process described in Section 5.4.1.
The ﬁrst approach tries to map directly from hardware to the EFD algorithm. The intuition is to start
from a set of standard constants calculated by Equation 5.2, search the space Δt and ΔZ so that the
constants are as close as possible to the smallest possible constant multipliers, for example (0.25, 0.5,
0.25).
5.4. Optimising Dynamic Constant Specialisation 125
??
??
????? ???????
?????????
???????????????????
??????????
??????????
??
??????????
??????????
Figure 5.5: Relationship between the stencil parameters and Δt and ΔZ. Note that user can vary Δt
and ΔZ to obtain different stencil parameters.
We now describe the algorithm in detail: knowing a set of constants (cα, cβ, cγ) which yield small
constant multipliers, this approach tries to ﬁnd appropriate Δt and ΔZ which makes the stencil coef-
ﬁcients as close to these constants as possible, so as to reduce hardware resource usage by a dynamic
kernel. As a result, a multidimensional minimisation problem is formed, where variables Δt and
ΔZ are the search space and r and σ are problem dependent variables. The target function of the
minimisation problem is shown in Algorithm 2 which returns a utility value u; we minimise u to ﬁnd
appropriate stencil coefﬁcients for the EFD grid. This algorithm implicitly assumes that the amount of
hardware resource consumed by the constant multiplier is closely correlated to the hamming weight
of the constant. In addition, Algorithm 2 requires all the three coefﬁcients to be close to a ‘nicer’ set
of constants, which is not always possible since coefﬁcients α, β and γ are not free variables; in fact,
according to Figure 5.7 the three coefﬁcients has a one-to-one mapping to each other. Therefore the
valid search space is limited. Moreover, since the user has no control over Δt or ΔZ in this case, a
valid set of coefﬁcients might result in extremely small Δt and ΔZ, which may increase the problem
size signiﬁcantly and offset any beneﬁt gain by this optimisation routine.
The second approach is designed to reduce the number of computational steps needed with given
accuracy requirement. The intuition is as the following: starting from the fact that a three point
stencil can only match up to three moments, namely mean, variance and skewness or kurtosis, trying
126 Chapter 5. Dynamic Constant Optimisation
 0 0.001 0.002
 0.003 0.004 0.005
 0.006 0.007 0.008
 0.009 0.01
 0
 20
 40
 60
 80
 100
 120
 140
 160
 180
 0
 0.0005
 0.001
 0.0015
 0.002
 0.0025
 0.003
relative error
dt
λ
 0
 0.0005
 0.001
 0.0015
 0.002
 0.0025
 0.003
Figure 5.6: Relative error against Δt and λ, where λ = ΔZ/Δt. Note that dt and Δt are used
interchangeably here.
to ﬁnd a set of constants with yield the smallest difference in skewness and kurtosis. The set of
constants found in the optimisation process leads to a ﬁnite difference grid with minimal number of
computational steps to get an accurate result. Algorithm 3 achieves this by showing a target function
of a multidimensional minimisation process to match the EFD stencil skewness and kurtosis to the
theoretical values.
Skew ′ =
α× (dZ − μ)3 + β × (−μ)3 + γ × (dZ + μ)3
Var ′3/4
(5.17)
Kurt ′ =
α× (dZ − μ)4 + β × (−μ)4 + γ × (dZ + μ)4
Var ′2
(5.18)
The EFD grid setting calculated by Algorithm 3 guarantees fast convergence of the ﬁnal result, since
the setting makes the grid equivalent to a trinomial tree. However, the algorithm fails to address
5.4. Optimising Dynamic Constant Specialisation 127
Algorithm 2 Simple Target Function
Input: Δt = size of time step, ΔZ = size of asset price step
Constant: r and σ
Output: u = utility value for the optimisation routine
1: Calculate (α, β, γ): use Equation 5.2
2: Obtain the numerical differences εα, εβ and εγ from ideal constants (cα, cβ , cγ): εα/β/γ = α −
cα/β/γ
3: Accumulate penalty p1: to minimise εα, εβ and εγ
4: Accumulate penalty p2: to ensure the constraints (Equations 5.11) are met
5: Calculate utility value u: u = p1 + p2
Algorithm 3 Moment Matching Target Function
Input: Δt = size of time step, ΔZ = size of asset price step
Constant: r and σ
Output: u = utility value for the optimisation routine
1: Calculate (μ,Var , Skew ,Kurt): obtain mean, variance, skewness and kurtosis using the moment-
generating function of a standard normal distribution [122], e.g. μ and Var as in Equations 5.13,
Skew = 0 and Kurt = 3
2: Calculate (α, β, γ): by solving Equation 5.14
3: Calculate Skew ′,Kurt ′: the skewness and kurtosis of the stencil using the standard statistical
approach based on data points (fi+1,j−1, fi+1,j, fi+1,j+1) and their corresponding probabilities
(α, β, γ), an example is shown in Equation 5.17 and Equation 5.18
4: Calculate the difference between theoretical and stencil third and fourth order moments
εSkew , εKurt : εSkew = |Skew − Skew ′|, εKurt = |Kurt −Kurt ′|
5: Calculate u: based on εSkew , εKurt and α, β, γ
128 Chapter 5. Dynamic Constant Optimisation
 0
 0.2
 0.4
 0.6
 0.8
 1
 15  20  25  30  35  40
 0
 0.0005
 0.001
 0.0015
 0.002
 0.0025
 0.003
C
oe
ffi
ci
en
ts
 V
al
ue
s
R
el
at
iv
e 
E
rr
or
λ
α
β
γ
Relative Error
Trinomial Tree Setting
Figure 5.7: The relationship between values of the coefﬁcients {α, β, γ} and λ (left y-axis), where
λ = ΔZ/Δt; and the corresponding relative error between the EFD result and the Black Scholes
formula result (right y-axis). Note that Relative Error = |EFD Result−Black Scholes Result|;
and in this case the European option has the following parameters: S = 70, K = 70, t = 3.0, r =
0.05, σ = 0.9,Δt = 0.01.
the hardware resource consumption of the constant multipliers. To overcome this limitation, we
deﬁne a resource estimation function res(R) → N, which, given a constant coefﬁcient, returns a
positive integer representing the amount of hardware resource used by the corresponding constant
multiplier. Since each (Δt,ΔZ) combination can be mapped to the stencil coefﬁcients analytically,
it is possible to search in a limited (Δt,ΔZ) space for “nicer” constants such that dt ∈ [Δt− ε1,Δt]
and dZ ∈ [ΔZ−ε2,ΔZ], where ε1 and ε2 are small constants such that both the number of time steps
N and the number of asset price stepsM in the EFD grid stays the same as the original tree-equivalent
grid setting. Although in our case Δt and ΔZ have a limited range, a local optimal can still be found.
We therefore propose an improved version of the second approach, as shown in Algorithm 4.
To make the best use of the two approaches, we propose a workﬂow to ﬁnd small constant multipliers
for EFD dynamic kernels, while reducing the number of computational steps required with given
accuracy requirement. We assume that an option pricing problem o ∈ κ, the true value of the option
5.4. Optimising Dynamic Constant Specialisation 129
Algorithm 4 Improved Algorithm
Input: Δt = size of time step, ΔZ = size of asset price step
Constant: r, σ, ε1 and ε2
Output: α, β, γ = the coefﬁcients that yield smallest constant multipliers, without affecting the num-
ber of computational steps required for a given accuracy
1: Calculate Δt,ΔZ: using Algorithm 3 to obtain Δt and ΔZ which yield a tree equivalent EFD
scheme
2: Increasing both Δt and ΔZ to ﬁnd an EDF grid setting with less computational steps that meets
the given accuracy requirement
3: Set minimal hardware resource consumption u = ∞
4: for dt=Δt to Δt− ε1 do
5: for dZ=ΔZ to ΔZ − ε2 do
6: Calculate local α′, β′, γ′: using dt and dz
7: Calculate local hardware resource consumption u′: u′ = res(α′) + res(β′) + res(γ′)
8: if u′ < u then
9: α = α′, β = β′, γ = γ′
10: end if
11: end for
12: end for
v ∈ R, an error tolerance level e ∈ R, and a set of stencil coefﬁcients c ∈ R3 that yields small constant
multipliers are known beforehand. The workﬂow comprises the following steps:
(1) Label the ﬁrst approach with index 1. Based on o and c, use Algorithm 2 to ﬁnd ΔZ1 and Δt1;
then derive the stencil coefﬁcients (α1, β1, γ1), the number of time steps N1, and the number of
asset steps M1 for the EFD grid. Calculate the number of computational steps c1 = M1 × N1
in the grid.
(2) Based on results from (1), calculate the EFD option price v1, and ﬁnd the error e1 = v − v1. If
e1 < e, add index 1 to a set s which stores indices of valid approaches.
(3) If e1 < e, use the stencil coefﬁcients from (1) and the resource estimation function, to estimate
the resource consumption per EFD kernel r1 = res(α1) + res(β1) + res(γ1). We have now
obtained indicators c1, e1 and r1 for the ﬁrst approach.
(4) Repeat the ﬁrst three steps but label the base case with index 0 and use Algorithm 3, to obtain
indicators c0, e0 and r0 for the tree-equivalent EFD grid, add index 0 to s if e0 < e.
(5) Similarly, use index 2 and Algorithm 4 to obtain indicators c2, e2 and r2 for the second ap-
proach, then add index 2 to s if e2 < e.
130 Chapter 5. Dynamic Constant Optimisation
(6) Obtain area-computation-time product di = ci × ri, i ∈ s, choose approach i with di =
min(dj|j ∈ s).
It is worth considering the validity condition for applying the optimising approaches, given that the
optimised solution should not be slower than the original one. The execution time Td of the original
dynamic design can be calculated as:
Td =
cd × td
pd
+ tr,d (5.19)
where cd is the number of computational steps, td is the cycle time, pd is the number of processing
elements (kernels) in the dynamic design, and tr,d is the reconﬁguration time. Similarly, the execution
time Topt of the corresponding optimised design can be calculated as:
Topt =
copt × topt
popt
+ tr,opt + tsearch (5.20)
where copt, topt, popt and tr,opt denote the same parameters as above for the optimised design, and
tsearch denotes the time it takes to search for the better constants. For simplicity we assume the
optimisation process has negligible impact on cycle time, hence td = topt = tclk. To achieve a
performance beneﬁt, the following condition is required:
Topt < Td
copt × tclk
popt
+ tr,opt + tsearch <
cd × tclk
pd
+ tr,d
tsearch < tclk
(
cd
pd
− copt
popt
)
+ (tr,d − tr,opt) (5.21)
In this experiment the search time tsearch is negligible since it is possible to search for the better
constants beforehand for each option and use table lookup to load designs at runtime, just like the
original dynamic design. We therefore have:
0 < tclk
(
cd
pd
− copt
popt
)
+ (tr,d − tr,opt) (5.22)
If cd > copt, pd < popt and tr,d > tr,opt, the validity of the optimisation process is guaranteed. In our
5.5. Results 131
experiment, c depends on N and M , while p and tr depends on the reconﬁgurable area. Therefore, if
the optimisation process reduces both the reconﬁgurable area and N ×M , it will beneﬁt the overall
execution time.
5.5 Results
In this section we examine the effectiveness of the optimising process as discussed in Section 5.3 and
Section 5.4.
For dynamic constant specialisation, the FPGA implementation is based on a double-precision ﬂoating-
point arithmetic software implementation. The design is implemented on the FPGA using the FloPoCo
generator [123]. FloPoCo is an open-source generator for ﬂoating-point and ﬁxed-point arithmetic
cores. After analysing numerical precision and range, a ﬁxed-point version of the algorithm on the
FPGA is developed, with customised ﬁxed-point number format (8 bits integer and 43 bits fraction)
that provide equal precision. The ﬁxed-point version is then specialised for dynamic constant recon-
ﬁguration as described in section 5.3.
For the optimisation of dynamic constant specialisation, the Nelder-Mead multi-variable minimisation
routine in the GNU scientiﬁc library is used to implement the algorithms described in Section 5.4.2.
The result is produced according to the error tolerant level used in industry2, which requires the
difference between the reduced precision result and the double precision result to be smaller than
2E−4; however, the workﬂow can be tuned to accommodate arbitrary error tolerant level by adjusting
the number format. The 23 bit ﬁxed point number format with 1 bit for integer and 22 bit for fraction
is used in our implementations, as it has been used in our previous work [28] and the ﬁnite precision
error in the result is always below 2E − 4 compared to double precision ﬂoating point result for the
EFD computation. The FloPoCo library [123] is used to generate dynamic kernel descriptions with
different stencil coefﬁcients in VHDL. The FloPoCo library is also used as the resource estimation
function in our workﬂow. All optimised dynamic designs found by our workﬂow are compared to the
23 bit ﬁxed point version of unoptimised dynamic designs.
2Private correspondence with J.P. Morgan Chase
132 Chapter 5. Dynamic Constant Optimisation
operator Mult Add Max Mult
type static (8,46) dynamic (8,46)
t [ns] 3.11 2.16 2.17 2.49
A [LUT/FF] 5829 54 81 1015
type static (7,16) dynamic (7,16)
t [ns] 3.02 2.10 2.14 2.45
A [LUT/FF] 1774 17 25 309
Table 5.1: Cycle time and area of arithmetic operators. Area is measured in LUT/ﬂip-ﬂop pairs.
static dynamic (8,46) dynamic (7,16) opt. dynamic
est. real est. real est. real real
t [ns] 3.11 4.76 2.49 4.31 2.45 4.12 4.32
A [LUT/FF] 17676 13759 3234 2977 986 906 710
p 26 34 145 158 477 519 662
Table 5.2: Cycle time, area and number of processing elements for the static and dynamic core.
Shown are the estimated values from our model (est.) and values for a real design.
The ﬁxed point error analysis is based on the MPFR library [108]. All VHDL kernel descriptions are
placed and routed on a Xilinx Virtex-6 XC6VLX760 FPGA using ISE 13.2 implementation tools.
5.5.1 Dynamic Constant Specialisation
We now estimate the performance and area of static and dynamic designs based on the analysis pro-
cedure steps that is outlined in section 5.3.2. As per steps 1 and 2, the required arithmetic operators
are built to measure their cycle time t and area A. The post place-and-route results for the three
static operators as well as one specialised, dynamic multiplier are shown in Table 5.1. We use these
operators to estimate the performance and area of static and dynamic versions of an American option
pricing core as shown in Figure 5.3. Table 5.2 lists the estimated cycle time and area. For illustrative
purposes, the numbers are compared against values that are obtained from a real implementations of
the option pricing core.
In step 3 we determine the number of computational steps n that need to be processed. Our option
pricing application is based on 2.2k × 7.3k grids which means that a total of 1.6 · 107 computations
per grid need to be performed.
5.5. Results 133
full reconﬁguration partial reconﬁguration
1 Coarse Core 8 Coarse Cores
conﬁguration SW host fast fast internal [124]
mechanism application external
conﬁg. speed 14.4 MB/s 200 MB/s 300 MB/s
tr [ms] 1600 115 9
speedup 0.001 0.01 1.5
Table 5.3: Estimated speedup of the dynamic design using (8,46) ﬁxed point arithmetic. The estimates
are based on various conﬁguration times tr.
To calculate the available design area Adesign (step 4) we ﬁrst consider full reconﬁguration where
the entire device contains only one Coarse Core, composed of as many Fine Cores as possible. We
obtain the total number of logic resources from the device data sheet and subtract the logic resources
for control shown in Figure 5.2. With Adesign we can calculate p, the number of Fine Cores that the
device can support (Table 5.2).
When reconﬁguring the system from a software host application, the measured reconﬁguration time
tr is 1.6 s (step 5) and with this Equation 5.9 (step 6) can be evaluated. For the given parameters,
Equation 5.9 is not true which indicates that reconﬁguration will not be beneﬁcial. Table 5.3 lists the
estimated speedup of the dynamic design over the static one. It can be seen that the dynamic design
is clearly hampered by the long reconﬁguration time. The speedup for the case when the device is
reconﬁgured with the maximum external conﬁguration speed of 200MB/s is also estimated as shown
in Table 5.3. This results in a conﬁguration time of 115ms; however, this is still slower than the static
design.
To improve reconﬁguration speed we now explore several design options using partial reconﬁgura-
tion. As explained in Section 5.2, given a ﬁxed number of Fine Cores, the design can be scaled by
implementing more Coarse Cores, while reducing the number of Fine Cores in each Coarse Core. This
reduces the size of each Coarse Core while the overall design size remains constant. Coarse Cores
can be placed in partially reconﬁgurable areas which can be reconﬁgured independently. Hence, an
increase in the number of Coarse Cores leads to smaller reconﬁgurable areas and faster reconﬁgura-
tion.
The test is set in a way such that there are always enough option payoff evaluation queries to keep
134 Chapter 5. Dynamic Constant Optimisation
all Coarse Cores busy. For example, if the target design has a maximum of 16 Coarse Cores, there
must be at least 16 queries so that all the Coarse Cores are busy and at least one reconﬁguration
happen per Coarse Core during the execution. For convenience it is assumed that each Coarse Core
contains the same number of Fine Cores. As a result the number of Fine Cores per Coarse Core
is conﬁgured as pfine = p/pcoarse where pcoarse is the number of coarse cores in the design. Since
run-time reconﬁguration is not currently supported by the hardware in use, the reconﬁguration time
tr is estimated by the conﬁguration speed shown in Table 5.3. Table 5.3 also shows a design with 8
Coarse Cores that can be partially and independently reconﬁgured. A reconﬁguration time of 9ms
can be obtained with a fast, internal reconﬁguration mechanism that provides a conﬁguration speed
of 300MB/s [124]. With the signiﬁcantly reduced reconﬁguration time, our estimation indicates that
an overall speedup can be achieved.
Figure 5.8 shows performance estimates for designs with the number of partially reconﬁgurable
Coarse Cores ranging from 1 to 16. The estimates that are based on single operators (Table 5.1)
are compared to results from actual kernel implementations. We can observe that for a larger number
of Coarse Cores, the dynamic designs can signiﬁcantly reduce execution times. In the case of 16
Coarse Cores, if the (8,46) ﬁxed point number format is used, which is shown as Dynamic1 in the
ﬁgure, the execution time for pricing one option is reduced from 38.3ms to 11.7ms which represents
a speedup by a factor of 3.3; if the (7,16) ﬁxed point number format is used, which is shown as Dy-
namic2 in the ﬁgure, a speedup of 5.6 times is achieved. It can also be seen that the estimations from
the proposed analytics based on single operators can deliver results in sufﬁcient accuracy.
5.5.2 Optimising Dynamic Constant Specialisation
We ﬁrst give an example for choosing between the two approaches in our workﬂow shown in Sec-
tion 5.4.2, in terms of relative error, hardware resource consumption and number of computational
steps. It is assumed that the number of time steps (N ) is provided by the user and therefore dt is ﬁxed.
The results and a static double precision EFD kernel implementation reference are presented in Ta-
ble 5.4. The relative error is obtained by comparing the reduced precision results from the four designs
to the double precision result from the Black Scholes formula. It can be seen that although the ﬁrst
5.5. Results 135
???????
???????
???????
??????
???????
???????
???????
???????
??????
???????
????????
????????
?????
???????
???????
???????
???????
???????
????
?????
??????
???????
????????
? ? ? ? ??
??
??
??
??
??
???
???
??
??
???
??
???
??
??
?????????????????
??????????
??????????
????????????
???????????
??????????????
?????????????
??????????????
?????????????
Figure 5.8: Estimated and real performance of the static and reconﬁgurable American option pricing
design for various numbers of coarse cores. Dynamic1 is based on (8,46) ﬁxed point numbers; Dy-
namic2 is based on (7,16) ﬁxed point numbers. Note that the estimated performance is calculated by
the method proposed in Section 5.3, and real performance is based on place and route results.
approach has the smallest relative error, it requires over 10 times more computational steps compared
to the second approach, since the ﬁrst approach requires a two dimensional search space (Δt,ΔZ),
we have no control over the total number of grid points or result precision. The LUTs result in Ta-
ble 5.4 shows the place and route result of the kernels under consideration; it is clear that the hamming
weight is not directly correlated to the amount of LUT resource per dynamic kernel, since the number
of LUT consumed by the kernel generated using the ﬁrst approach is larger than the original kernel.
On the other hand, the second approach can reduce the amount of hardware resource consumption as
well as the number of computational steps, compared to the original EFD kernel; it also provides the
same level of accuracy compared to the original scheme. Therefore the dynamic kernel generated by
our second approach is chosen for our workﬂow.
We now analyse the upper bound of reconﬁgurable area of the dynamic EFD kernels found by our
workﬂow, and try to prove that it can reduce the required amount of preserved reconﬁguration area, as
well as reducing the number of computational steps. One thousand American options are randomly
generated with different parameters covering the types of parameters observed in the market; and
136 Chapter 5. Dynamic Constant Optimisation
Rel. Err. LUTs N M Comp. Steps
Double Static 1.55E-4 13759 365 420 1.36E5
Original 23bit Dynamic 1.52E-4 732 365 420 1.36E5
First Approach 1.2E-7 881 1146 1818 2.08E6
Second Approach 1.41E-4 603 365 324 1.05E5
Table 5.4: Comparison between the two approaches discussed in Section 5.4.2. The target European
option has the following κ: (70, 70, 0.05, 1.0, 0.3). Note that the LUTs result is based on place and
route results
each is solved by an EFD grid with N = 365, which is a common industry setting for options with
time-to-maturity of one year under daily observation.
In the original tree-equivalent EFD grid setting, M is a constant equal to 420; and in our workﬂowM
is determined by the error tolerance. All EFD results are compared to the standard double precision
EFD result with N = 2000, which we consider to be the true result, to make sure the error tolerance
is not exceeded. To minimise the noise from discretisation error, we assume that all options are at the
money so that the strike price lies exactly on the grid. Figure 5.9 shows the comparison of LUT usage
per kernel between the original and optimised dynamic kernels. It can be seen that the optimised
kernel always consume fewer LUTs than the original unoptimised kernels. The maximum LUT usage
has been reduced from 906 to 710, indicating a reduction of 22%. Since N is ﬁxed, the average
number of computational steps depends only on M . Our results indicate that we have reduced the
average of M from 420 to 324, which is a 23% reduction. As a result, the area-computation-time
product is reduced by 40%.
The experiment as discussed in Section5.5.1 is repeated applying this optimisation procedure to eval-
uate its effectiveness. The execution time of optimised dynamic designs are compared to those of
original dynamic designs (Dynamic2 in (7,16) ﬁxed point number format). The result is shown in
Figure 5.10. It can be seen that the optimised dynamic designs are up to 50% faster than the unop-
timised ones in execution time. As expected, the fastest speed is achieved when the 64 Coarse Core
conﬁguration is used.
In reality it is not feasible to run the optimisation procedure before the dynamic reconﬁguration,
therefore an adaptive look up table based approach should be adopted. From Equations 5.13 and 5.14
we know the problem space is deﬁned by (r, σ,Δt,ΔZ), it is independent of S, K and t. We know
5.6. Summary 137
 0
 100
 200
 300
 400
 500
 600
 700
 800
 900
 1000
 0  100  200  300  400  500  600  700  800  900  1000
N
um
 L
U
Ts
 p
er
 E
FD
 K
er
ne
l
Option ID
Original, Max=906
Optimised, Max=710
Difference
Figure 5.9: Comparison of LUT usage per kernel between the standard and optimised EFD kernel,
based on place and route results.
that the interest rate is relatively stable with limited number of variations, and Δt and ΔZ are the
start values of the optimisation which are usually speciﬁed beforehand therefore have limited number
of variation. The only unbounded variable is σ. It is possible to build a look up table containing
all common combination of (r, σ,Δt,ΔZ), and their pre-calculated constant sets. At run time, if a
combination is found, the run time dynamic conﬁguration routine is activated; otherwise the procedure
automatically falls back to a static routine and at the same time call the optimisation in the background,
so that the new combination is known at the next time. In such a look up table is maintained to be
more and more complete, and eventually the fall back static routine will not be needed.
5.6 Summary
This chapter demonstrates a new method based on dynamic reconﬁguration to accelerate stencil com-
putation by adjusting the underlying numerical procedure. Using EFD option payoff evaluation as an
138 Chapter 5. Dynamic Constant Optimisation
????? ????? ?????
?????
??????
??????
??????
?????
??????
??????
??????
??????
??????
??????
????
?????
??????
???????
????????
? ? ? ? ?? ?? ??
??
??
??
??
???
??
???
??
??
???
??
???
??
??
?????????????????
???????
????????
????????????
Figure 5.10: Performance comparison of the optimised and unoptimised American option pricing
design for various numbers of coarse cores. Note that this is based on place and route results.
example, the method is able to identify coefﬁcients for constant multipliers, such that both the amount
of required hardware resources per kernel, and the amount of computation required per problem are
reduced signiﬁcantly. The method is general enough to be applied to other stencil computations, such
as Reverse Time Migration [117].
The method involves dynamic reconﬁguration of constants to reduce hardware resource consumption.
It supports both full and partial run-time reconﬁguration, and the proposed analytical treatment allows
designers to quickly evaluate the conditions for which the proposed approach would be beneﬁcial.
In addition to that, the method optimises stencil computation from the algorithm level. As a result,
less area is needed to support run-time reconﬁguration, and less computational steps are required in
the numerical procedure to obtain a result with a given error tolerance.
In a case study on a Virtex-6 XC6VLX760 device, it is shown that using area and cycle time param-
eters from arithmetic operators is a viable technique for estimating design performance. Although a
design using full reconﬁguration is not beneﬁcial due to the long reconﬁguration time, with partial
5.6. Summary 139
reconﬁguration, we can achieve a 5.6 times speedup over a static design without stencil computation
optimisation.
To study the effectiveness of stencil computation optimisation, in a case study on the same Virtex-
6 XC6VLX760 FPGA, one thousand randomly generated 23 bit ﬁxed point implementations after
optimisation are compared to those before optimisation. The results show the upper bound of recon-
ﬁguration area is reduced by 22%; the average number of computational steps is reduced by 23%; and
the area-computation-time product is reduced by 40%; while the numerical errors of the results are
kept below the error tolerant level used in industry. In terms of execution time, the optimised dynamic
design is 50% faster than the unoptimised dynamic design.
Chapter 6
Conclusion and Future Work
6.1 Summary of Contributions
This research has contributed to ﬁnancial computing for reconﬁgurable hardware and fulﬁlled the
main objectives described in Chapter 1 (Customisation in multiple levels, Architectural generalisation
for ﬁnancial payoff valuation, and Design optimisation based on customisability and reconﬁgurabil-
ity). There are the three main achievements:
• Firstly, an application independent Monte Carlo framework for interest rate derivatives payoff
evaluations based on the HJM model is illustrated. By identifying three levels of functional
specialisations in the model, a manually optimised component, a templated component and a
user programmable component are utilised in the framework to form a complete architecture,
so as to retain good performance and support multiple applications. In addition, a process is
introduced to identify the optimal reduced precision data representation in order to ensure better
utilisation of hardware resource without affecting the result accuracy. The generated FPGA
implementations have signiﬁcant computational speedups and energy savings, compared to the
corresponding GPU and CPU designs.
• Secondly, a new method is proposed for accelerating option payoff evaluations based on ﬁnite
difference method automatically. Parallelism in the computation is exploited in two levels, intra
140
6.1. Summary of Contributions 141
and inter option evaluations. The designs are generated from a high level description; they are
highly pipelined and are capable of evaluate multiple options concurrently. The results show
signiﬁcant computational speedup and energy saving, when comparing reconﬁgurable designs
to GPU and CPU designs.
• Thirdly, a novel method to accelerate stencil computation by optimally adjusting the underlying
numerical procedure is proposed, which exploits the reconﬁgurability of reconﬁgurable hard-
ware. Based on dynamic reconﬁguration, the method is able to identify optimal coefﬁcients for
constant multipliers, such that both the amount of required hardware resources per kernel, and
the amount of computation required per problem are reduced signiﬁcantly. The results show
signiﬁcant speedup is achieved by comparing dynamic designs to the corresponding static de-
signs.
Table 6.1: Summary of the key results
Research Area Key results
Multi-level
Customisation
(Chapter 3)
• An application independent Monte Carlo framework for interest rate derivatives
payoff evaluations based on the HJM model. The automatically generated FPGA
implementations are generally over 10 times faster and over 20 times more energy
efﬁcient than a four core CPU.
• A process for the FPGA platform to identify the optimal reduced precision data
representation to ensure better utilisation of hardware resource without affecting
result accuracy. The result shows that, by adopting optimal number representation
in the datapath, the memory resource usage is reduced by 15 to 20 times, allowing
better utilisation of logic resource while retaining result accuracy.
Explicit Finite Difference
Method
(Chapter 4)
• The FPGA designs under the proposed novel parallel architectures using Explicit
Finite Difference method are over 28 times and 1.2 times faster than comparable
CPU and GPU designs.
• The FPGA designs are over 100 times and 17 times more energy efﬁcient than
comparable CPU and GPU implementations.
Dynamic Constant Opti-
misation
(Chapter 5)
• a new method based on dynamic reconﬁguration to accelerate stencil computation
by optimally adjusting the underlying numerical procedure. By using dynamic
constant reconﬁguration only, a speedup of 5 times is achieved comparing the
unoptimised dynamic design to the static design.
• By optimally adjust the stencil of the example EFD procedure, both the hardware
resource consumption per kernel and the amount of computation per problem are
reduced, with given accuracy requirement. The optimised dynamic design is 50%
faster than the unoptimised dynamic design.
A summary of the key results in each chapter is shown in Table 6.1. These results are important
in ﬁnance computations for reconﬁgurable hardware as they provide detailed architectural design
142 Chapter 6. Conclusion and Future Work
insights as well as performance and energy efﬁciency proﬁles for various computational platforms
such as FPGAs, GPUs and CPUs. These are useful to the industry for making strategic decisions
on designing and creating derivatives payoff evaluation frameworks based on hardware accelerators.
In addition, the architectures, generalisation and optimisation methodologies presented in this thesis
can be applied to industry level payoff evaluation infrastructures to increase performance and reduce
energy consumption.
6.2 Impacts
This thesis leads to three main impacts. Firstly, designing reconﬁgurable hardware architectures based
on multi-level optimisation provides both generality and performance which are demanded from the
ﬁnance industry for adopting reconﬁgurable hardware to its already-complex infrastructure. Secondly,
the general purpose ﬁnite difference and Monte Carlo architectures described in this thesis provide
viable solutions to fulﬁll the high computational capacity and high energy efﬁciency demanded by
the industry. Lastly, the design optimisation methodologies proposed in this thesis, including optimi-
sations based on reduced precision data representation and dynamic constant reconﬁguration, provide
novel perspectives in accelerating algorithms using reconﬁgurable hardware.
6.2.1 Fulﬁll Computational capacity and Energy Efﬁciency Demand from the
Finance Industry
The companies in ﬁnance industry seek innovations continuously in order to maintain their competi-
tive advantage. Before 2008 the ﬁnance industry seeks innovation in the complexity of its products.
Ever since the ﬁnancial crisis in 2008, the ﬁnancial companies have been demanding even higher
computational power to understand their risk position in the market, and to meet regulatory require-
ments. In either case it is challenging to accelerate ﬁnance computations in order to meet the business
requirement. The risk and scenario analysis banks run to understand their risk exposure in their port-
folios are so computationally intensive that they need dedicated computer farms with thousands of
6.2. Impacts 143
CPUs to run overnight. A study estimated that 18,118 billion kWh of electricity was consumed by
41 million servers in data centers alone on this planet in year 2010, including cooling and power
distribution losses [5], with each server consuming 442,000 kWh yearly. This means that each server
equivalently emits 232 tons of carbon dioxide per year [125]. With the number of CPU cores dou-
bling everything year in large investment institutions1, increasing computational capacity by installing
more traditional servers is no longer ﬁnancially or environmentally sustainable. It is therefore very
important to set our goal to do computations in an energy efﬁcient manner.
In Chapter 3, it is shown that our general purpose HJM based design on reconﬁgurable hardware
with multi-level optimisation and reduced precision optimisation is over 10 times faster and 20 times
more energy efﬁcient than highly optimised implementations on a comparable CPU. If reconﬁgurable
hardware is used in a computer farm of 1000 servers, it would approximately reduce the carbon
dioxide emission by 220400 tons per year. This would impact the ﬁnance industry signiﬁcantly as
(a) the architecture is general enough to be used realistically in ﬁnance industry (b) the intensive
computational demand can be met in an energy efﬁcient and environmental friendly manner.
6.2.2 Novel Perspectives in Optimising Acceleration Techniques Based on Re-
conﬁgurable Hardware
Since 1959, computational technology has been driven by Moore’s Law [6]. Before 2005, single
core CPUs beneﬁted from Dennard scaling [7] which allows their performance to grow exponentially
every year. This means the same computation will run twice as fast every two years, without changing
the existing software implementation.
After 2005 when Dennard scaling had ended [8], CPUs with multiple cores are designed to gain extra
computational power, as oppose to performance improvement in a single core. However, increasing
the core counts on a chip leads to “dark silicon” [9], which is a phenomenon that occurs when the
power density of a chip is too high to allow all its components to be powered at once. Due to this
limitation, it is predicted that the processors in 2024 will be about 7.9 times faster than the processors
1Private correspondence with J.P. Morgan.
144 Chapter 6. Conclusion and Future Work
in 2008, far away from the 32 times prediction according to Moore’s law. These facts have driven the
computational technology to become more efﬁcient and diversiﬁed, with each component optimised
for its task [10].
These facts mean that, computational tasks must be dispatched in a way that ensures optimal utilisa-
tion of the computational technology it is based on. Such optimisation is commonly seen in traditional
high performance computing research, such as [24]. For optimisations based on reconﬁgurable hard-
ware, there are two properties available exclusively: customisability and reconﬁgurability. In this
thesis, precision optimisation described in Chapter 3 and dynamic constant specialisation optimisa-
tion in Chapter 5 make use of the two properties stated above. The effectiveness of the optimisation
techniques are studied experimentally in detail and the optimised designs are compared with the cor-
responding non-optimised designs. In addition, Chapter 5 has shown that extra optimisation can be
done by ﬁne tuning the underlying numerical procedure optimally to make use of the properties of
reconﬁgurable hardware. These novel perspectives are important for designing energy efﬁcient high
performance reconﬁgurable hardware accelerators in the future.
6.2.3 Generality and Performance for Finance Applications
The HiPEAC (European Network of Excellence on High Performance and Embedded Architecture
and Compilation) road map [10] indicates that there is a trade-off between generality and efﬁciency
for computational technologies in general. This is usually true as more work needs to be done per
task by the underlying platform in order to support the extra generality.
However we have shown in Chapter 3 that for a particular domain such as ﬁnance applications for a
category of derivatives, reconﬁgurable hardware can have both generality and performance, by care-
fully partitioning the components in a design and optimising them separately based on the function-
ality of each component. Such optimisation methodology has important impact on ﬁnance industry
that core analytics based on reconﬁgurable hardware can be shared by multiple types of derivatives
easily without compromising performance.
In addition, the concept of multi-level customisation hides low level details from the user where
6.2. Impacts 145
necessary, which allows non-hardware-expert users to make use of the framework based on reconﬁg-
urable hardware. This has impact on ﬁnance industry where managing complexity is very important
to deliver solutions in a timely manner.
6.2.4 Further Impact
In addition to the three direct impacts mentioned above, there are many examples and related works
indicating the performance gap between FPGA and CPU/GPU are getting larger, despite the fact
that FPGAs are more energy efﬁcient. Earlier research show that, FPGAs are usually faster than
comparable CPUs of an order of 10 times and are usually as fast as GPUs. For example, an earlier
design for Quadrature Methods for option payoff valuation shows speedup of 10 times and 2 times
over a comparable CPU core and a GPU respectively [47]. This is because the design optimisations
are focused on the arithmetic level and clock cycle level only. The targets of such optimisation
are to minimise the number of arithmetic operations implemented on hardware so that the hardware
consumption per kernel is reduced, and to have highly pipelined designs so that the hardware does
useful work at each clock cycle.
More recent designs usually show higher speedup. For example, a MC simulation design with mixed
precision optimisation is over 100 times faster than a four core CPU and over 4 times faster than a
comparable GPU [60]. In addition, an FPGA based derivative computations project used in ﬁnance
industry shows a speedup of over 200 times over a comparable CPU core [12]. This is mainly because,
in addition to the arithmetic level and clock cycle level optimisation, the recent methodologies also
make use of customised number representation and dynamic reconﬁguration. These two dimensions
in the design space allow the utilisation of unused precision in data presentation, and unused hardware
resource in the temporal domain. As a result, the hardware resource consumption per kernel are
reduced signiﬁcantly, allowing better parallelism in the design.
This trend is attracting more attention to the reconﬁgurable hardware based research in general, not
limited to ﬁnancial computation. New optimisation techniques are continuously being adopted for
better performance on reconﬁgurable hardware. For example, by making use of the optimal number
146 Chapter 6. Conclusion and Future Work
representation, an FPGA implementation of MCMC algorithms for Bayesian inference indicates a
speedup of more than 200 times and more than 16 times over a comparable CPU and a compara-
ble GPU respectively [126]; based on a dataﬂow engine, a Gaussian Copula Model implementation
design shows a speedup of over 400 times versus a comparable CPU core and over 40 times ver-
sus a comparable MPI solution [127]; and by utilising a hybrid CPU-FPGA algorithm and mixed-
precision arithmetic, an FPGA based global atmospheric equations solver achieved over 100 times
speedup compared to a 6 core CPU and over 4 times speedup than a comparable hybrid CPU-GPU
system [128].
As the performance gap between FPGA and conventional computational devices widens, due to the
improvement of hardware technology and optimisation techniques, together with the inherent low
energy consumption of FPGA devices, it can be seen that in the foreseeable future reconﬁgurable
devices will be more ubiquitous as a computational device, providing both computational power and
energy efﬁciency for different computational purposes.
6.3 Future Work
6.3.1 Finite Difference Method
One of the main numerical methods covered in this thesis is the Finite Difference method. In Chap-
ters 4 and 5 we presented architectures and optimisation methodologies for derivatives payoff eval-
uations based on explicit ﬁnite difference method. In this section we discuss future work related to
ﬁnite difference method.
Higher Dimensional and Higher Order Explicit Finite Difference Method
Higher dimensional explicit ﬁnite difference method can be applied in situations where multiple un-
derlyings exist for a ﬁnance derivative. Due to quadratically increasing memory and computational
requirement, only up to three dimensions can be practically calculated in ﬁnance applications, higher
6.3. Future Work 147
dimensional calculations are usually more efﬁciently calculated by Monte Carlo simulations. Out of
the scope of this thesis, we have studied three dimensional Explicit Finite Difference methods for
ﬁnancial option payoff evaluation on reconﬁgurable hardware [129]. The experimental result shows
a speedup of 9.6 times and energy efﬁciency improvement of 18.6 times over a comparable CPU.
Higher order Explicit Finite Difference methods has smaller truncation error compared to lower order
ones [130], with the cost of extra computation per step. Future work includes applying the optimisa-
tion techniques presented in Chapter 5 to higher order and higher dimension Explicit Finite Difference
methods and study the effectiveness of dynamic constant reconﬁguration to these methods.
Non-explicit Finite Difference Method
Apart from Explicit Finite Difference method, there are other types of popular Finite Difference
methods used in ﬁnance industry such as Implicit [130] and Crank-Nicolson [131]. Instead of using
convolution for each grid point, they require to solve a system of linear equations at each time step.
Since Implicit and Crank-Nicolson methods are unconditionally stable [61], they are often used when
the Explicit method is not stable to use.
Although there is research for solving tridiagonal systems in other platforms such as GPUs [132], pub-
lished research on reconﬁgurable hardware is based on tridiagonal matrix algorithm (TDMA) [133]
which is not scalable. One of the future work is to investigate possible parallelised algorithms and
solve tridiagonal systems in reconﬁgurable hardware in a scalable manner, which can then be used in
ﬁnance applications.
Finite Difference Method beyond Finance Computation
Accelerating Finite Difference method using reconﬁgurable hardware can be applied to ﬁelds beyond
ﬁnance computation. The architectures and methodologies mentioned in Chapters 4 and 5 are not
exclusive to stock based option evaluations, they are also applicable to ﬁelds such as thermodynam-
ics [51], electromagnetism [52] and interest rate derivatives modelling [134]. Research can be done
148 Chapter 6. Conclusion and Future Work
to generalise and extend the architectures and methodologies so that they can be used in different
research ﬁelds.
6.3.2 Monte Carlo Method
Monte Carlo method for options payoff evaluation is covered in Chapter 3. In this section we discuss
future work related to the Monte Carlo method.
Monte Carlo Method beyond Finance Computation
Mote Carlo methods have a wide range of applications beyond ﬁnance computation. Examples in-
clude: Molecular modelling in physics [135], Bayesian models in machine learning [19, 136] and
photohadronic processes modelling in astrophysics [137]. All these applications can beneﬁt from
reconﬁgurable hardware with the simulation core designed in hardware.
The techniques described in Chapter 3 to manage complexity and optimise precision for derivatives
payoff evaluation is also applicable to the above applications.
Monte Carlo Methods with Variance Reduction Techniques
Variance reduction techniques are widely used in practice to ensure fast convergence for Monte
Carlo simulations. Common variance reduction techniques include: low-discrepancy sequence [138],
Brownian Bridge [139], control variate [140], antithetic variates [141], conditional Monte Carlo [142].
Research has shown that certain reconﬁgurable hardware implementations utilising low-discrepancy
sequence and Brownian bridge have signiﬁcant performance advantage over a comparable CPU [143].
Similar techniques can be applied to the Monte Carlo architecture described in Chapter 3, and the
trade-off between design complexity and result convergence can be studied. In addition, it is also
important to study the design of optimal architectures for variance reduction techniques, such as
random number generators for low-discrepancy sequence, and design patterns for Brownian Bridge.
6.3. Future Work 149
6.3.3 Financial Model Calibration
Model calibration plays an important part in the derivatives payoff evaluation process, it ensures that
the model payoff is consistent with market quotations. The calibration of a model is performed by
observing the prices of certain derivatives written on the underlying, and ﬁtting the parameters of the
model, in such a way that it reproduces the observed derivative prices [144].
One important type of model calibration involves constructing an implied volatility surface based on
observable market quotations. The volatility surface can then be used to by local volatility models
such as [145], which is an extension of the Black-Scholes model to make it compatible with the
observed market volatility smiles. Simple models can make use of volatility interpolation techniques
which is based on implicit ﬁnite difference method and can be calculated relatively quickly [146].
In cases which such technique does not exist, for example foreign exchange derivatives, a volatility
surface needs to be constructed based on each individual implied volatility observed from the market
by splining techniques such as [147, 148]. A splined volatility curve at expiry time t, as part of a
volatility surface at a particular tenor T is shown in Figure 6.1, where the entire volatility cube is a
four dimensional hypercube consists of strike price, tenor, time to maturity and volatility.
Constructing the data points in the volatility cube involves root searching techniques such as Newton-
Raphson, as shown in Equation 6.1, where an example of function f is shown in Equation 6.4, which
is the Garman-Kohlhagen formula. In Equation 6.4, fcall is the payoff of a foreign exchange call
option, S is the exchange rate, t is time to maturity of the option, rf rd is the interest rate of the
foreign currency and domestic currency respectively, K is the strike price of the option, and σ is the
volatility.
σn+1 = σn − f(σn)
f ′(σn)
(6.1)
150 Chapter 6. Conclusion and Future Work
??
???
????
?
??
??
??
??
?
????????
????????????
????????????????
?????
????????
???
???
??? ???
??????????????????
?????
?
?
?? ????? ???????????????
???????????????????
??????????
Figure 6.1: A splined volatility curve at expiry time t, as part of a volatility surface at a particular
tenor T . Note that the entire volatility cube is a four dimensional hypercube consists of strike price,
tenor, time to maturity and volatility.
fcall = S ×N(d1)× e−rf t −K ×N(d2)× e−rdt (6.2)
d1 =
ln( S
K
) + (rd − rf + σ22 )t
σ
√
t
(6.3)
d2 = d1 − σ
√
t (6.4)
The hypercube constructing process is computationally intensive and involves variables such as rd,
rf and t which changes less often than σ. Dynamic constant reconﬁguration techniques described in
Chapter 5 can therefore be used to accelerate the ﬁnancial model calibration process.
Bibliography
[1] D. Dufﬁe and N. Garleanu. Risk and valuation of collateralized debt obligations. Financial
Analysts Journal, pages 41–59, 2001.
[2] Marc Sumerlin and Loren M. Katzovitz. Collateralized debt obligations. International Econ-
omy, pages 12–63, 2007.
[3] Jeremy Smith. Volume growth in the fx market is co-operative processing the future?, 2010.
[4] Martin B. Haugh and Andrew W. Lo. Computationalchallenges in portfoliomanagement.
COMPUTING IN SCIENCE AND ENGINEERING, 3:54 –59, 2001.
[5] Jonathan G. Koomey. Growth in data center electricity use 2005 to 2010. Technical report,
Stanford University, 2011.
[6] Gordon E. Moore. Cramming more components onto integrated circuits. Proceedings of the
IEEE, 86:82–85, 1998.
[7] R.H. Dennard, F.H. Gaensslen, Hwa-Nien Yu, V.L. Rideout, E. Bassous, and A.R. Leblanc.
Design of ion-implanted mosfet’s with very small physical dimensions. Proceedings of the
IEEE, 87(4):668 –678, 1999.
[8] Mark Bohr. A 30 year retrospective on dennard’s MOSFET scaling paper. IEEE Solid-State
Circuits Newsletter, 12(1):11 –13, 2007.
[9] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug
Burger. Dark silicon and the end of multicore scaling. In Proc. of the Int. Symp. on Computer
architecture, pages 365–376, 2011.
151
152 BIBLIOGRAPHY
[10] M. Duranton, D. Black-Schaffer, S. Yehia, and K. De Bosschere. Computing Systems: Re-
search Challenges Ahead The HiPEAC Vision.
[11] John von Neumann. The ﬁrst draft report on the EDVAC. IEEE Annals of the History of
Computing, 15:27–75, 1993.
[12] Stephen Weston, James Spooner, Sebastien Racaniere, and Oskar Mencer. Rapid computation
of value and risk for derivatives portfolios. Concurrency and Computation: Practice and
Experience, 24:880–894, 2011.
[13] O. Pell and V. Averbukh. Maximum performance computing with dataﬂow engines. Computing
in Science Engineering, 14:98 –103, 2012.
[14] S. Walters. Computer-aided prototyping for ASIC-based systems. Design Test of Computers,
IEEE, 8:4 –10, june 1991.
[15] Haoyu Song and John W. Lockwood. Efﬁcient packet classiﬁcation for network intrusion
detection using FPGA. In Proc. Int. Symp. on Field-Programmable Gate Arrays, pages 238–
245, 2005.
[16] K.H. Leung, K.W.Ma, W.K.Wong, and P.H.W. Leong. FPGA implementation of a microcoded
elliptic curve cryptographic processor. In Int. Symp. on Field-Programmable Custom Comput-
ing Machines, pages 68 –76, 2000.
[17] S. Benner, R.J.A. Chen, N.A. Wilson, R. Abu-Shumays, N. Hurt, K.R. Lieberman, D.W.
Deamer, W.B. Dunbar, and M. Akeson. Sequence-speciﬁc detection of individual dna poly-
merase complexes in real time using a nanopore. Nature nanotechnology, 2:718–724, 2007.
[18] C.H. Ho, K.H. Tsoi, H.C. Yeung, Y.M. Lam, K.H. Lee, P.H.W. Leong, R. Ludewig, P. Zipf,
A.G. Ortiz, and M. Glesner. Arbitrary function approximation in HDLs with application to the
N-body problem. In Proc. Int. Conf. on Field-Programmable Technology (FPT), pages 84 –
91, 2003.
BIBLIOGRAPHY 153
[19] Grigorios Mingas and Christos-Savvas Bouganis. Parallel tempering MCMC acceleration us-
ing reconﬁgurable hardware. In Proc. Int. Conf. on Applied Reconﬁgurable Computing, pages
227–238, 2012.
[20] G.L. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, C.C.C. Cheung, D.-U. Lee, R.C.C. Cheung,
and W. Luk. Reconﬁgurable acceleration for Monte-Carlo based ﬁnancial simulation. In Proc.
IEEE Int. Conf. on Field-Programmable Technology, pages 215–224, 2005.
[21] D.B. Thomas, J.A. Bower, and W. Luk. Automatic generation and optimisation of reconﬁg-
urable ﬁnancial Monte-Carlo simulations. In Proc. Int. Conf. on Application-Speciﬁc Systems,
Architectures and Processors, pages 685–689, 2007.
[22] Stephen Weston, J T Marin, James Spooner, Oliver Pell, and Oskar Mencer. Accelerating the
computation of portfolios of tranched credit derivatives. In IEEE Workshop on High Perfor-
mance Computational Finance, pages 1–8, 2010.
[23] NVIDIA. Kepler GK110 whitepaper, 2012.
[24] A. Lee, C. Yau, M. B. Giles, A. Doucet, and C. C. Holmes. On the utility of graphics cards to
perform massively parallel simulation of advanced Monte Carlo methods. Journal of Compu-
tational and Graphical Statistics, pages 769 – 789, 2009.
[25] Qiwei Jin, David Thomas, Diwei Dong, Wayne Luk, Anson H.T. Tse, Gary C.T. Chow, and
Stephen Weston. Multi-level customisation framework for curve based Monte Carlo ﬁnancial
simulations. In Proc. Int. Workshop on Reconﬁgurable Computing: Architectures, Tools and
Applications, 2012.
[26] Qiwei Jin, D.B. Thomas, and W. Luk. Exploring reconﬁgurable architectures for explicit ﬁ-
nite difference option pricing models. In Proc. Int. Conf. on Field Programmable Logic and
Applications, pages 73–78, 2009.
[27] Qiwei Jin, W. Luk, and D.B. Thomas. Unifying ﬁnite difference option-pricing for hardware
acceleration. In Int. Conf. on Field Programmable Logic and Applications, pages 6 –9, 2011.
154 BIBLIOGRAPHY
[28] Tobias Becker, Qiwei Jin, Wayne Luk, and Stephen Weston. Dynamic constant reconﬁguration
for explicit ﬁnite difference option pricing. In Proc. Int. Conf. on ReConFigurable Computing
and FPGAs, pages 176–181, 2011.
[29] Qiwei Jin, T. Becker, W. Luk, and D. Thomas. Optimising explicit ﬁnite difference option
pricing for dynamic constant reconﬁguration. In Int. Conf. on Field Programmable Logic and
Applications, pages 165 –172, 2012.
[30] J.C. Hull. Options, Futures and Other Derivatives. Prentice Hall, 6th edition, 2005.
[31] F. Black and M.S. Scholes. The pricing of options and corporate liabilities. Journal of Political
Economics, 81:637–659, 1973.
[32] A. G. Z. Kemna and A. C. F. Vorst. A pricing method for options based on average asset values.
Journal of Banking & Finance, 14(1):113–129, 1990.
[33] Thomas S Y Ho and Sang-bin Lee. Term structure movements and pricing interest rate contin-
gent claims. Journal of Finance, 41(5):1011–29, December 1986.
[34] Oldrich Vasicek. An equilibrium characterization of the term structure. Journal of Financial
Economics, 5(2):177–188, November 1977.
[35] David Heath, Robert Jarrow, and Andrew Morton. Bond pricing and the term structure of
interest rates: A new methodology for contingent claims valuation. Econometrica, 60(1):77–
105, January 1992.
[36] Roberto Reno and Adamo Uboldi. PCA based calibration of an HJM model. Quaderno IAC,
Q28-002, 2002.
[37] Oren Cheyette. Term structure dynamics and mortgage valuation. Journal of Fixed Income,
1:28–41, 1992.
[38] Peter Ritchken and L. Sankarasubramanian. Volatility structures of forward rates and the dy-
namics of the term structure. Mathematical Finance, 5(1):55–72, 1995.
[39] Paul Glasserman. Monte Carlo Methods in Financial Engineering. Springer, 1 edition, 2003.
BIBLIOGRAPHY 155
[40] John C. Cox, Stephen A. Ross, and Mark Rubinstein. Option pricing: A simpliﬁed approach.
Journal of Financial Economics, 7(3):229–263, 1979.
[41] Phelim P. Boyle. Option valuation using a three jump process. International Options Journal,
3:7–12, 1986.
[42] George M. Jabbour, Marat V. Kramin, Timur V. Kramin, and Stephen D. Young. Multino-
mial lattices and derivatives pricing. In Advances in Quantitative Analysis of Finance And
Accounting. World Scientiﬁc Publishing, 2005.
[43] Qiwei Jin, David B. Thomas, Wayne Luk, and Benjamin Cope. Exploring reconﬁgurable
architectures for tree-based option pricing models. ACM Trans. Reconﬁgurable Technol. Syst.,
2:21:1–21:17, September 2009.
[44] Narayan Ganesan, Roger D. Chamberlain, and Jeremy Buhler. Acceleration of binomial op-
tions pricing via parallelizing along time-axis on GPU. In Symp. on Application Accelerators
in High Performance Computing, 2009.
[45] Chris Wynnyk and Malik Magdon-Ismail. Pricing the American option using reconﬁgurable
hardware. In Intl. Conf. on Computational Science and Engineering, pages 532 – 536, 2009.
[46] Ari D. Andricopoulos, Martin Widdicks, Peter W. Duck, and David P. Newton. Universal
option valuation using quadrature methods. Journal of Financial Economics, 67(3):447–471,
2003.
[47] A.H.T. Tse, D. Thomas, and W. Luk. Design exploration of quadrature methods in option
pricing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(5):818 –826,
may 2012.
[48] Anson H.T. Tse, Gary C.T. Chow, Qiwei Jin, David Thomas, and Wayne Luk. Optimising
performance of quadrature methods with reduced precision. In Proc. Int. Workshop on Recon-
ﬁgurable Computing: Architectures, Tools and Applications, 2012.
[49] Peter Carr and Dilip B. Madan. Option valuation using the fast fourier transform. Journal of
Computational Finance, 2:61–73, 1999.
156 BIBLIOGRAPHY
[50] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Pu¨schel. Computer genera-
tion of hardware for linear digital signal processing transforms. ACM Transactions on Design
Automation of Electronic Systems, 17(2), 2012.
[51] L Elden. Numerical solution of the sideways heat equation by difference approximation in
time. Inverse Problems, 11(4):913–923, 1995.
[52] Liping Gao, Bo Zhang, and Dong Liang. The splitting ﬁnite-difference time-domain methods
for Maxwell’s equations in two dimensions. J. Comput. Appl. Math., 205(1):207–230, 2007.
[53] Daniel J. Duffy. Robust and accurate ﬁnite difference methods in option pricing one factor
models. working report Datasim, 2001.
[54] Jean E. Vuillemin, Patrice Bertin, Didier Roncin, Mark Shand, Herve´ H. Touati, and Philippe
Boucard. Programmable active memories: reconﬁgurable systems come of age. IEEE Trans.
on VLSI, 4(1):56–69, 1996.
[55] E. Motuk, R. Woods, and S. Bilbao. Implementation of ﬁnite difference schemes for the wave
equation on FPGA. In Proc. Int. Conf. on ASSP, pages 237–240, 2005.
[56] G.W. Morris and M. Aubury. Design space exploration of the European option benchmark
using Hyperstreams. In Proc. Int. Conf. on Field Programmable Logic and Applications, pages
5–10, 2007.
[57] Anson H.T. Tse, David B. Thomas, K.H. Tsoi, and Wayne Luk. Reconﬁgurable control variate
Monte-Carlo designs for pricing exotic options. In Proc. Int. Conf. on Field Programmable
Logic and Applications, pages 364–367, 2010.
[58] Xiang Tian and Khaled Benkrid. American option pricing on reconﬁgurable hardware using
Least-Squares Monte Carlo method. In Proc. Int. Conf. on Field-Programmable Technology,
pages 263 –270, 2009.
[59] Xiang Tian and C. Bouganis. A run-time adaptive FPGA architecture for Monte Carlo simu-
lations. In Proc. Int Conf. On Field Programmable Logic and Applications, pages 116 –122,
2011.
BIBLIOGRAPHY 157
[60] Gary Chun Tak Chow, Anson Hong Tak Tse, Qiwei Jin, Wayne Luk, Philip H.W. Leong, and
David B. Thomas. A mixed precision Monte Carlo methodology for reconﬁgurable accelerator
systems. In Proc. Int. Symp on Field Programmable Gate Arrays, pages 57–66, 2012.
[61] Daniel J. Duffy. Finite Difference Methods in Financial Engineering. Wiley, 2006.
[62] George S. Fishman. Monte Carlo Concepts, Algorithms, and Applications. Springer, 1995.
[63] L. Le Cam. The central limit theorem around 1935. Statistical Science, 1:78–91, 1986.
[64] Richard J. Larsen and Morris L. Marx. An Introduction to Mathematical Statistics and Its
Applications. Pearson Education, 2011.
[65] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C.
Phillips. GPU computing. Proc. of the IEEE, 96(5):879–899, 2008.
[66] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A uniﬁed graphics
and computing architecture. IEEE Micro, 28(2):39 –55, 2008.
[67] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming
with CUDA. Queue, 6(2):40–53, 2008.
[68] Khronos group. http://en.wikipedia.org/wiki/Khronos_Group, 2012.
[69] J.E. Stone, D. Gohara, and Guochun Shi. OpenCL: A parallel programming standard for het-
erogeneous computing systems. Computing in Science Engineering, 12(3):66 –73, 2010.
[70] Kamran Karimi, Neil G. Dickson, and Firas Hamze. A performance comparison of cuda and
opencl. CoRR, 2010.
[71] Jianbin Fang, A.L. Varbanescu, and H. Sips. A comprehensive performance comparison of
CUDA and OpenCL. In Proc. Int. Conf. on Parallel Processing (ICPP), pages 216 –225, 2011.
[72] George A. Constantinides. Word-length optimization for differentiable nonlinear systems.
ACM Trans. on Design Automation of Elect. Sys., 11(1):26–43, 2006.
158 BIBLIOGRAPHY
[73] Ki-Il Kum and Wonyong Sung. Combined word-length optimization and high-level synthesis
of digital signal processing systems. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 20:921 –930, 2001.
[74] Claire F. Fang, Rob A. Rutenbar, and Tsuhan Chen. Fast, accurate static analysis for ﬁxed-
point ﬁnite-precision effects in DSP designs. In Proc. Int. Conf. on Computer-aided design,
2003.
[75] Altaf Abdul Gaffar, Oskar Mencer, Wayne Luk, and Peter Y. K. Cheung. Unifying bit-width
optimisation for ﬁxed-point and ﬂoating-point designs. In IEEE Symp. on Field-Programmable
Custom Computing Machines, 2004.
[76] D.-U. Lee, A.A. Gaffar, R.C.C. Cheung, O. Mencer, W. Luk, and G.A. Constantinides.
Accuracy-guaranteed bit-width optimization. Computer-Aided Design of Integrated Circuits
and Systems, IEEE Transactions on, 25:1990 –2000, 2006.
[77] D. Boland and G.A. Constantinides. Automated precision analysis: A polynomial algebraic
approach. In IEEE Int. Symp. on Field-Programmable Custom Computing Machines (FCCM),
2010.
[78] G.C.T. Chow, K.W. Kwok, W. Luk, and P. Leong. Mixed precision processing in reconﬁgurable
systems. In Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pages 17
–24, 2011.
[79] Maya Gokhale, Janette Frigo, Christine Ahrens, Justin Tripp, and Ron Minnich. Monte Carlo
radiative heat transfer simulation on a reconﬁgurable computer. In Proc. Int. Conf. on Field
Programmable Logic and Applications, 2004.
[80] Alexander Kaganov, Asif Lakhany, and Paul Chow. FPGA acceleration of multifactor CDO
pricing. ACM Trans. Reconﬁgurable Technol. Syst., 4:20:1–20:17, 2011.
[81] Akila Gothandaraman, Gregory D. Peterson, G. L. Warren, Robert J. Hinde, and Robert J.
Harrison. FPGA acceleration of a quantum Monte Carlo application. Parallel Computing,
34(4-5):278–291, 2008.
BIBLIOGRAPHY 159
[82] Ramon Edgar Moore. Interval arithmetic and automatic error analysis in digital computing.
PhD thesis, Stanford University, 1963.
[83] C.F. Fang, Tsuhan Chen, and R.A. Rutenbar. Floating-point error analysis based on afﬁne
arithmetic. In Proc. Int. Cont. on Acoustics, Speech, and Signal Processing, volume 2, pages
II – 561–4 vol.2, 2003.
[84] Xiang Tian and K. Benkrid. Design and implementation of a high performance ﬁnancial
Monte-Carlo simulation engine on an FPGA supercomputer. In Proc. Int. Conf on FPT, pages
81 –88, 2008.
[85] Donald E. Knuth. The Art of Computer Programming, Seminumerical algorithms. Addison-
Wesley, 1997.
[86] Y. Iskander, S. Craven, A. Chandrasekharan, S. Rajagopalan, G. Subbarayan, T. Frangieh, and
C. Patterson. Using partial reconﬁguration and high-level models to accelerate FPGA design
validation. In Proc. Int. Conf. on Field-Programmable Technology, pages 341 –344, 2010.
[87] Federico Nava, Donatella Sciuto, Marco Domenico Santambrogio, Stefan Herbrechtsmeier,
Mario Porrmann, Ulf Witkowski, and Ulrich Rueckert. Applying dynamic reconﬁguration
in the mobile robotics domain: A case study on computer vision algorithms. ACM Trans.
Reconﬁgurable Technol. Syst., 4:29:1–29:22, 2011.
[88] Dirk Koch and Jim Torresen. FPGASort: a high performance sorting architecture exploiting
run-time reconﬁguration on FPGAs for large problem sorting. In Proc. Int. Symp. on Field
programmable gate arrays, pages 45–54, 2011.
[89] K.M.G. Purna and D. Bhatia. Temporal partitioning and scheduling data ﬂow graphs for re-
conﬁgurable computers. Computers, IEEE Transactions on, 48:579 –590, 1999.
[90] M. Kaul and R. Vemuri. Optimal temporal partitioning and synthesis for reconﬁgurable archi-
tectures. In In Proc. Design, Automation and Test in Europe, pages 389 –396, 1998.
160 BIBLIOGRAPHY
[91] K. Bruneel, F. Abouelella, and D. Stroobandt. Automatically mapping applications to a self-
reconﬁguring platform. In Proc Int. Conf on Design, Automation and Test in Europe, pages
964 –969, 2009.
[92] T. Becker, W. Luk, and P.Y.K. Cheung. Energy-aware optimisation for run-time reconﬁgura-
tion. In Proc. Int. Symp. on Field-Programmable Custom Computing Machines, pages 55–62,
2010.
[93] Celoxica. Handel-C language reference manual.
[94] Celoxica. DSM manual.
[95] Celoxica. Hyperstreams user manual.
[96] Maxeler. MaxCompiler Kernel Compiler Tutorial, 2011.
[97] D. Greaves and S. Singh. Kiwi: Synthesis of FPGA circuits from parallel programs. In Proc.
Int. Symp. on Field-Programmable Custom Computing Machines, pages 3 –12, 2008.
[98] David B. Thomas and Wayne W. Luk. Framework for development and distribution of hard-
ware acceleration. Reconﬁgurable Technology: FPGAs and Reconﬁgurable Processors for
Computing and Communications, 4867(1):60–70, 2002.
[99] Geoff Coulson, Gordon S. Blair, Michael Clarke, and Nikos Parlavantzas. The design of a
conﬁgurable and reconﬁgurable middleware platform. Distrib. Comput., 15:109–126, 2002.
[100] J.G.F. Coutinho, J. Jiang, and W. Luk. Interleaving behavioral and cycle-accurate descriptions
for reconﬁgurable hardware compilation. In Proc. IEEE Symp. on Field-Programmable Custom
Computing Machines, pages 245–254, 2005.
[101] A.L. Slade, B.E. Nelson, and B.L. Hutchings. Reconﬁgurable computing application frame-
works. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, pages
251–260, 2003.
[102] W. Cescirio, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A.A.
Jerraya, and M. Diaz-Nava. Component-based design approach for multicore SoCs. In In
Proc. Design Automation Conference, pages 789 – 794, 2002.
BIBLIOGRAPHY 161
[103] Patrick Schaumont and Ingrid Verbauwhede. A component-based design environment for ESL
design. IEEE Des. Test, 23:338–347, 2006.
[104] ANTLR. http://www.antlr.org/.
[105] David B. Thomas and Wayne Luk. High quality uniform random number generation using
LUT optimised state-transition matrices. J. VLSI Signal Process. Syst., 47:77–92, 2007.
[106] D.B. Thomas and W.; Luk. Non-uniform random number generation through piecewise linear
approximations. In Proc. Int. Conf. on Field Programmable Logic and Applications, 2006.
[107] F. de Dinechin and B Pasca. Floating-point exponential functions for DSP-enabled FPGAs. In
Proc. Int. Conf. on Field-Programmable Technology, 2010.
[108] Laurent Fousse, Guillaume Hanrot, Vincent Lefe`vre, Patrick Pe´lissier, and Paul Zimmermann.
Mpfr: A multiple-precision binary ﬂoating-point library with correct rounding. ACM Trans.
Math. Softw., 33, 2007.
[109] Qiwei Jin, David B Thomas, Wayne Luk, and Benjamin Cope. Exploring reconﬁgurable archi-
tectures for binomial-tree pricing models. In Proc. Int. workshop on Applied Reconﬁgurable
Computing, pages 245–255, 2008.
[110] Anson H.T Tse, David B. Thomas, and Wayne Luk. Accelerating quadrature methods for
option valuation. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines,
2009.
[111] Robert Strzodka and Dominik Go¨ddeke. Pipelined mixed precision algorithms on FPGAs for
fast and accurate PDE solvers from low precision components. In Proc. IEEE Symp. on Field-
Programmable Custom Computing Machines, pages 259–268, 2006.
[112] Jan Vecer. Uniﬁed pricing of Asian options. Risk, 15(6):113–116, 2002.
[113] Casey Reardon, Eric Grobelny, Alan D. George, and Gongyu Wang. A simulation framework
for rapid analysis of reconﬁgurable computing systems. ACM Trans. Reconﬁgurable Technol.
Syst., 3:25:1–25:29, November 2010.
162 BIBLIOGRAPHY
[114] Qiwei Jin, Thomas David, and Wayne Luk. On comparing ﬁnancial option price solvers on
FPGA. In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, 2011.
[115] M. Giles and X. Su. Notes on using the NVIDIA 8800 GTX graphics card. Oxford University,
2007.
[116] NVIDIA. NVIDIA CUDA programming guide. NVIDIA website, 2008.
[117] Xinyu Niu, Qiwei Jin, W. Luk, Qiang Liu, and O. Pell. Exploiting run-time reconﬁguration in
stencil computation. In Proc. Int. Conf on Field Programmable Logic and Applications (FPL),
pages 173–180, 2012.
[118] M. J. Brennan and E. S. Schwartz. Finite difference methods and jump processes arising in
the pricing of contingent claims: A synthesis. Journal of Financial and Quantitative Analysis,
13(3):461–474, 1978.
[119] S. Singh, J. Hogg, and D. McAuley. Expressing dynamic reconﬁguration by partial evaluation.
In IEEE Symp. on Field-Programmable Custom Computing Machines, pages 188 –194, 1996.
[120] Scott McMillan and Cameron Patterson. JBits implementations of the advanced encryption
standard (rijndael). In In Proc. Int. Conf. on Field-Programmable Logic and Applications,
pages 162–171, 2001.
[121] Dan Benyamin, John Villasenor, and Wayne Luk. Optimizing FPGA-based vector product
designs. In In Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, pages
188–197, 1999.
[122] Charles M. Grinstead and J. Laurie Snell. Introduction to probability. American Mathematical
Society, 1997.
[123] Florent De Dinechin, Jeremie Detrey, Cret Octavian, and Radu Tudoran. When FPGAs are
better at ﬂoating-point than microprocessors. Laboratoire de lInformatique du Paralllisme re-
search report RR2007-40, 2007.
BIBLIOGRAPHY 163
[124] C. Claus, B. Zhang, W. Stechele, L. Braun, M. Hubner, and J. Becker. A multi-platform
controller allowing for maximum dynamic partial reconﬁguration throughput. In Proc. Int.
Conf. on Field Programmable Logic and Applications, pages 535–538, 2008.
[125] Carbon Trust. Conversion factors, 2011.
[126] Grigorios Mingas, Farhan Rahman, and Christos-Savvas Bouganis. On optimizing the arith-
metic precision of MCMC algorithms. In Proc. Int. Symp. on Field-Programmable Custom
Computing Machines, pages 181–188, 2013.
[127] Huabin Ruan, Xiaomeng Huang, Haohuan Fu, Guangwen Yang, W. Luk, S. Racaniere, O. Pell,
and Wenjing Han. An FPGA-based data ﬂow engine for Gaussian Copula Model. In Proc. Int.
Symp. on Field-Programmable Custom Computing Machines, pages 218–225, 2013.
[128] Lin Gan, Haohuan Fu, Wayne Luk, Chao Yang, Wei Xue, Xiaomeng Huang, Youhui Zhang,
and Guangwen Yang. Accelerating solvers for global atmospheric equations through mixed-
precision data ﬂow engine. In Proc. Int. Conf. on Field Programmable Logic and Applications,
pages 1–6, 2013.
[129] Xinyu Niu, Qiwei Jin, Thomas Chau, Wayne Luk, and Qiang Liu. Automating resource op-
timisation in reconﬁgurable design. In Proc. Int. Symp on Field Programmable Gate Arrays,
2012.
[130] P.D.M. Causon and P.C.G. Mingham. Introductory Finite Difference Methods for PDEs. Book-
boon, 2010.
[131] J. Crank and P. Nicolson. A practical method for numerical evaluation of solutions of partial
differential equations of the heat-conduction type. Mathematical Proceedings of the Cam-
bridge Philosophical Society, 43:50–67, 1947.
[132] Yao Zhang, Jonathan Cohen, and John D. Owens. Fast tridiagonal solvers on the gpu. In Proc.
SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 127–136, 2010.
164 BIBLIOGRAPHY
[133] David Warne, Neil A. Kelson, and Ross F. Hayward. Solving tri-diagonal linear systems using
ﬁeld programmable gate arrays. In Proc. Int. Conf. on Computational Methods, November
2012.
[134] T. Bjrk, A. Szepessy, R. Tempone, and G.E. Zouraris. Monte Carlo Euler approximations of
HJM term structure ﬁnancial models. BIT Numerical Mathematics, pages 1–43, 2012.
[135] R. L. McGreevy and L. Pusztai. Reverse Monte Carlo simulation: A new technique for the
determination of disordered structures. Molecular Simulation, 1(6):359–367, 1988.
[136] C.E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. Advances in neural information
processing systems, 15:489–496, 2003.
[137] A. Mu¨cke, R. Engel, JP Rachen, R.J. Protheroe, and T. Stanev. Monte-Carlo simulations of
photohadronic processes in astrophysics. Computer Physics Communications, 124:290–314,
2000.
[138] S. Galanti and A. Jung. Low-discrepancy sequences. The Journal of Derivatives, 5(1):63–83,
1997.
[139] D.J. Aldous and J. Pitman. Brownian bridge asymptotics for random mappings. Random
Structures & Algorithms, 5(4):487–512, 1994.
[140] J. Hull and A. White. The use of the control variate technique in option pricing. Journal of
Financial and Quantitative Analysis, 23(3):237–251, 1988.
[141] JM Hammersley and KW Morton. A new Monte Carlo technique: antithetic variates. Mathe-
matical Proceedings of the Cambridge Philosophical Society, 52(03):449–475, 1956.
[142] JM Hammersley. Conditional Monte Carlo. Journal of the ACM, 3:73–76, 1956.
[143] N.A. Woods and T. VanCourt. FPGA acceleration of quasi-Monte Carlo in ﬁnance. In Proc.
Int. Conf. on Field Programmable Logic and Applications, pages 335–340, 2008.
[144] Paul Willmot. Paul Willmot on Quantitative Finance. Wiley, 2 edition, 2006.
[145] Bruno Dupire. Pricing with a smile. Risk, pages 18–20, 1994.
BIBLIOGRAPHY 165
[146] Jesper Andreasen and Brian Norsk Huge. Volatility interpolation. Risk, pages 86–89, 2011.
[147] C.H. Reinsch. Smoothing by spline functions. Numerische Mathematik, 10(3):177–183, 1967.
[148] R. J. Renka. Interpolatory tension splines with automatic selection of tension factors. Soci-
ety for Industrial and Applied Mathematics, Journal for Scientiﬁc and Statistical Computing,
8:393–415, 1987.
