Abstract-Quadrature based methods for numerical integration provide a means of quickly and accurately pricing financial products such as options. These methods can be applied to multidimensional products, such as options on multiple underlying assets, but suffer from an exponential increase in computational complexity as the dimension increases. This paper examines the theoretical complexity of quadrature methods for pricing multi-dimensional options, and then relates this to practical performance in contemporary hardware. An automated system for generating hardware architectures for quadrature is used to explore the performance of increasing dimensionality in FPGA implementations, and then compared them to GPU and CPU solutions. We find that a single-precision FPGA can provide 25 times speedup over software in three dimensions, and offers slightly improved performance over a GPU using comparable technology. The latest GPUs are 2.7 times faster than the older technology Virtex-4 FPGA, but the FPGA still provides over 9 times the energy efficiency.
I. INTRODUCTION
Financial services companies continually invent new ways to repackage and modify financial products in order to satisfy the needs of different investors. Many financial derivatives now involve multiple underlying assets instead of just one. As the computation complexity increases exponentially with the number of underlying assets (i.e. the number of dimensions), finding ways to accelerate the option pricing computation becomes a significant challenge.
Quadrature methods are a powerful technique for options pricing, especially in situations where a closed-form solution does not exist. Quadrature methods can be used to price many types of options and are very efficient for pricing pathdependent options.
A single option with one underlying asset priced using quadrature methods can typically be performed in milliseconds on desktop computers. However, the computation time of quadrature methods increases exponentially with the number of dimensions, which is also known as "the curse of dimensionality". This paper shows how to accelerate quadrature computation with multiple dimensions using Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). The major contributions of this paper are:
• An abstract model of computational complexity for option pricing using quadrature methods, parameterised by dimension and number of time-steps (Section III).
• An automated system for generating hardware architectures and implementation parameterised for a given set of dimensions (Section IV).
• Results showing the performance of designs from our approach, including comparisons with GPU and CPU implementations (Section V).
II. RELATED WORK
Numerical techniques have been developed to value complex financial products where a closed-form solution does not exist. These techniques include lattice (binomial and trinomial trees), finite-difference, Monte Carlo and quadrature methods. Among the various methods for hardware acceleration of financial simulation, Monte Carlo methods have been the focus in the past few years.
Some recent studies have shifted the focus to pipelined tree based methods [1] , finite-difference methods [2] and quadrature methods [3] . All of these studies have provided solutions for the pricing of certain options (such as American options) which cannot be handled easily by Monte Carlo methods. However, none of the above addresses the issue of pricing options with multiple underlying assets. Reference [3] demonstrates hardware accelerated quadrature option pricing methods. This paper significantly extends it to cover the multiple underlying assets problem and provides the complexity model with the number of dimensions.
III. MODEL ANALAYSIS

A. Option Pricing and Quadrature Methods
To understand option pricing with multiple underlying assets using quadrature methods, we first consider the Black Scholes partial differential equation [4] for an option with all underlying assets following geometric Brownian motion: at t and y i = log(S i ) to be the chosen nodes at t + Δt. Let R be the matrix such that R(i, j) = ρ ij . The solution is:
Eq. (2) (y 1 , . . . , y d , t) for the calculation in the previous time interval. The current option price is therefore calculated iteratively backward from the end of the time. Fig. 1 shows the graphical representation of the iteration process. We define n as the number of possible values (grid points) for y 1 and assume the number of grid points for all y i is the same, and we define m as the number of time 
B. Complexity Analysis
From Eq. (4), the evaluation of E involves the matrix multiplication of column vector α. The calculation of all α i from Eq. (5) can be optimized as:
C1 i and C2 i can be precomputed in a pre-processing stage as the values of all C1 i and C2 i are kept constant throughout the quadrature evaluation. Table I shows the summary of computation complexity for some example options.
The computation time can be estimated by assuming that a 10 GFLOPs CPU is used (the peak performance of a Pentium 4 3.2GHz CPU is around 6.4 GFLOPs) and it takes the same time for all different floating point operators. Fig. 2 shows that the computation time required is drastically increased with the number of dimensions. The computation time required for a European option with 7 underlying assets is 14.7 days using a desktop computer with peak performance.
IV. DESIGN AND IMPLEMENTATION
A. System architecture
The option evaluation system architecture is not designed for pricing a specific type of option. The implementation to support various options with any number of dimensions is considered. The system architecture of the generic option valuation system using quadrature method is shown in Fig. 3 . The architecture consists of the following components: (a) a pre-processing block, (b) QUAD evaluation core, (c) a postprocessing block, and (d) a main control unit.
Option parameter inputs for the system include T, S0 1 
, option-type, optionspecific-parameters and dimension. The option-type and option-specific-parameters provide the flexibility to support the pricing of other options. For example, we could specify the number of periods (m) and the knock-out/ knock-in prices (b) for Barrier options. We could also specify the grid density factor or grid size factor to select the corresponding accuracy.
The most challenging part of the multi-dimensional design is how it supports an arbitrary number of dimensions. For multiple assets design, the hardware evaluation cores are completely different for different dimensions as the underlying logic and the number of pipeline stages are different. As a result, all the possible hardware designs for all possible dimensions are implemented. The resultant bitstream for each dimension is prepared and stored in the system. The hardware evaluation core has to be reconfigured with the bitstream for the corresponding dimension.
B. Pipelined Design
The hardware design of the QUAD evaluation core can be divided into three major parts. The first part is the evaluation of the vector α from Eq. (6). The second part is the matrix multiplication of α T R −1 α. The last part is the rest of the integration. HyperStreams and the Handel-C programming language are used for the FPGA implementation of the QUAD evaluation core. HyperStreams is a high-level abstraction language and library [6] . It produces a fully-pipelined hardware implementation with automatic optimization of operator latency at compile time. Fig. 4 shows the diagram of α T R −1 α calculation inside a 2D QUAD evaluation core. The grey boxes denote the pipeline balancing registers that are allocated automatically by HyperStreams. As the evaluation is pipelined, we can have a partial integral value for every clock cycle. For an FPGA running at 100MHz, the QUAD evaluation core can produce 100M partial integral values per second.
C. Design Automation
Implementing the evaluation core for each dimension manually would be tedious. As a result, an automatic evaluation core generator is designed. The generator accepts 2 input parameters: number of dimensions and precision (single or double). An operator tree is generated and stored in a temporary file. Finally, the operator tree file is parsed and the corresponding HyperStreams and Handel-C codes are generated. The Handel-C code could be compiled for simulation or bit-file generation. Table II shows the FPGA device utilization figures for xc4vlx160 devices of the QUAD evaluation core in different dimensions and precisions. The result indicates that the FPGA device is fully occupied for 2 dimensions under doubleprecision, and is fully occupied for 6 dimensions under singleprecision.
V. EVALUATION AND COMPARISON
In this section, the performance of different implementations of QUAD evaluation core is studied. We choose the pricing of 1,000 European options as the performance benchmark. The performance analysis for pricing 3-underlying assets European options is studied. Simpson's rule is preferable to the trapezoidal rule in our system as the error terms of Simpson's rule decrease at a rate of (δy) 4 which produces more accurate results with the same hardware complexity. The performance of both 32-bit single precision and 64-bit double precision is investigated. The acceleration of FPGA implementation and GPU implementations are compared with software implementation on a PC as a reference. The reference PC is a 3.6GHz Pentium 4 processor with 1GB RAM and Ubuntu 8.04.1 operating system. The software implementation is written in C language and compiled with maximum speed optimization options.
The targeted FPGA is Xilinx Virtex-4 xc4vlx160. The designs are compiled using DK5.1 and Xilinx ISE 10.1. The targeted GPU is nVidia Geforce 8600GT with 256MB of on board RAM and nVidia Tesla C1060 with 4GB of on board RAM. The summary of the performance comparison of 3-dimensional QUAD evaluation core is shown in Table III .
From the result, it can be seen that the FPGA implementation on xc4vlx160 achieved 24.9 times acceleration. For the performance of GPUs, a speedup of 22.7 times is achieved by Geforce 8600GT. A speedup of 163.7 times is achieved by Tesla C1060 under single-precision. In double-precision, Tesla C1060 has shown a 67.2 times speedup over the reference PC. There is no support of double-precision for Geforce 8600GT.
It can be seen that xc4vlx160 outperforms Geforece 8600GT by 9.7%. It is not surprising that Tesla C1060 GPU outperforms xc4vlx160 by 6.6 times, as Tesla C1060 is the latest GPU based on 65nm technology while Virtex-4 xc4vlx160 is an older model of FPGA based on 90nm technology. Future work will compare Tesla C1060 with the latest FPGAs such as Virtex-5 from Xilinx or Stratix IV from Altera.
It is interesting that the xc4vlx160 FPGA demonstrates better performance than the GPUs when energy consumption is taken into account. Xc4vlx160 can compute around 9 times more options than Geforce 8600GT and Tesla C1060 per unit of energy usage. Further performance improvement for FPGAs can be achieved using more up-to-date devices such as Virtex-5. As Virtex-5 has 4 times more slices than Virtex-4 and with higher clock frequency, potentially 4 times more speedup can be achieved without further optimization.
VI. CONCLUSION
This paper explores hardware accelerated option pricing models based on quadrature methods in multiple dimensions. An abstract model of computational complexity for option pricing using quadrature methods, parameterised by dimension and number of time-steps is developed. We implement the design for different dimensions in both FPGA and GPUs. An automated system for generating hardware architectures and implementation parameterised for a given set of dimensions is developed.
The performance of different implementations in higher dimensions is compared. Our implementation on FPGA can generally run 24.9 times faster than a Pentium 4 3.6GHz processor, 10% times faster than a GPU in comparable technology and 2.7 times slower than the latest GPU for 3-dimensional option pricing. From the energy consumption perspective, the FPGA is up to 9 times more efficient than GPUs and 575 times more efficient than the CPU.
Current and future work includes extending our work to cover the development of optimized hardware designs based on quadrature methods for a wide variety of applications, including modeling credit risk.
