Abstract
Introduction
The financial derivatives trade sees constant innovation and development, with new types of options introduced each year, offering increasingly sophisticated features and complex settlement terms. Although the basic European option can be priced with a closed-form solution, many other derivatives with knock-out/knock-in features (e.g. Accumulator, Decumulator, and Barrier Options), changing strike prices, or discrete settlement days, cannot be priced easily.
Numerical techniques have been developed to value these complex derivative products accurately and efficiently. These techniques include lattice (binomial and trinomial trees), finite-difference, Monte Carlo and quadrature methods.
Quadrature methods have been applied in different areas including modeling credit risk [1] , solving electromagnetic problems [2] and calculating photon distribution [3] . It is a powerful way of pricing path-dependent options where the path is monitored in discrete time. A lookback discrete barrier option priced using quadrature methods is more than 1000 times faster than using the trinomial method, while achieving a more accurate result [4] .
Using quadrature methods to price a single option is fast and can typically be performed in milliseconds on desktop computers. However, quadrature methods can become a computational bottleneck when a huge number of options are being revalued in real-time using live data-feeds. This paper shows how to accelerate the quadrature computation using Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). The main contributions of this paper are:
• A parallel hardware architecture for option pricing based on quadrature methods (Section 4).
• Implementation of quadrature pricing on FPGA and GPUs (Section 5).
• Evaluation and comparison of the FPGA and GPU implementations against each other, and against a software implementation (Section 6).
Related work
Previous work on hardware acceleration of financial simulation has been focused mainly on Monte Carlo methods. A pipelined datapath architecture and an on-chip instruction processor have been reported for speeding up the Brace, Gatarek and Musiela (BGM) interest rate model for pricing interest rate derivatives [5] . An automated methodology has been developed which produces optimized pipelined designs with thread-level parallelism based on high-level mathematical descriptions of financial simulation [6] . A stream-oriented FPGA-based accelerator with higher performance than GPUs and Cell processors has been proposed for evaluating European options [7] . However, the computational cost of the Monte Carlo approach is high, as in order to improve the accuracy of the result by a factor of n, the sample size has to be multiplied by n 2 times.
A pipelined hardware architecture has been developed for the binomial and trinomial option models [8] . However, lattice and finite-difference methods contain two main types of error: "distribution error" and "non-linearity error". Distribution error occurs because a continuous lognormal distribution is approximated by a discrete distribution. Non-linearity error occurs because the lattice or grid cannot cater for non-linearity in option price for certain values of the underlying asset. Non-linearity in option pricing is frequent for exotic options: for example in a discrete barrier option, at every barrier there is a non-linearity in the option price.
A numerical approach for option pricing based on quadrature methods has been proposed, which overcomes these errors [4] , and demonstrates accurate and fast calculation. The paper illustrates the effectiveness of quadrature methods in pricing path-dependent options, e.g. discrete moving barrier options, multiply compounded options, Bermudan options and American call options with changing strike price. Our approach involves repeated evaluations of a complex function which can be mapped to hardware efficiently in a fully pipelined and parallel architecture.
Option pricing and quadrature methods
To understand the option pricing with quadrature methods, we first consider the Black and Scholes partial differential equation [9] for an option with underlying asset following geometric Brownian motion:
where V (S,t) denotes the price of the option, S denotes the value of the underlying asset, t denotes time, T denotes the time to maturity, r denotes risk-free interest rate, σ denotes volatility of the underlying asset, E denotes exercise price and D c denotes continuous dividend yield.
The following standard transformations
give us the solution of V (x,t) as:
where
Eq. (2) contains an integral which cannot be evaluated analytically. Although for European options they can be converted to the probability density function for normal distribution, for more complicated options numerical techniques are required to evaluate the integrals. For the evaluation of other complicated options such as discrete barrier options and American options, the valuation problem can be arranged to exploit consecutive time intervals and apply Eq. (2) iteratively.
There are many different methods of numerical integral evaluation. Two of the most common methods include the trapezoidal rule and Simpson's rule [10] :
Trapezoidal rule: The trapezoidal rule is the simplest quadrature method but is the slowest to converge. It converges at a rate of (δ y) 2 . The approximation equation is:
Simpson's rule: The Simpson's rule is the most popular method for approximating integrals. It converges at a rate of (δ y) 4 which means a doubling of the number of steps reduces the error by a factor of sixteen. The approximation equation is:
Parallel Architecture
Our system architecture is not designed for pricing a specific option, so the implementation of all kinds of options must be considered. It has been shown that most of the equity options can be expressed in integral forms and solved by quadrature methods including: European options, discrete barrier options, moving discrete barrier options, Bermudan put options, American call options and lookback options [4, 11] . However, the quadrature evaluation procedures are slightly different for different types of options. Different types of options have different discontinuities, which lead to different integral boundaries. Some options contain option specific parameters: for example, the knock-out prices and number of periods are required for discrete barrier options. Although European options can be priced with a single quadrature step, most of the other options need to be evaluated iteratively from the price in period of m to m − 1. Therefore using different number of quadrature steps is required. Table 1 shows some of the pricing equations for different option types as illustration. We can see from the table that the pricing equations are different in terms of the integration range and the operation flow. For discrete barrier Option type: options, the result of C m+1 is required for the evaluation of C m . The final option value C 0 has to be evaluated iteratively. Table 2 shows the computation complexity for different types of options. The computation complexity depends on the evaluation flow, the number of integration grid points N and the number of time steps m.
System architecture
An interesting finding in Table 1 and Table 2 is that all option pricing equations require the evaluation of a similar integral on B(x, y)V (y,t + Δt). Using quadrature methods requires the evaluation of the function B(x, y) intensively, which is the computation bottleneck. Although function A(x) is also required to be evaluated repeatedly, the computation complexity for A(x) is much lower compared to B(x, y) with an order of N. As a result, our hardware acceleration is focused on the effective evaluation of the integral and the flexibility for a general option pricing framework, as illustrated in Figure 1 .
The system architecture of the generic option valuation system using quadrature method is shown in Figure 1 . The architecture consists of the following components: (a) a pre-processing block, (b) one or more QUAD evaluation core(s), (c) a post-processing block, and (d) a main control unit. Data input for the system is, K1, K2, T, So, E, r, D c , σ , option-type and option-specific-parameters. The optiontype and option-specific-parameters provide the flexibility to support the pricing of other options. For example, we could specify the number of periods (m) and the knockout/knock-in prices (b) for barrier options.
The parameter K1 denotes the grid density factor. As the underlying asset follows a lognormal distribution and the change of price exhibits Brownian motion, the value of y fluctuates proportional to √ Δt. As a result, K1 is defined as:
Therefore, increasing the value of K1 leads to a smaller value of δ y and a denser grid. The parameter K2 denotes the grid size factor. It is not possible to integrate a function from −∞ to +∞ numerically. Our quadrature method will evaluate from y min to y max where:
As a result, a large value of K2 leads to a large/small value of y max /y min and a wider grid. K2 can also be viewed as the number of standard deviations from y to the original position of x after Δt. 2) from the pre-processing block and output the result to the postprocessing block. The post-processing block combines the integral value with the value of A(x) and produces the value of V (x,t). The main control unit then decides whether V (x,t) is the final solution or a temporary result for the next iteration.
Architecture Style
The C-Slow approach [12] has been used in many FPGA based simulations and is particularly appropriate in this system in order to achieve high clock rates while still maintaining high throughput. C-Slow operation can be achieved by modelling multiple QUAD evaluation cores in parallel. We continuously provide the pipeline with parameters to evaluate other integral values while we are waiting for the results required for the current integral. The main controller manages the overall timing of the system and ensures that the intermediate values are stored and retrieved correctly.
The QUAD evaluation core is implemented in hardware for 3 main reasons. First, more than one QUAD evaluation core can be fitted on a single FPGA. Therefore, several quadratures can be evaluated simultaneously to exploit parallelism. Second, the evaluation of the function B(x, y) could be implemented in pipelined hardware which is fast and efficient. The value of B(x, y) can be obtained in every clock cycle. Third, as shown previously in Table 2 , the evaluation of the quadrature is the computation bottleneck, which would benefit from hardware acceleration.
The main control unit, pre-processing and postprocessing blocks are implemented in software for the following reasons. First, this increases the flexibility to support other options. Second, the evaluation in pre-processing and post-processing blocks is not the performance bottleneck. Implementing them on the hardware would not improve the performance significantly.
The proposed software-hardware architecture offers fast and parallel hardware evaluation cores for the repeated numerical integrations, and provides flexibility to a versatile option evaluation platform.
Optimising QUAD evaluation core
An effective implementation of the QUAD evaluation core is to create a tree of pipelined operators. Figure 3 shows an operator tree based on Eq. 2 to Eq. 7. In Figure 3 , Δt, x, y i , σ , r, D c ,V (y i ,t + Δt) and I i are fed to the evaluation tree continuously. I i denotes the integration coefficients in the quadrature method. Using the trapezoidal rule or Simpson's rule, I i will be in a sequence of 1, 2, 2,...,2, 1 or 1, 4, 2, 4, 2,...,2, 4, 1 respectively as the coefficients in Eq. 6 and Eq. 7. The accumulated value should be multiplied in post-processing with δ y 2 for trapezoidal rule, or with δ y 6 for Simpson's rule. Therefore, the type of quadrature rule can be specified outside the QUAD evaluation cores.
However, the straight forward implementation consumes a large amount of hardware resources as it requires many floating-point operators. The optimized design is shown in Figure 4 , and will be used to produce implementations on both FPGA (Section 5.1) and GPUs (Section 5.2). The optimized quadrature operator tree takes the following data input: x, y i ,C1,C2, I i and V (y i ,t + Δt). We define:
The operator tree is optimized by identifying the nonchanging nodes during the pipelined evaluation. The values of C1 and C2 are fixed for the values of y i , I i ,V ( y i ,t +Δt), i ∈ [0, N]. Therefore, C1 and C2 can be pre-computed in the pre-processing stage and passed to QUAD evaluation cores. The hardware size is therefore reduced significantly and the number of parameters is also reduced. The parameters of I i and V ( y i ,t + Δt) are passed to the QUAD evaluation cores together. For a grid integration with N steps, the total number of parameters required is of the order 2N, which is a 33% reduction from the original design which is of the order 3N. 
Implementation

QUAD evaluation core on FPGA
Our FPGA implementation of the QUAD evaluation cores is based on HyperStreams and the Handel-C programming language. HyperStreams is a high-level abstraction language and library [7] . It can produce a fully-pipelined hardware implementation with automatic optimization of operator latency at compile time. This feature is useful when implementing a complex algorithm core. Figure 5 shows the diagram of a fully-pipelined QUAD evaluation core based on the design in Figure 4 . The grey boxes denote the pipeline balancing registers that are allocated automatically by HyperStreams. The QUAD evaluation core therefore produces the value of B(x, y) in Eq. (2) for every clock cycle. From Figure 5 , we can see that it requires initially 7 clock cycles for the partial integral value to reach the accumulator. As the evaluation is pipelined, we can have a partial integral value for every clock cycle after. For an FPGA running at 100MHz, the QUAD evaluation core can produce 100M partial integral values per second.
Bandwidth is one of the major concerns in a hardware accelerated system. For an FPGA running at 100MHz, 100M partial integral values can be obtained per second which require a bandwidth of ≈800MBytes/second for input parameters. The Celoxica RCHTX-XV4 supports a high speed HTX interface for data transfer up to 3.2GBytes/second which is sufficient for the I/O requirement [13] . We also assume that the requests for quadrature evaluation are received concurrently. This assumption allows us to use high-latency pipelined functional units to achieve high clock rate while still achieving high throughput..
QUAD evaluation core on GPU
Graphics Processing Units (GPUs) have been used for acceleration in various applications [14] [15] . Our implementation on GPUs is based on Compute Unified Device Architecture (CUDA) API provided for nVidia GPUs [16] . A typical CUDA co-processing flow is shown in Figure 6 . Under the CUDA environment, a function can be compiled into a "kernel". The execution of CUDA is organized as a computation grid as shown in Figure 7 . Each computation grid consists of a grid of thread blocks. The "kernel" is executed by all threads in parallel. Each block and thread has a unique block ID and thread ID.
The QUAD evaluation core is implemented in CUDA to exploit parallelism. Similar to the implementation on FPGA, we implement the evaluation core in CUDA based on the optimized operator tree in Figure 4 . In addition, the 
Evaluation and comparison
In this section, the performance of different implementations of QUAD evaluation core is studied. We choose the pricing of European option with grid density factor K1 = 400, 000 and grid size factor K2 = 10. The typical K1 value of 400 produces highly accurate results, but the reason for choosing a much larger value is to facilitate performance analysis of the QUAD evaluation cores with a longer evaluation time. No matter what values of K1 or K2, the QUAD evaluation cores are still responsible for the computation bottleneck of the option pricing of the order N 2 m as shown in Table 2 . Simpson's rule is preferable to the trapezoidal rule in our system as the error terms of Simpson's rule decrease at a rate of (δ y) 4 which produces more accurate results with the same hardware complexity. Therefore, Simpson's Rule is adopted for the performance analysis. The performance of both 32-bit single precision and 64-bit double precision are investigated.
The accelerations of FPGA implementation and GPUs implementations are in comparison to a reference software implementation. The reference PC is a 3.6GHz Pentium 4 processor with 1GB RAM and Ubuntu 8.04.1 operating system. The software implementation is written using C language and compiled with maximum speed optimization options.
The targeted FPGA is Xilinx Virtex-4 xc4vlx160 as this is the part found in the RCHTX card. The designs are compiled using DK5.1 and Xilinx ISE 9.2. The targeted GPU is nVidia Geforce 8600GT with 256MB of on board RAM and nVidia Tesla C1060 with 4GB of on board RAM. The summary of the performance comparison is shown in Table 4.
FPGA device utilization figures are shown in Table 4 . The result indicates that the FPGA device is fully occupied under double-precision but not under single-precision, so a performance improvement can be achieved by replicating the evaluation core under single-precision.
From the results, it can be seen that the FPGA implementation on the xc4vlx160 achieved 10.9 times acceleration using single-precision, and achieved 9 times of acceleration using double-precision, with 1 QUAD evaluation core. As 3 cores can be replicated on xc4vlx160 using single-precision, 32.8 times acceleration can be achieved in the single-precision case.
For the performance of GPUs, a speedup of 12.5 times is achieved by Geforce 8600GT and a speedup of 59.8 times is achieved by Tesla C1060 in single-precision. In doubleprecision, the Tesla C1060 has shown a 31.6 times speedup over the reference PC, while there is no double-precision support in the Geforce 8600GT.
It can be seen that xc4vlx160 outperforms Geforce 8600GT by 2.6 times in single precision case. It is not surprising that Tesla C1060 GPU outperforms xc4vlx160 by 1.8 times, as Tesla C1060 is the latest GPU based on 65nm fabrication technology while Virtex-4 xc4vlx160 is an older model of FPGA based on 90nm fabrication technology. It would be fair to compare Tesla C1060 with the latest FPGAs such as Virtex-5 from Xilinx or Stratix IV from Altera, which would be addressed in our future work.
It is interesting to note that the xc4vlx160 FPGA demonstrates better performance than the GPUs when power consumption is taken into account. It can be seen that xc4vlx160 can evaluate 8.3 times more than both Geforce 8600GT and Tesla C1060 per unit of power usage under single-precision. In double-precision, xc4vlx160 evaluates 6.6 times more than Tesla C1060 and 93 times more than Pentium 4 per unit of power usage. An interesting observation is that Tesla C1060 is 4.7 times faster than Geforce 8600GT and consumes 4.7 times more power as well. Therefore Tesla C1060 has the same processing/power ratio as Geforce 8600GT.
Further performance improvement for FPGAs can be achieved using more up-to-date devices such as Virtex-5. As Virtex-5 has 4 times more slices than Virtex-4 and with higher basic clock frequency, at least 4 times more speedup can be achieved without further optimization. Also, although complex algorithms can be implemented easily in FPGAs with HyperStreams, maximum performance and utilization of FPGA resources is not guaranteed, as there is a tradeoff when using HyperStreams between the development time and the amount of acceleration that can be achieved. However, our HyperStreams implementation still provides a satisfactory result with significant acceleration over the software implementations. Therefore, HyperStreams is useful for producing prototypes rapidly to explore the design space. Further optimization can be applied after a promising architecture is found.
Fixed-point implementations with sophisticated techniques such as word-length optimization usually enable FPGA to achieve the best performance [17] . However, it is not applicable to quadrature methods as the range of the numerical values ranges widely from small size partial integral values to large size complete integral values.
Conclusion
This paper explores a novel architecture for hardware accelerated option pricing models based on quadrature methods. The proposed system involves a parallel architecture for general option pricing. This includes a highly pipelined datapath capable of supporting quadrature evaluation in parallel. We have implemented our designs on FPGA and GPUs. Our FPGA implementation can generally run 32.8 times faster than a Pentium 4 3.6GHz processor, 2.6 times faster than a GPU in comparable technology and 1.8 times slower than the latest GPU. From the power consumption perspective, the FPGA is up to 8.3 times more power efficient than GPUs and 93 times more power efficient than the CPU.
Current and future work includes exploring designs on the latest FPGAs such as Virtex-5 and Stratix IV, for a more sophisticated comparison with the latest GPUs. The automation of producing efficient FPGA and GPU implementations will also be investigated. Additionally, we intend to extend our work to cover the development of optimized hardware designs based on quadrature methods for a wide variety of applications, including the solutions of electromagnetic problems [2] and calculations involving photon distribution [3] .
