Arithmetic Asian options are financial derivatives which have the feature of path-dependency: they depend on the entire price path of the underlying asset, rather than just the instantaneous price. This path-dependency makes them difficult to price, as only computationally intensive Monte-Carlo methods can provide accurate prices. This paper proposes an FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture. The proposed pipelined design is described parametrically, facilitating its re-use for different technologies. An implementation of this architecture in a Virtex-5 xc5vlx330t FPGA at 200MHz is 313 times faster than a multi-threaded software implementation running on a Intel Xeon E5420 quad-core CPU at 2.5GHz; it is also 2.2 times faster than the Tesla C1060 GPU at 1.3 GHz.
INTRODUCTION
Arithmetic Asian options are examples of derivatives: financial instruments whose value is dependent on the price of some underlying asset such as a bond or stock. Unlike simpler options, which provide a payoff depending on the instantaneous price of the underlying, arithmetic Asian options provide a payoff depending on the arithmetic average price of the underlying during the option life-time. This averaging makes arithmetic Asian options cheaper and less sensitive to market manipulation, but also means there is no closed-form solution for the pricing.
Monte-Carlo methods provide a accurate way to price Asian options, but have slow convergence, so a huge number of simulations is needed. FPGAs have previously been explored for financial Monte-Carlo simulations, such as pricing European and American options, as they can exploit the large amounts of parallelism found in simulations. This paper extends the scope of FPGA accelerated simulations to include pathdependent Asian Options.
Our contributions are:
A highly-optimized pipelined parallel FPGA architecture for implementing Asian option pricing; this design is described parametrically to facilitate its re-use for different technologies.
A GPU algorithm for implementing Asian option pricing based on CUDA API.
Evaluation of the FPGA and GPU implementations versus a single-thread implementation on CPU, showing 313 times speedup for the FPGA, and 141 times for the GPU.
ASIAN OPTIONS
An option is a type of financial instrument which provides the owner of the option with the right, but not the obligation, to buy or sell an underlying asset such as a stock or bond at some point in the future. A call option allows the option owner to buy the underlying asset for some pre-agreed strike price K, while a put option gives them the right to sell at price K. The decision to exercise the option (i.e. buy or sell the asset) is always made by the option owner, and the option issuer has to abide by that decision, so the option owner must pay the issuer to create the option. Hence putting an accurate value on an option is critical for both parties.
For simple European call options, the owner can exercise only at the expiry date. If the underlying asset price S at expiry date is higher than the strike price K, the owner can profit by buying the stock at lower price K from the option issuer and then immediately selling it at the higher price S in the market, providing a gain of ( S -K). If the underlying asset price is lower than the strike price, S < K , then the gain is zero because the option will not be exercised.
The payoff of European call option on expiry is and the payoff of European put option at expiry is
For an arithmetic Asian option [4] , the payoff is calculated using the arithmetic average of the prices over the life time of the option. One advantage of this option type is that it is more difficult for the option issuer to manipulate market prices to reduce the option payoff, as the payoff depends on the path followed by the asset price, not just the price at expiry.
The payoff of an arithmetic Asian call option is: [I] . However, there is no such solution for arithmetic Asian options, due to their highly path-dependent properties. Monte-Carlo methods are commonly used to solve this problem. The idea is to generate a huge number of random paths for each probabilistic variable, then take the average of the results. We illustrate the idea with an pricing example with the following parameters So = 1.00, K = 1.03,r = 0.1,T = 2 and steps = 2. Table 1 shows an example of stock price paths. Firstly, 10 stock price paths from t = 0 to t = 2 are simulated. Then the average stock price for each path is calculated as in 'Avg' column. The payoff of each path is then calculated according to Equation 3 as in 'Payoff' column. Finally, the average payoff across all these paths is calculated. The final result is the expected value of the arithmetic Asian call option at t = 2. The arithmetic call option value at present time can be obtained by discounting this final answer backward by multiplying e-'T. The option price in the above example is 0.057. Financial analysis and pricing applications are often computationally intensive, so there has been much interest in the use of accelerators such as FPGAs and in this domain. Research into FPGA-accelerated pricing can be broadly split into two groups: lattice methods, which work backwards from exercise time to the current price, using a pre-determined lattice of asset prices and times; and Monte-Carlo methods, which work forwards from the current asset price to expiry time using multiple randomly chosen paths.
RELATED WORK
Lattice methods targeting FPGAs include binomial trees [3] , finite-difference methods [2] , and Quadrature pricing [ll] . Such algorithms are generally more efficient than Monte-Carlo methods. but thev cannot eas- Monte-Carlo methods are particularly suitable for implementation in FPGAs, as they contain abundant parallelism: each asset-price path can be evaluated in parallel with any other path, allowing exploitation of coarse- grained and pipeline parallelism. An FPGA-accelerated Monte-Carlo application covers pricing under the BGM interest rate model [12] . This provides 25 times speedup over software, using an optimised VHDL design and customised data widths. Other designs for European option have also been presented [lo] . Recent work has focused on complex types of MonteCarlo simulation, such as correlated asset prices [7] , American exercise features [9] , and discrete-event simulations [8] .
This paper extends parallel reconfigurable design for Monte-Carlo simulation to include path-dependent arithmetic Asian options, describing how path-dependency can be incorporated without sacrificing performance.
HARDWARE ARCHITECTURE
In this section, we present our hardware design for Asian option pricing. As shown later, the operator latency is described parametrically, facilitating the design to be re-used in different technologies. Figure 1 shows the overall hardware architecture. There are two main types of components in the design: one or more identical Monte-Carlo cores (MC); and a single shared Coordination Block (CB). The Monte-Carlo cores contain a Gaussian random number generator, a path simulation core and a result aggregation core; each MC core is capable of generating random asset price paths, calculating payoffs, and accumulating the average payoff. Multiple identical MC cores are instantiated to make maximum use of the device, so the Coordination Block manages the MC cores, allowing them to work in parallel to price the same option. The CB is also responsible for communicating with the external controller, for example a PC.
Monte-Carlo core

Gaussian random number generator
The Gaussian random number generator uses the piecewise linear generation method [6] , which provides highquality fixed-point Gaussian samples, while using only a small amount of logic and block-RAMS. To provide a good approximation to the Gaussian distribution, two independent piecewise linear RNGs are used, both of which provide a good approximation to the Gaussian distribution. The outputs of the generators are then added together, providing a better approximation to the Gaussian distribution, due to the Central Limit Theorem.
The resulting Gaussian RNG produces a stream of 24-bit fixed-point random numbers, with a period of 2lZ8. The quality of the stream has been checked with the Chi-squared test for sample sizes up to 232, and shows no significant deviation from the Gaussian distribution.
Path simulation
Under Black Schole's model where the stock price movement is governed by a geometric Brownian motion process, the stock price is given by the equation:
where r is the interest rate, v is the volatility of the underlying stock price, 6 t is the time period between two time steps, W is a Gaussian random number N(0, l ) , Si is the underlying stock price at step i and Si+1 is the underlying stock price at step i + 1. We could define the following equations:
The values of d r i f t and vsqrdt can be precomputed in advance and the path simulation core can simulate the stock price movement with these two static values. The payoff of arithmetic Asian options at expiry time is:
The option price at present time is defined as the payoff discounted backward to the current time as the following equation:
where T is the time to expiry. Therefore, we have the Monte-Carlo Asian option pricing algorithm as shown in Algorithm 1. There are 3 multiplexers namely MUXA, MUXB and MUXC controlling the computation flow. MUXA selects So at the beginning in order to calculate the S1 price. The signal s-path indicates the updated S in the path and is feed back to MUXA. Therefore, MUXA selects signal s-path afterward to provide a loop for iterating next Si. The loop containing MUXA and the second multiplication operator is the "stock price updating loop".
MUXB selects So at the beginning in order to compute the sum of price So and S1 to the result signal S-sum. S-sum is then feed back to MUXB to form the "sum of price updating loop". MUXB selects S-sum afterward after the So + S1 computation.
The number of the pipelined stages must be identical for all the pipelined loops in order to guarantee a consistent computation schedule. Let p be the maximum number of pipeline stages for these pipelined loops. Pipelined registers are added to ensure the number of pipelined stages of all loops equal to p. In this architecture, p = d,. Therefore, a d , -dm cycles pipelined delay register is inserted after the second multiplication oper- ator for balancing. As the computation is pipelined, the feedback result will reappear at the MUXA and MUXB after p cycles. Therefore, we simulate p paths at the same time in this pipelined fashion. The computed 1 step of S for the first simulation will arrive MUXA and MUXB just after the other p -1 computations of the other simulations.
MUXC selects the output of max operator only when that p simulations reached the end of the path (i.e. S-path reached S,). Therefore, MUXC only selects the max operator output for p cycles as the asia~callpayoff at that moment, and selects value 0 for the rest of time.
In conclusion, p path simulations are completed after p x steps + 3da + dm + d, + dd + d, cycles for the pipeline stages. The whole process repeats and we could expect another p completed path simulations after another p x steps cycles. Table 2 summarize the behavior of the MUXs in the path simulation core.
Result aggregation
The architecture of the result aggregation core is shown in Figure 3 . As discussed in Section 4.1.2, a batch of p payoff results are generated for every p x steps cycles and passed to the result aggregation core. These payoff results are accumulated until the number of batches reached the required number of batches. Let Nbatch be the required number of batches, N, , be the required number of Monte-Carlo simulations and C be the number of MC cores in the hardware. Nbatch is defined as: MUXD selects 0 for the first p x steps+4da +dm+d,+ dd + d, cycles for initialization, then it selects the accumulated payoff (signal payofS_sum) afterward to form a "sum of payoff loop". When the number of batches reached Nbatch, we have to aggregate the final p consecutive payoff-sum values together.
To aggregate these p consecutive values using a pstage pipelined adder, we make use of one direct feedback loop and one feedback loop involving a register with special clock-enable timing. MUXF selects 0 for initialization and selects the output of the last adder afterward to form a direct feedback loop. MUXE only selects the payoff-sum for p cycles when the final p consecutive values are ready and then selects the output of the register. The input of the register is connected to the output of the last adder. In other words, the register and MUXE form another feedback loop with a register in the middle. Table 3 shows the behavior of MUX in the result aggregation core. The clock-enable of the D-type register is controlled by a special signal sequence "accr". The "accr" is set to 1 for a dedicated timing so as to buffer the desired intermediate output of the adder. The desired intermediate result stays at the output of the register and the output of the MUXE.
If x is the index of clock cycle and p is the number of pipeline stages, the sequence of signal "accr" is given by the following equation: This register, with a special clock-enable signal sequence, is of general use for a design requiring a "reduce" function in a "map-reduce" computation with commutative operator with any number of pipeline stages. If a multiplier is used as the commutative operator, the result of nf=, Y , will be computed at the register output instead of Cf='=, x.
Coordination Block
Functional description
The Coordination Block is the main control unit 01 the overall hardware architecture. It provides the communication with the host PC. The option parameters are first sent from host PC to the Coordination Block. The Coordination Block then distributes the parameters to all MC cores. The communication time between the FPGA and PC is negligible as there are only a few bytes of input parameters and results transferred between them.
The Gaussian random number generators in MC cores are also initialized by the Coordination Block. Different sequences of bits are connected to different Gaussian random number generators as the random seeds. The Coordination Block also controls the overall timing of the computation. It generates MUXs selection signals and 1 "accr" signal sequence to all MC cores as discussed in previous subsection. The timing of generating these signals is followed strictly by the requirement as in Table 2 and Table 3 .
Path delay optimization
The counters, condition checking and controlling logic are implemented in the CB only instead of in the MC cores. In this way, the logic redundancy is significantly reduced. However, the use of global controlling signals may suffer from a decrease of clock rate due to a long critical path delay.
Path delay consists of 2 parts: logic delay and routing delay. When there are many computational cores, the routing delay of the controlling signal to the farthest computational core will be significant. If the logic delay of producing that controlling signal is long as well, the performance of a parallel architecture will be drastically reduced.
For example, if we set a global controlling signal m to 1 after a complex "condition A" checking, the example hardware code is as follow:
if Condition A then m <= 1 end if Assume that the logic delay for checking complex condition A is 4ns. If there is only 1 MC core, the routing delay for signal m from the logic result to the MC core is very short (<1ns). Therefore, the total path delay is less than 5ns and the hardware could be running at 200MHz. However, if there are many MC cores, the routing delay to the farthest MC core will be very long (up to 4ns in our experiment). If the routing delay is 4ns, the total path delay becomes 8ns and it could only be running at 125MHz, which is a drastic performance reduction.
Therefore, the hardware design of CB is optimized carefully with path delay partitioning. For the above hardware code, we divide it into two parts:
The routing delay of setting an internal state to SETM is very short (<1ns). Therefore, the path delay of the first part is less than 5ns even with the complex condition A checking.
The logic delay of checking "state = SETM " is also very short (<1ns). Therefore, the path delay of the second part including the long routing delay is still less than 5ns. As a result, the overall hardware could still be running at 200MHz.
This path delay partitioning optimization technique can be applied to any hardware design involving one control module sending controlling signals to multiple computational cores. It is an essential technique to maintain a high clock rate while maximizing the degree of parallelism.
GPU IMPLEMENTATION
Graphics Processing Units (GPUs) have been used for acceleration in various application [5] . They are Single Instruction Multiple Data (SIMD) computing devices. Parallelizable tasks are executed on the GPU as "kernel" by a computation grid. Each computation grid consists of a grid of thread blocks. Each block and thread has a unique block ID and thread ID. The "kernel" is executed by all threads in parallel with the same code, but on different sets of data. Intra-block communication between threads can be through the shared memory. The inter-block communication can be through the global memory. The speed of accessing thread's local register is the fastest. The speed of accessing shared memory is faster than accessing global memory. Our implementation on GPU is based on Compute Unified Device Architecture (CUDA) API provided for nVidia GPUs. A typical CUDA co-processing flow involve 4 steps:
• Copy processing data to GPU memory from main memory of the host.
• Instruct GPU to start processing.
• Wait till the threads inside GPU finished executing the kernel in parallel.
• Copy the result back to main memory. We design our CUDA implementation of Asian option pricing by 2 procedures, namely Gaussian random number generator procedure and path simulation procedure.
Gaussian random number generator procedure
In this procedure, we first allocate the GPU's global memory space for the total amount of random numbers that we needed for the simulations. If the number of simulations is N, the number of steps is M and single precision is used, we allocate 4NM bytes of global memory in GPU. Then we execute the Mersenne Twister random number generator kernel using all the threads to generate random numbers at the memory space. A Box-Muller transformation kernel is then executed on that memory space to form Gaussian random numbers.
Path simulation procedure
In the path simulation procedure, each thread simulates the price movement path as Algorithm 1 and sum up the payoff in the shared memory. Therefore, these payoff sums can be accessed by other threads in the same block. The first thread in each block then sum up all the payoff sums within the same block and stored it in the global memory location. Finally, a final aggregation kernel is executed by 1 thread only. This thread sums up the results by all blocks from the global memory location and returns the total payoff sum to the main program. The main program then compute the option price from the returned result.
IMPLEMENTATION RESULT
Our FPGA implementation targeted a Xilinx xc5vlx330t FPGA chip on an Alpha Data ADM-XRC-5T2 card. We design our hardware architecture manually in VHDL to maximize performance. The design is synthesized, mapped, placed and routed using Xilinx ISE 10.1.03. The floating operators used are from Xilinx FloatingPoint Operator 4.0. Single precision floating point arithmetic is used so that the parameters in Figure 2 and Table 4 . Our result shows that no more MC cores can be fitted in the hardware, as all the DSP48Es are used up for 16 MC cores. The performance of different implementations of arithmetic Asian call option pricing is evaluated. We choose an arithmetic Asian call option with parameters So = 100, K = 105, v = 0.15,r = 0.1, T = 10 and steps = 3650. The number of Monte-Carlo simulations is 10,000,0( The acceleration of FPGA implementations and GPU implementations is compared to a reference software implementation. The reference PC contains an Intel Xeon E5420 2.5GHz processor with 16GB RAM. The software implementation is written in the C language with the gsl library and compiled using gcc with speed optimization options -03 using OpenMP multi-threaded API. The number of threads used in software implementation is 4 in order to fully utilize the 4 cores in Xeon E5420. The targeted GPU is nVidia Tesla C1060 with 4GB of on board RAM. The summary of the performance comparison is shown in Table 5 .
From the results, it can be seen that a speedup of 313 times is achieved by the xc5vlx330t FPGA. For the GPU, a speedup of 141 times is achieved by Tesla C1060. The xc5vlx330t FPGA is 2.2 times faster than the Tesla C1060 GPU. The time for pricing an arithmetic Asian option is reduced from 1 hour to 11.4 seconds.
The maximum power usage of different devices is also estimated in Table 5 . The power usage of xc5vlx330t is estimated by Xilinx XPower Estimator 11.4.1 with toggle rate 100% and clock rate 200MHz. Since in our design all the flip-flops do not actually toggle all the time, setting the toggle rate to 100% is just to estimate the upper bound of power usage. It is interesting to note that the xc5vlx330t FPGA demonstrates higher performance in speed than the GPU and CPU, and consumes less power. If we have a cluster of PCs with Intel Xeon 2.5GHz without communication overhead between the nodes, then a cluster with 313 PC nodes would achieve the same performance as an xc5vlx330t FPGA. However, the energy consumption of that cluster would be 736 times more than an xc5vlx330t FPGA.
CONCLUSION
This paper presents a high performance hardware architecture for Asian option pricing. A CUDA based GPU implementation is also presented for performance comparison. To our knowledge, this is the first reported hardware implementation that accelerate~ the arithmetic Asian option pricing algorithm. By exploiting efficient Gaussian random number generators, massive parallelism and highly pipelined datapath, our FPGA implementation is faster than a comparable multi-threaded software implementation by 313 times, and it is faster than a CUDA based GPU implementation by 2.2 times. The maximum power consumption of the FPGA is also estimated to be much lower than those for the CPU and the GPU. This performance improvement in speed and in power consumption offers a practical solution to financial institutions to reduce option pricing time and costs, with considerable energy savings.
The presented hardware architecture demonstrates a practical reference design for high performance pathdependent financial simulation. With a slight modification of the path simulation core, the hardware architecture can be used for pricing any exotic and pathdependent financial options including lookback options and barrier o~tions.
Future wori includes the design of a distributed financial computation framework in a heterogeneous cluster 30. with FPGAs, GPUs and CPUs. The automated generation of the proposed efficient architectures will also be investigated.
