Abstract: We present a novel reconfigurable hardware architecture for accelerating American option pricing using the Binomial Lattice algorithm. The architecture provides double precision floating point pricing, evaluating up to N = 64,000 time steps in the binomial lattice. Advanced memory management techniques and optimized control logic allow for 4-way parallelism on a single-asset evaluation. These techniques achieve a 73× speedup over an optimized CPU implementation, and a considerable improvement over the best previous reconfigurable hardware implementation. A significant advantage of our approach is that the speedup is on a per-asset basis, whereas all previous approaches on FPGA and GPU architectures achieve their speedup by evaluating many assets in parallel.
I. INTRODUCTION
This research addresses the acceleration of American option pricing. American options are interesting from a mathematical perspective because they have no closed form solution [1]. This makes them difficult to price both quickly and accurately. The time delay between changes in the underlying stock value and the price of the American option creates a potential arbitrage opportunity. Low-latency, accurate pricing can give a significant competitive advantage to arbitrage-based fast trading strategies.
Typical algorithms discretize the underlying continuous pricing equations into N time steps, allowing for a tradeoff between computation time and precision through the choice of N. Real-life option evaluation requires N = 10,000 to 50,000 time steps for adequate precision. For a 2-year option, this corresponds to a discrete point roughly every 15 minutes. All known previous work is severely limited in this respect, reporting a maximum of N = 1,024 time steps [2], [3]. This reduced precision is due to fundamental architecture and platform limitations of previous designs.
This paper contributes the first high-precision, accelerated architecture for pricing American options. The architecture provides double precision floating point pricing, evaluating up to N = 64,000 time steps in the binomial lattice. Advanced memory management techniques and optimized control logic allow for 4-way parallelism on a single-asset evaluation. These techniques achieve a 73× speedup over an optimized CPU implementation, and a considerable improvement over the best previous reconfigurable hardware implementation. A significant advantage of our approach is that the speedup is on a per-asset basis, whereas all previous approaches on FPGA and GPU architectures achieve their speedup by evaluating many assets in parallel.
The outline of this paper is as follows: Section II presents the algorithm for pricing the American option. Section III presents our acceleration approach on reconfigurable hardware. Section IV presents implementation details for architecture, memory management, and control strategies. Section V presents results and compares against previous work. Section VI presents conclusions and areas for further research.
II. AMERICAN OPTION PRICING
An American option is a financial instrument in which the owner has the right but not the obligation to buy (call option) or sell (put option) a stock for a specified price K (strike price) at time T (expiry). Without loss of generality, we will only consider the put option.
The value of the option depends on the price of the underlying stock. In a continuous model, the value of a stock in the risk neutral world is modeled as a geometric Brownian motion (1) with fixed drift r and volatility σ².
European options can only be exercised at the maturity T, whereas American options can be exercised at any time up to T. The European option can be priced directly using the Black-Scholes formula [4]. American options are more difficult to price. Pricing requires calculating an optimal exercise strategy, the stock price (at any given time) above which the option is exercised [1].
A. Recombining Binomial Lattice
Rather than a continuous model, the Binomial Lattice approach models the underlying stock price as a discrete binomial lattice. At each time step the stock price goes up by a factor of λ+, or down by a factor of λ− (2).
The lattice is indexed by time t and index i. The value of exercising the option is computed using the logarithm (3) to improve numerical stability. The value of holding (4) is the discounted expected value of the future nodes. The value of the option is the maximum of exercising and holding (5).
Lattice nodes are evaluated backwards in time, starting from the boundary condition at expiry T (6). The final option value is v_opt(0, 0).
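For orientation, the recursion described above can be written in the following standard textbook form. This is a reconstruction under stated assumptions, not necessarily the exact form of (2) through (6): the index i is assumed to count up moves, p is an assumed risk-neutral up probability, Δt is the length of one time step, and t = T denotes the final step.

```latex
\begin{align*}
\ln S(t,i)     &= \ln S_0 + i \ln\lambda_+ + (t-i)\ln\lambda_- \\
v_{ex}(t,i)    &= \max\bigl(K - e^{\ln S(t,i)},\; 0\bigr) \\
v_{hold}(t,i)  &= e^{-r\Delta t}\bigl[\,p\,v_{opt}(t+1,i+1) + (1-p)\,v_{opt}(t+1,i)\bigr] \\
v_{opt}(t,i)   &= \max\bigl(v_{ex}(t,i),\; v_{hold}(t,i)\bigr) \\
v_{opt}(T,i)   &= v_{ex}(T,i)
\end{align*}
```

With the two discounted weights precomputed, each holding value costs two multiplies and one addition, and the option value costs one compare.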
The risk-neutral model is calibrated using real-world stock drift μ_R and volatility σ_R². Constraining the risk neutral variables (7) makes it possible to cache the exercise values v_ex at the start of each pricing run. The risk-neutral binomial lattice parameters are then given by (8)-(10). Refer to [1] for algorithm details.
This Binomial Lattice algorithm is well suited for implementation with reconfigurable hardware. The computation of v_opt at each node requires only two multiplies, an addition, and a compare. The algorithm requires O(N²) node evaluations and 3N words of memory (including the v_ex cache).
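A minimal software sketch of this backward induction may help make the operation count and the 3N storage concrete. It is not the paper's implementation: the Cox-Ross-Rubinstein parameters below are an assumed stand-in for the calibration in (7)-(10), `sigma` is assumed to be the annualized standard deviation of returns, and the function name and arguments are hypothetical.

```python
import math


def american_put_binomial(s0, k, r, sigma, t_expiry, n):
    """Price an American put on an n-step recombining binomial lattice.

    Assumption: Cox-Ross-Rubinstein parameters (up * down == 1) stand in
    for the paper's own lattice calibration, which is not reproduced here.
    """
    dt = t_expiry / n
    up = math.exp(sigma * math.sqrt(dt))       # lambda_plus (assumed CRR)
    down = 1.0 / up                            # lambda_minus (assumed CRR)
    p = (math.exp(r * dt) - down) / (up - down)
    disc = math.exp(-r * dt)
    a = disc * p                               # weight on the up node
    b = disc * (1.0 - p)                       # weight on the down node

    # Because up * down == 1, the stock price depends only on j = 2*i - t,
    # so 2n + 1 exercise values can be cached once, computed in log space.
    log_s0, log_up = math.log(s0), math.log(up)
    ex_cache = [max(k - math.exp(log_s0 + (j - n) * log_up), 0.0)
                for j in range(2 * n + 1)]

    # Boundary condition at expiry: option value equals exercise value.
    v = [ex_cache[2 * i] for i in range(n + 1)]

    # Backward induction, evaluated in place over a single array of n + 1 words.
    for t in range(n - 1, -1, -1):
        offset = n - t
        for i in range(t + 1):
            hold = a * v[i + 1] + b * v[i]           # two multiplies, one add
            v[i] = max(ex_cache[2 * i + offset], hold)  # one compare
    return v[0]


if __name__ == "__main__":
    print(american_put_binomial(100.0, 100.0, 0.05, 0.2, 1.0, 500))
```

Under these assumptions the working set is roughly N words for the option values plus 2N words for the exercise cache, which is one way to arrive at a 3N storage figure.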
B. Monte Carlo Simulation
Monte Carlo methods can be used to price European and American options. Pricing is straightforward for European options. Random paths are generated using geometric Brownian motion (1), and each path is evaluated at expiry T (11). The final price is the expected value of v_opt over all sampled paths.
Monte Carlo European option pricing is very fast on reconfigurable hardware, with reported speedups of 146× [5], 250× [6], and 340× [7]. For the American option, Monte Carlo methods are complicated by the need to find an optimal exercise boundary. This requires either sorting or regression [1], which consumes significant configurable chip resources [8], [9], [10]. Performing sorting or regression off-chip introduces a communication bottleneck that quickly overwhelms any acceleration gains.
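The European case can be sketched in a few lines. The example below is illustrative only: it assumes a put payoff, samples the terminal price of the geometric Brownian motion directly instead of stepping along the path, discounts the mean payoff at the risk-free rate, and treats `sigma` as the annualized standard deviation. The function names are hypothetical, and the Black-Scholes closed form is included only as a sanity check.

```python
import math
import random


def mc_european_put(s0, k, r, sigma, t_expiry, n_paths, seed=0):
    """Estimate a European put price by sampling terminal stock prices."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t_expiry
    vol = sigma * math.sqrt(t_expiry)
    total = 0.0
    for _ in range(n_paths):
        # Exact terminal sample of geometric Brownian motion.
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(k - s_t, 0.0)            # put payoff at expiry
    return math.exp(-r * t_expiry) * total / n_paths


def bs_european_put(s0, k, r, sigma, t_expiry):
    """Black-Scholes closed-form price of a European put."""
    sqrt_t = math.sqrt(t_expiry)
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t_expiry) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return k * math.exp(-r * t_expiry) * cdf(-d2) - s0 * cdf(-d1)


if __name__ == "__main__":
    print(mc_european_put(100.0, 100.0, 0.05, 0.2, 1.0, 100_000, seed=1))
    print(bs_european_put(100.0, 100.0, 0.05, 0.2, 1.0))
```

Each sampled path is independent of the others, which is why this form parallelizes so readily. The estimate still varies with the random seed, which is the stochastic behavior discussed later in this section.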
XtremeData's 250× European pricing architecture [6] is highly scalable, since path evaluations are completely independent. American option pricing eliminates this independence, as all paths need to be computed incrementally and simultaneously to estimate the optimal exercise boundary.
Beyond these technical hurdles, Monte Carlo pricing is stochastic, rather than deterministic. The risk of a statistically anomalous result has significant implications when the results are the basis for a trading strategy. For these reasons, the Monte Carlo method is not further considered.
C. Finite Difference Methods
These methods solve the American option pricing equation as a free boundary differential equation. The solution is arrived at directly, through finite differencing. These algorithms are typically highly sequential in nature, and therefore poorly suited for hardware acceleration. Additionally, differencing techniques have the potential to introduce numerical instabilities in certain situations [1], which makes them less desirable for automated trading applications.
III. HARDWARE SELECTION
Graphics Processing Units (GPUs) are highly parallel chips originally designed for graphics acceleration. NVidia's CUDA programming language makes GPUs accessible for general-purpose high performance computing. NVidia's Tesla S1070 GPU platform contains up to 120 GPU blocks, with 8 processing cores and 1 double precision floating point unit per block [11]. Each block has 64 KB of local memory, with access to a much larger shared memory. Speedup is achieved by formulating the problem into a number of threads, which are then dispatched to the cores in parallel.
Field Programmable Gate Arrays (FPGAs) are reconfigurable logic chips. FPGAs provide four mechanisms for algorithm acceleration. Logic elements can be pipelined, allowing multiple arithmetic operations to occur during a single cycle. Multiple copies of the pipeline can operate in parallel on a single chip. Local memory is large and highly configurable, reducing stalls from cache misses. Lastly, control logic can be implemented separately from the arithmetic pipeline, allowing for full pipeline utilization.
This project uses FPGAs for algorithm acceleration. Altera's Stratix III EP3SE260 FPGA has sufficient local memory to evaluate binomial lattices of up to N = 64,000 time steps, whereas the S1070 GPU can accommodate at most N = 2,730 time steps in local memory. Larger N requires the use of GPU shared memory, which incurs significant synchronization penalties due to pipeline stalls.
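The memory limits quoted above can be checked with a short calculation. The sketch assumes that the GPU local memory figure denotes 65,536 bytes per block, that every value is a 64-bit double, and that the 3N storage requirement from Section II-A determines the limit. These are assumptions made for illustration. Under them, the arithmetic reproduces the N = 2,730 figure.

```python
BYTES_PER_DOUBLE = 8
WORDS_PER_STEP = 3  # 3N words of storage, including the exercise-value cache


def max_steps(local_bytes):
    """Largest N whose 3N double-precision words fit in local_bytes."""
    return local_bytes // (WORDS_PER_STEP * BYTES_PER_DOUBLE)


def lattice_bits(n_steps):
    """On-chip storage, in bits, needed for an n_steps lattice."""
    return n_steps * WORDS_PER_STEP * BYTES_PER_DOUBLE * 8


print(max_steps(64 * 1024))   # → 2730
print(lattice_bits(64_000))   # → 12288000
```

By the same assumptions, an N = 64,000 lattice needs roughly 12.3 Mbit of on-chip storage.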
IV. IMPLEMENTATION
The American Put Binomial Lattice pricing algorithm is implemented entirely in double-precision floating point representation (IEEE 754-1985). The architecture performs single asset pricing with up to N = 64,000 time steps. The algorithm is written exclusively in Verilog, with Altera IP Cores performing floating point operations.
The architecture is optimized and implemented for the Stratix III EP3SE260F1152C2 FPGA featured on Terasic's DE3 development board. The implementation is verified using ModelSim v6.3g simulation, and compiled using Altera's Quartus v9.0 software. 
A. Architecture
Fig. 1 shows the top-level architecture. First, v_ex values are computed and cached (Fig. 2). Next, the binomial lattice is evaluated (Fig. 3). The lattice evaluation pipeline is based on previous work by Jin [3].
B. Memory
Each node evaluation requires two reads and a write, all in double precision floating point (64 bits). With 4 replications running at 150 MHz, this amounts to 76.8 Gbits/sec of reads and 38.4 Gbits/sec of writes. This high-bandwidth memory access requires the use of on-chip memory resources.
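The bandwidth figures follow from a one-line product. The sketch below assumes each replication completes one node evaluation per clock cycle. The function name and parameters are hypothetical.

```python
def memory_bandwidth_gbits(replications, clock_hz, accesses_per_node, bits_per_word=64):
    """Sustained memory traffic in Gbits/sec, one node per replication per cycle."""
    return replications * clock_hz * accesses_per_node * bits_per_word / 1e9


read_gbits = memory_bandwidth_gbits(4, 150e6, accesses_per_node=2)
write_gbits = memory_bandwidth_gbits(4, 150e6, accesses_per_node=1)
print(read_gbits)    # → 76.8
print(write_gbits)   # → 38.4
```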
Memory is organized into separate banks, with dedicated state machines generating read and write addresses for each memory bank. Node evaluation is performed for a single t, for 4 adjacent values of i. This adjacency simplifies address generation, and allows the v_opt(t, i + 1) value needed to calculate v_hold(t, i) to be reused as the v_opt(t, i) input when evaluating v_hold at the next index, i + 1. Otherwise, node evaluation would require 3 reads per cycle [3]. The algorithm is evaluated in place; write addresses are the same as read addresses, delayed by 40 cycles to account for pipeline latency.
This precise memory management avoids stalls that would otherwise occur in a CPU or GPU implementation, allowing for efficient parallel computation.
Fig. 4: Since we evaluate across four replications, our address ranges from 0 to 1000. The lattice is evaluated backwards in time from t = 4000 to t = 0. Runtime is proportional to the area under the line. The solid line shows the naive indexing. The tail on the far right is necessary due to data dependencies. The dashed line shows an optimized maximum address required, when nodes above the strike price are neglected. When the v_ex optimization is used, all addresses under the dotted line can be neglected. These two optimizations significantly reduce the area, and thus the computational complexity of analyzing the binomial lattice.
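The area saved by neglecting nodes above the strike price can be estimated by counting lattice nodes directly. The sketch below is one reading of that optimization, under assumptions not stated in the text: K = S_0, a symmetric lattice in which an up move exactly cancels a down move, and i counting up moves. It counts the nodes whose lowest reachable node at expiry is still at or above the strike, so their put value is zero. Under these assumptions the count approaches the N²/8 saving quoted in the Control section, which is about a quarter of the roughly N²/2 total nodes.

```python
def total_nodes(n):
    """Number of nodes in an n-step recombining lattice."""
    return (n + 1) * (n + 2) // 2


def nodes_above_strike(n):
    """Count nodes that can never reach a price below K when K == S_0.

    Assumes a symmetric lattice, so the price depends only on the net
    number of up moves (ups minus downs).
    """
    count = 0
    for t in range(n + 1):
        for i in range(t + 1):                 # i = number of up moves so far
            # Net up moves at expiry if every remaining move is down.
            lowest_terminal = (2 * i - t) - (n - t)
            if lowest_terminal >= 0:
                count += 1
    return count


n = 1000
print(nodes_above_strike(n), n * n // 8, total_nodes(n))
```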
C. Control
The node evaluation pipeline has a 40 cycle delay between reading v_opt(t, i), and reading valid data for v_opt(t − 1, i). The latency is managed by setting the minimum index i to be greater than the latency. The binomial lattice indexing is constructed such that non-existent nodes can be evaluated without impacting the final result. This stall imposes a small overhead (less than 1000 cycles), and is critical for enabling low-latency parallel pricing of a single option. This is seen as the flat tail in the far right of Fig. 4.
Two further control optimizations are possible to accelerate FPGA implementations. These are not implemented for this paper. First, observe that the price of all nodes above the original strike price is always zero. These can be left out of calculations without affecting results, reducing computation by N²/8 node evaluations if K = S_0 (Fig. 4). Second, observe that the exercise boundary is always nondecreasing with t [12]. When evaluating v_hold(t, i) for i below the optimal exercise value, we know that v_opt(t + 1, i) and v_opt(t + 1, i + 1) can be reduced to v_ex(t + 1, i) and v_ex(t + 1, i + 1). The v_opt only needs to be computed for values of i above the optimal exercise boundary. We can stop evaluating once we reach the optimal exercise boundary. This removes the area below the dotted line (Fig. 4). The speedup in standard C code is overwhelmed by the cost of the conditional branching introduced, but this could offer significant gains in an FPGA, where conditional branching could be tested in parallel to pipeline operation.
V. RESULTS
A. Performance
Table I shows compilation results for the Stratix III chip. Logic and DSP utilization is low, meaning there is room for more pipeline replications. Initial estimates suggest that we can increase pipeline replication from 4× to 32×.
Table II shows a comparison of results. Our CPU benchmark runs on a single core of an AMD Turion 64 bit 1.6 GHz dual-core processor. The benchmark is written in C, and compiled using 'gcc -march=k8 -mfpmath=sse'. The benchmark node evaluations per second decrease as N increases due to cache
B. Comparison
There are several previously published results for acceleration of American option binomial lattice pricing [2], [3]. Table II compares our work with these previous implementations. To the authors' knowledge, there has been no published work on hardware-accelerated Monte Carlo or finite difference methods for pricing American options.
Our implementation provides three improvements over previous work. First, it can evaluate lattices of up to N = 64,000 time steps, whereas previous work only reported up to N = 1,024 time steps [2], [3]. It is unclear how previous methods would scale to large N. Second, our results are computed entirely in double precision floating point, providing improved precision over previous GPU implementations. We compute exercise values using the logarithm (3) to provide improved numerical stability.
Lastly, previous implementations require large portfolios (up to several thousand options) to be priced simultaneously to achieve speedup. Our method accelerates single asset pricing, providing low-latency performance important for trading applications.
VI. CONCLUSION
This paper presents a scalable architecture for accelerating American option pricing using reconfigurable hardware, for binomial lattices with up to N = 64,000 time steps. We achieve a 73× speedup over a CPU benchmark, and a significant improvement over previous reconfigurable hardware implementations.
Future work will increase the pipeline replication from 4× to 32×, and implement control optimizations for additional speedup. Other future research will develop this into an industrial implementation, by including capability for dividend paying stocks and estimation of the Greek sensitivity parameters. One further improvement could be to develop an Ethernet based C-interface to allow the acceleration to be accessed as a network resource. It is envisioned that pricing arguments could be passed as a URL, and the FPGA, acting as a web server, would return the price. This would allow for data-center scaling of the resource.
