Abstract. Monte Carlo simulation of weak approximations of stochastic differential equations constitutes an intensive computational task. In applications such as finance, for instance, to achieve "real time" execution, as often required, one needs highly efficient implementations of the multi-point distributed random number generator underlying the simulations. In this paper a fast and flexible dedicated hardware solution on a field programmable gate array is presented. A comparative performance analysis between a software-only and the proposed hardware solution demonstrates that the hardware solution is bottleneck-free, retains the flexibility of the software solution and significantly increases the computational efficiency. Moreover, simulations in applications such as economics, insurance, physics, population dynamics, epidemiology, structural mechanics, chemistry and biotechnology can benefit from the obtained speedups.
Introduction
In many applied sciences the dynamics of key quantities can be described by stochastic differential equations (SDEs). In finance, for instance, the evolution of security prices underlying derivative contracts is described by SDEs. To price a derivative security it is necessary to estimate the expectation of a function, the payoff, of the solution of the underlying SDE at a given maturity date. A widespread method and the only feasible one when the underlying securities follow a high dimensional SDE, is the Monte Carlo method. As explained in (5) , to price an option via Monte Carlo simulation one uses a weak approximation of the underlying SDE. Here only an approximation of the probability distribution is needed, while a pathwise strong approximation is relevant for other problems, such as scenario simulation or filtering.
Another important application of Monte Carlo simulation of weak approximations of SDEs is the solution of multi-dimensional partial differential equations (PDEs). In mathematical physics, for instance, multi-dimensional PDEs arise in many models. A feasible approach to their solution is the Monte Carlo simulation via their probabilistic representation, see (8) .
In a Monte Carlo simulation of a weak Taylor scheme, it is possible to replace the Gaussian random variables, approximating the increments of Wiener processes, with much simpler multi-point distributed random variables that match certain moments. For instance, in a simplified Euler scheme one can use twopoint distributed random variables that match the first three moments of the Wiener process increments. For a second order weak Taylor scheme one needs three-point distributed random variables. Such multi-point distributed random variables can be generated based on random bits, i.e. random variables with only the two possible values 0 and 1, each with probability 0.5.
The main reason for replacing the Gaussian random variables with variables based on random bits is simulation speed. Typical financial simulations can take hours or days to run even on powerful servers, thus making "real-time" evaluation unfeasible. In (1) it was shown that random bit generators (RBGs) significantly increase the computational efficiency of simplified weak Taylor schemes. However, in many applications a further speedup is required. Therefore, in this paper we present a dedicated hardware solution, based on a field programmable gate array (FPGA), for the efficient generation of random bits and the associated multi-point distributed random numbers. The choice of an FPGA as a dedicated hardware solution is mainly due to its flexibility, allowing the user to program different RBGs according to the order of convergence of the weak Taylor scheme employed. Moreover, the "randomness" of the random bits, which is crucial for an effective Monte Carlo simulation, is strictly related to the order and the coefficients of the underlying polynomial (9) . On the FPGA, the user is free to program the most suitable polynomial for any given application. This paper provides a twofold contribution to the literature, namely in the areas of Monte Carlo simulation of weak Taylor schemes and random number generation. In the following, we address the main works related to both areas. In (1) simplified weak Taylor schemes up to weak order two based on RBGs are proposed and a study on the efficiency of a related software implementation is reported. (3) provides a four-point distributed random variable for a simplified weak Taylor scheme of weak order three. However, such a four-point distributed random variable cannot be efficiently implemented via RBGs. Instead, in this paper we propose a five-point distributed random variable suitable for an efficient implementation of a simplified weak Taylor scheme of weak order three based on RBGs. In finalizing this paper the authors learned that the idea of using software-based RBGs for the simulation of simplified weak Taylor schemes and the proposal of a five-point distributed random variable have been independently suggested in (8) . However, no analysis of the computational efficiency nor implementation design are presented in (8) , whereas they are the objectives of this paper.
Random number generators (RNGs) can be divided into the two categories of true and pseudo random number generators. The former are based on "true" random physical phenomena while the latter are based on deterministic numerical algorithms. The main advantage of true RNGs is that the generated random numbers are independent and thus impossible to predict. This proves to be a crucial factor in applications such as encryption. On the other hand, the non reproducibility of the generated sequence makes it difficult to asses its statistical properties (9) . Moreover, the generation speed is unlikely to rival that of the fastest pseudo RNGs. In (6) an FPGA true RNG is proposed based on the jitter of a clock signal. The generation speed reported of 0.5 Mbit/s is far less than that achievable by a pseudo RNG. For these reasons, in Monte Carlo simulations pseudo RNGs are usually preferred to true RNGs.
In a software implementation on a modern personal computer (PC), the generation of a random number from a pseudo RNG can take as little time as a few nanoseconds. A dedicated hardware solution can certainly provide faster generation, but requires careful, bottleneck-free system design to prove really useful for the overall application. In (10), a hardware design for a pseudo RNG is presented but based on outdated discrete logic. In (12) , an FPGA implementation of pseudo RNGs is proposed but specifically for cryptographical applications. In (7), detailed results are presented for FPGA implementations of various pseudo RNGs which could also be used in Monte Carlo simulations. However, none of the above papers presents system-level integration or discusses system-level performance. In the current paper, instead, we propose a fast generator of multi-point distributed random numbers on an FPGA and describe its system performance in a PC architecture. The proposed approach has been tested over a wide variety of parameters, including different multi-point random variables and corresponding weak Taylor schemes, proving capable of achieving speedups of up to ten times with respect to an optimized software-only implementation.
Weak Taylor Schemes
Although the results presented in this section can be extended to multi-dimensional SDEs with time dependent coefficients, let us consider, for simplicity, the SDE 
We say that a discrete time approximation Y ∆ converges weakly to X at time T with order γ if for each polynomial g there exists a positive constant K, which does not depend on ∆, and a ∆ 0 > 0 such that
The first method that we consider for the approximation is the well-known Euler scheme given by
3) 
yielding the simplified Euler scheme
Since the two-point distributed random variables ∆ W 2 n match the first three moments of the Gaussian random variables ∆W n , the simplified Euler scheme (2.5) still achieves an order of weak convergence γ = 1.0.
When high accuracy is required it is important to be able to construct approximations with higher orders of weak convergence. As shown in (4), if we add more terms from the Wagner-Platen expansion to the Euler scheme (2.3), then we obtain the order 2.0 weak Taylor scheme 6) where ∆Z n represents the double Itô integral
For the sake of simplicity, in equation (2.6) and in the following we suppress the dependence of the coefficients on the numerical approximation Y n from the notation, meaning, for instance, we write a for a(Y n ), where n ∈ {0, 1, 2 . . . , N − 1}. In this case we can replace the Gaussian random variables ∆W n and ∆Z n by expressions involving the three-point distributed random variables ∆ W 3 n , where
to obtain the second order simplified method
Since the multi-point distributed random variables appearing in scheme (2.8) match the first five moments of those appearing in (2.6), the method (2.8) still achieves an order of weak convergence γ = 2.0.
By adding more terms from the Wagner-Platen expansion and approximating the arising multiple stochastic integrals with Gaussian random variables, we obtain the following order 3.0 weak scheme given by
where
This scheme achieves an order of weak convergence γ = 3.0.
To construct a third order simplified method, the required multi-point distributed random variables need to match, in general, the first seven moments of the Gaussian ones. In (3) a corresponding four-point distributed random variable ∆ W 4 n was proposed, where
However, the four-point distributed random variable appearing in (2.11) cannot be efficiently implemented by the method based on random bit generation described below because the probability values in (2.11) are not rational numbers.
Instead, we present here a five-point distributed random variable ∆ W 5 n , with
that still matches the first seven moments and is suitable for a highly efficient implementation based on RBGs. Therefore, we can present the third order simplified method For any given weak Taylor scheme it is possible to replace the Gaussian random variables by simpler multi-point distributed ones. As explained in (4), if the multi-point distributed random variables match the first 2γ + 1 moments of the multiple stochastic integrals in a weak Taylor scheme of order γ, then the resulting simplified Taylor scheme achieves the same order γ of weak convergence. For higher order schemes, however, the complexity of multi-point distributed random variables that match the required 2γ + 1 moments increases. For this reason, we consider here up to third order simplified methods, with (2.13) being the only third order scheme implemented.
Furthermore, the multi-point distributed random variables and the corresponding RBGs to be presented can be applied to any weak scheme, including derivative free and implicit schemes. For instance, the two-point distributed random variables ∆ W 2 n could be used in the first order simplified predictor-corrector method with corrector
and predictorȲ
The simplified predictor-corrector scheme (2.14)-(2.15) achieves an order of weak convergence γ = 1.0 and shows, in general, better numerical stability properties than the Euler scheme (2.5).
Moreover, simulations in economics, insurance, physics, population dynamics, epidemiology, structural mechanics, chemistry and biotechnology for models specified via SDEs can greatly benefit from the above methods.
Multi-point Random Variables and Random Bit Generators
In (1) a highly efficient software implementation of simplified schemes based on RBGs has been proposed. The two-point distributed random variables ∆ W 2 n , which constitute the core of the simplified Euler scheme (2.5), can be efficiently obtained with a single bit from the RBG. This is an algorithm that generates a bit 0 or 1 with probability 0.5. Random bits can be obtained via the so-called shift register generator. This generator, used in digital communication (2) , relies on the theory of primitive polynomials modulo 2. These are special polynomials of the form
with coefficients c i = {0, 1}. A primitive polynomial modulo 2 of order n defines a recurrence relation for obtaining a new bit from the n preceding ones with maximal period, which is 2 n − 1. The recurrence is given by:
where a k+1 is the new bit obtained from the preceding ones, a i , with i ∈ {n, . . . , 1} and k > n. Equation (3.17) can be rewritten as
where ⊕ is the "exclusive or" operator. Thus, RBGs can be efficiently implemented in C via bitwise operations, see (11) and (1) . In (1) a Monte Carlo simulation of an option pricing problem using the simplified first order scheme (2.5) with the RBG (3.18), provided a speed up of about 28 times when compared to a first order scheme based on Gaussian random variables.
For a first order simplified scheme (2.5), each bit obtained from the RBG is used to generate a value for the two-point distributed random variable by a simple lookup operation (0 → + √ ∆, 1 → − √ ∆). For a second order simplified scheme (2.8),
one bit is not sufficient to generate a value for the required three-point distributed random variable, ∆ W 3 n . However, a sequence of three generated random bits is first used to obtain eight equiprobable values. Then, with an acceptance-rejection method we discard two of them, use four to generate the 0 value for the random variable and use one each for values + √ 3∆ and − √ 3∆. For the third order simplified scheme (2.13), the random variable is five-point distributed with the probability distribution (2.12). In this case, a sequence of five random bits is used to generate 32 equiprobable combinations. The acceptance-rejection method discards two of them, uses ten to generate the 0 value, nine each for values + √ ∆ and − √ ∆, and one each for values + √ 6∆ and − √ 6∆.
4 System Architecture and FPGA Implementation
System Architecture
Our ultimate goal is to substantially speed up the above described Monte-Carlo simulations by moving the random number generation from software to a dedicated hardware platform. More precisely, we aim to move the whole generation of multi-point distributed random numbers from the host processor to a dedicated hardware unit. Since the percentage of time typically taken by the generation of such numbers can be as high as 60-70% of the total execution time (1), the above appears to be a promising approach. However, there exist critical performance challenges at the system level. As the typical generation time for a software implementation can be as short as a few nanoseconds per number, the dedicated hardware solution must avoid any system-level bottlenecks to prove competitive.
Our basic idea for a PC environment is that of implementing the hardware "accelerator" as a daughter board on the PCI (peripheral component interconnect) bus, Revision 2.2 (13). The daughter board hosts the RNG, not to be confused with the RBG which is just a part of it. The RNG is implemented on an FPGA and returns the generated numbers to the simulation software through the PCI bus. Fig. 4 .1 shows our proposed system architecture for a PC platform. We can divide the system's operations in four phases. In phase 1, the FPGA generates a set of random numbers (in a T FPGA average time per number). In phase 2, the FPGA transfers such a set in a compact, combinatorially encoded format to the The combinatorial encoding works as follows. Any given multi-point distributed random variable has a small finite set of n possible values, with each value typically represented as a 32-bit floating point datum. We can encode each value by combinatorial encoding with [log 2 n] bits, where [a] denotes the smallest integer greater than or equal to a. Accordingly, the amount of random numbers that we are able to pack and transfer in a single PCI data phase (32 bits of data over 30 ns) is much larger than that possible with the native floating point representation (only one number per data phase). In phase 3, the host processor is ready to serve the requests for random numbers from the simulation software. At each request, the host processor decodes one encoded number and returns it to the caller (in a T dec average time per number). In phase 4, the host processor uses the random numbers (in a T use average time per number). In this way, the simulation software sees the system through the same function interface of a conventional softwareonly implementation and requires no further modifications. More importantly, we coordinate these phases into a pipeline so that the simulation software uses the current set of generated random numbers (phase 4) while the FPGA concurrently produces a new set (phase 1), thus obtaining a significant speedup. Fig. 4 .2 shows how the various phases occur with respect to time. It can be seen that if T FPGA is less than T use , then phase 1 is completely hidden by phase 4 and thus adds no time to the total execution time. With a more aggressive implementation, also phases 2 and 3 could have been considered for pipelining with other phases. In particular, phase 2 could be overlapped with phase 1 by means of a double-buffer implementation on the FPGA. At its turn, phase 3 could be overlapped with phases 1 and 2 by a double-buffer implementation in the host memory. Note that phase 3 cannot overlap with phase 4 as they both require the same resource, the host processor. It can be shown that such changes could result in hiding T comm completely in the overall execution time. On the other hand, T dec will instead increase due to the increased complexity of a multiple-buffer implementation, thus compromising the speedup. For this reason, we decided to limit pipelining to the two main phases, 1 and 4.
The complete time models for the simulation are given in the following. First, we Phase Time T_FPGA (1) T_comm (1) T_dec (1) T_use (1) can define T gen as the average time spent for generating a multi-point distributed random number and T use as the average time spent by the rest of the simulation software in using it. If generation and use are sequential, we can write:
where T exe is the average total execution time per number.
In the case of a conventional software implementation the above model holds and T gen = T gen SW is the time taken by the execution of a function that generates and returns one multi-point distributed random number to the caller.
With our system, instead, the simulation software uses the current set of random numbers while the FPGA concurrently generates a new set. In this way, if the generation time on the FPGA, T FPGA , is shorter than the use time, T use , the former does not add up to the total execution time. Such a constraint was largely satisfied in all our experiments. Hence, (4.19) still holds with T gen = T gen HW given by:
FPGA Implementation
A fast and flexible implementation of the RNG is the main requirement in this application. FPGAs enjoy several features such as quasi-ASIC (application specific integrated circuit) speed and programmer-level flexibility, which makes them the most suitable option for the hardware platform. Accordingly, we have chosen to implement our generator on a high-performance FPGA, the Altera Stratix EP1S1OB672C6. Simulation tools for this device are available in the Altera Quartus II development environment. We have used the Web Edition Software Version 4.2 of such tools. Moreover, all the circuits for the FPGA have been developed in VHDL (Very High Speed Integrated Circuit Hardware Description Language). ) shows details of the generator. The generation of the random bits is performed by a shift register generator of programmable length equal to that of the generating polynomial. The "active" (i.e. non-null) coefficients can also be programmed by the user. The orders considered for the weak Taylor scheme range from γ = 1 to γ = 3, although higher orders can also be straightforwardly implemented. When the selected order is γ = 1, the RNG generates numbers sampled from a two-point distributed random variable with the probabilities described in (2.4). Each single bit generated by the shift register generator represents a valid encoded number. When the selected order is γ = 2, the RNG generates numbers sampled from a three-point distributed random variable with the probability distribution (2.7). In this case, a sequence of three generated random bits, X1:X3, is used to generate eight equiprobable combinations. As described in Section 3, the accept/group logic discards two of them, uses four to generate a 0 value for the random variable and uses one each for values + √ 3∆ and − √ 3∆. When the selected order is γ = 3, the random variable is five-point distributed with the probability distribution (2.12). In this case, a sequence of five random bits, X1:X5, is used to generate 32 equiprobable combinations. The accept/group logic discards two of them, uses ten to generate a 0 value, nine each for values + √ ∆ and − √ ∆, and one each for values + √ 6∆
and − √ 6∆. All combinatorial functions in the accept/group logic are optimised.
5 Experimental results and performance analysis Table 1 shows the main performance results of the proposed implementation for a polynomial order of 31 and the different weak Taylor scheme orders. F ck (in MHz) is the maximal clock frequency obtained for the RNG. T FPGA , the time (in ns) for generating one multi-point distributed random number, is computed as α * 10 3 /F ck . The α term accounts for the fact that some of the clock cycles generate a random number that should be rejected; such a factor is 8/6 for the three-point distributed random variable and 32/30 for the five-point distributed one. T use , the time spent by the application in using a random number, is measured on the option pricing problem reported in (1). From Table 1 , it is possible to see that the constraint of (4.20) is always easily satisfied. Although T use obviously depends on the application, its range will be similar for comparable Monte Carlo simulations. T comm is the average time for transferring one random number from the FIFO queue to the host memory over the PCI bus. This time increases proportionally to the size in bits of the encoded random numbers. Moreover, in some cases the data require extra-alignment bits to match the 32-bit PCI data size. For instance, this applies to the case of the 3-bit encoded numbers sampled from a five-point distributed random variable. In Table 1 , T comm is computed based on a transfer rate of 66 MB/s. However, there exist several implementations over the PCI bus which can almost saturate its peak rate of 132 MB/s; hence, even smaller values for T comm are achievable. Moreover, the upcoming PCI Express TM bus carries the potential to further decrease T comm by at least a factor of 4. Based on these parameters and thanks to our design choice of combinatorial encoding for the generated random numbers, we have proved herewith that data communication is not a performance bottleneck in our system. Moreover, we have implemented highly-optimised C macros to perform the decoding operation on the host side, thus also limiting T dec , the average time that the host processor takes to decode one encoded random number and return it to the requesting application. Table 2 shows the main performance results for a much higher polynomial order of 521. T use , T comm , and T dec are not influenced by the polynomial order. It can also be seen that the FPGA performance does not suffer from the increased polynomial length and in some cases even slightly exceeds that of the polynomial order 31. As the implementation still uses a very small fraction of the FPGA resources, we cannot see any practical upper bound on the choice of the polynomial length.
To provide a comparative analysis between software and hardware performance, we have implemented both software and hardware versions of the RNG for a comprehensive variety of parameters. In order for the performance comparison to be unbiased, we have implemented all software functions as highly speedoptimised C macros. The reference PC is a Mobile Pentium 4 2.0 GHz and the C compiler used is the Microsoft Visual Studio 6.0 with -O2 optimizations. In the following, the three dimensions of the polynomial order, number of non-null coefficients, and number of points of the random variable are discussed.
Polynomial order
A polynomial order n, for a primitive polynomial modulo 2, guarantees a period of 2 n − 1 for the generated random sequence. It is known that the accuracy of a simulation based on a pseudo-random sequence is compromised when the sequence length is substantial compared with the period of the RNG. In the light of this, high order polynomials should be preferred. However, in a software implementation one faces an increase in generation time when using high order polynomials, since they cannot be mapped onto a single primitive-type operand. Instead, the hardware implementation does not suffer from any predefined operand size. Fig.  5.4 shows the generation time, T gen , for the software and hardware implementations as a function of the polynomial order. For the software implementation, T gen remains approximately stable up to 63 and then starts to grow with the polynomial order. Yet, the time for the hardware implementation always remains constant. In our tests, even larger polynomial sizes did not introduce any further 
Number of non-null coefficients
The "randomness" of the random bits, which is crucial for an effective Monte Carlo simulation, is strictly related not only to the order of the generating polynomial but also to the choice of its (non-null) coefficients (9) . However, in a software implementation a programmer is tempted to use the polynomial with the smallest number of coefficients, as each introduces an additional computational load. Fig. 5 .5 shows that the software implementation suffers from a proportional delay. Again, the time instead remains constant for our hardware implementation as T FPGA remains less than T use in all cases of interest. 
Multi-point random variables
When high accuracy is required, higher orders of the weak Taylor schemes will eventually increase the computational efficiency, even though both the scheme and the multi-point distributed random variables are more complex. In any case, speeding up the computation of the random variables has a dramatic impact on the simulation time. Fig. 5.6 shows the generation time, T gen , for the software and hardware implementations as a function of the number of points in the multipoint distributed random variable, which refers to ∆ W .7) and (2.12), for a polynomial order of 31. Once again, the software time grows steadily, up to 80 ns per value for a five-point distributed random variable. The hardware time, instead, increases negligibly. Actually, the increase in T gen is due only to the larger size of the encoded random numbers. The size of the encoded random numbers grows as [log 2 n], where n is the number of points representing the possible values of the multi-point distributed variable, and this has an impact on the transfer time, T comm , and the decoding time, T dec , see Tables  1 and 2 . Fig. 5.7 shows that the trend is similar for a polynomial order of 521. Table 3 reports the speedups achieved with the proposed hardware solutions with respect to the optimised software implementation, when considering the multipoint distributed random variables ∆ W 2 n , ∆ W 3 n and ∆ W 5 n and polynomial orders of 31 and 521. S gen = T gen SW /T gen HW is the speedup between the generation in hardware and that in software. As explained in (4.20), T gen HW does not account for the generation time on the FPGA device, but it consists, rather, of communication and decoding times. The units responsible for such times are mainly the PCI bus and the host processor. While S gen is the main performance figure in our system, it is important to report also S FPGA = T gen SW /T FPGA , which is the speedup between the generation on the FPGA alone and that in software. This speedup is important to express the relative performance of the FPGA device and the host processor in the generation of multi-point distributed random variables in view of a possible transfer of the whole simulation to FPGAs. Table  3 shows that such a speedup is as high as 12.84 and could possibly increase by using FPGA development tools providing further optimizations. Table 4 reports the application speedup of the proposed hardware solutions with respect to the optimised software implementation S exe = T exe SW /T exe HW when an option pricing problem reported in (1) is considered. Table 4 shows that the overall application strongly benefits from the hardware acceleration, up to almost three times in some cases. This is due to the large percentage of the total execution time typically spent on the random number generation by the software. Moreover, the speedup increases with the order of the polynomial and also with the number of its non-null coefficients, which is not shown in the table. Therefore, these speedups become more significant in the case of high accuracy simulations.
Speedup
From Table 4 , it appears that the application speedup for a weak Taylor scheme of order γ = 2 is greater than that for order γ = 3. However, such a result is not general since the measured times and speedups can depend significantly on the compiler used. To verify that, we measured T use also with another compiler, the Mingw port of GCC. Here we obtained 14, 47 and 57 ns per random number for a weak Taylor scheme of order 1, 2, and 3, respectively. Such times, when compared to those obtained by the Microsoft compiler and reported in tables 1 and 2, seem to be in better proportion with the complexity of the operations in (2.5), (2.8) and (2.13). With such times, the application speedup for the weak Taylor scheme of order γ = 2 is equivalent to that of order γ = 3.
Conclusion
In this paper we have presented a dedicated hardware solution on an FPGA for the generation of multi-point distributed random variables for use in a PC environment. The proposed solution uses a high-performance FPGA for fast and flexible generation of the random bits and the associated multi-point distributed random numbers. Moreover, the random numbers are transferred to the host processor in an encoded format with PCI burst cycles for maximum communication efficiency. Thanks to this bottleneck-free system design, the proposed solution achieves relevant speedups in generating the random numbers for the application when compared to an optimised software-only solution, ranging from 1.4 up to about 10 times. As the typical percentage of time employed by random number generation in Monte Carlo simulations is relevant, this speedup, in turn, provides a significant improvement on the efficiency of the overall Monte Carlo simulation.
In this paper, we reported a case of a finance application where the speedup on the total simulation time reaches 2.93. Moreover, when high accuracy in the solution is required, thus involving higher order schemes and larger polynomial orders, the speedup on the total simulation time proves even greater. It is important to note that other applications such as in economics, insurance, physics, population dynamics, epidemiology, structural mechanics, chemistry and biotechnology, with models specified via SDEs and solved by Monte Carlo simulations, can benefit greatly from the proposed solution.
