Abstract
Introduction
Monte Carlo (MC) methods [8] are very important in a number of application areas including aerospace engineering, bioinfomatics, molecular modelling and financial engineering. These methods are useful for obtaining numerical solutions to problems which are too complicated to solve analytically. Monte Carlo programs, due to their nature of consuming vast computing resources, have historically had to be executed upon the fastest computers available at the time, and employ the most advanced algorithms, implemented with effort invested in algorithmic optimisations and program efficiency. Large-scale Monte Carlo simulation typically require millions or even billions of random numbers for accurate simulations. The most expensive function in a typical Monte Carlo simulation is the PseudoRandom Number Generator (PRNG). For example, in a typical options pricing scenario using Black Scholes in the domain of financial engineering takes up about 40% of the resources [5] . Hence it is essential that the PRNG be of long period, have good statistical properties and be vectorisable.
The nature of MC simulations also place upon us mathematical restraints upon the sequence of the random numbers used for simulation. Key among these include: uncorrelated sequences, long period, uniformity and most importantly efficiency. The Mersenne twister is a is a twisted generalised feedback shift register pseudorandom number generator that address all these requirements. It is based on a matrix linear recurrence over a finite binary field F 2 and provides for fast generation of very high quality pseudorandom numbers. The Mersenne Twister [6] is called so because it's maximum period is related to the Mersenne Prime sequence. Specifically, the MT19937 PRNG has the same period as the 24 th Mersenne Prime (i.e. 2 19973 − 1). However, despite its algorithmic simplicity in comparison to other PRNGs, the sheer number of random numbers required for MC methods necessitate the use of efficient methods of generating random vectors.
Reconfigurable hardware in the form of FPGAs is an extremely powerful implementation approach for several reasons. First and foremost, it allows for truly parallel computations to take place in a circuit. With soft cores, dedicated logic, block multipliers and specialised versions available in modern FPGAs, they are being increasingly deployed in computationally intensive application areas involving the processing of massive volumes of data. The regular nature of the complex computations performed repeatedly within some application areas are well suited to a hardware based implementation using FPGAs. It is the aim of this paper to develop high throughput area efficient FPGA implementation of the MT19937 PRNG.
The composition of the rest of the paper is as follows. Related work is described in Section II. Mathematical and algorithmic details of the MT19937 are provided in Section III. A description of the proposed architecture is presented in Section IV. Implementation results and comparison with existing architectures and software implementations of the MT19937 are presented in section V. Concluding remarks are given in section VI.
Related Work
The MT was first published in [6] . MT is a variant of TGFSR modifies in order to allow a Mersenne prime period. The characteristic polynomial has many terms, and has good distribution upto v bits of accuracy for 1 ≤ v ≤ 32. The need for high performance implementations of the MT has been well recognised and a attempts at speeding up the MT in software have been made [3] . A limited number of hardware implementations of the MT have been made. In [9] an FPGA based implementation of the MT19937 has been presented using a three staged architecture. The three stages are seed generator, seed value modulator and output generation. However, there are no additional architectural details available. Additionally, the hardware platform used has not been specified; which renders it impossible to make full comparison of all performance metrics. In [7] a Mersenne Twister random number generator has been implemented using a Xilinx Virtex-5 LX50 device. This random number generator, described using Impulse C, uses ten parallel C-language processes that act as independent Mersenne Twister generators. After the initial testing had been performed, a series of iterative optimizations were performed, including an analysis of the generated pipeline structures. Recoding of the C-language statements to achieve a faster clock rate and more efficient hardware.
Mathematical and Algorithmic Background
In this section, the underpinnings of the MT19937 algorithm are described in mathematical terms, and a suitable algorithmic presentation in the form of a pseudocode.
Mathematical Basis of the MT19937
The primitivity test and k-distribution test that are needed in the parameter search for the MT19937 for a word x with bit width of w is expressed as the recurrence relation which is described as follows:
where: n is the degree of recurrence m is the middle word, or the number of parallel sequences, 1 ≤ m ≤ n u, l are Mersenne Twister tempering bit shifts and A is the twist transformation which is defined in rational normal form:
where: a is a coefficient of the rational normal form twist matrix; and where I n−1 as the (n − 1) × (n − 1) identity matrix where the bitwise XOR (⊕) operation replaces addition in the sum of products while implementing matrix multiplication. The rational normal form can be expressed in a more mathematically compact form as follows:
where
The maximum theoretical period of 2 nw−r − 1 can be achieved when Φ B (t) is a primitive polynomial and is a characteristic polynomial of:
and
where: r is the separation point of one word, or the number of bits of the lower bitmask, 0 ≤ r ≤ w − 1
The MT achieves equidistribution in n dimensions. In order to compensate for reduced dimensionality of equidistribution, tempering is performed as follows:
where: b, c are tempering bitmasks s, t are tempering bit shifts
The corresponding coefficients of MT19937 in line with the above and satisfying the restriction that 2 nw−r − 1 is a Mersenne prime are as follows: (w, n, m, r) = (32, 624, 397, 31) a = 9908B0DF 16 u = 11 (s, b) = (7, 9D2C568016) (t, c) = (15, EF C6000016) l = 18
MT19937 Pseudocode
The pseudocode of the MT19937 PRNG suitable for software implementation is presented in Codeblock 1. An implementation of this pseudocode using a .NET development platform on a PC equipped with a Core Duo processor and 1GB of system memory is also carried out to ensure algorithmic validity and to provide a point of reference with respect to the hardware implementation. 
Proposed Architecture for the MT19937 -Design and Evaluation
In this section, the high throughput architecture designed for the implementation of MT19937 is described.
Architecture Description
The three key functional blocks in the MT19937 architecture are the Initialise Generator (IG), Generate Numbers (GN) and Extract Number (EN). All three blocks are executed in parallel. To enable parallelisation without increasing hardware overhead, multiported block RAMs are used rather than arrays built from distributed logic. Additionally, the EN sub-block of the design is fully pipelined, allowing the combinational logic within each iteration of all three sub-blocks to execute concurrently. Common subexpression elimination in the design and the intelligent use of logic in place of the more expensive modulo operations enables us to further minimise the hardware complexity of the design.
The overall functional diagram of the proposed MT19937 architecture is as shown in Figure 1 . 
Implementation Details

Design Methodology and Hardware Platform
In order to verify the performance of the proposed architecture for MT19937, the design has beenperformed using Handel C and prototyped on the Celoxica RC1000 board containing the Xilinx XCV2000E FPGA. Available on chip logic resource include: 19200 Slices, 80 x 120 CLB Array, 655,360 bits Block RAM (BRAM) and 614,400 bits distributed RAM [2] . The RC1000 also has 4 memory banks which communicate with the host by means of DMA transfers [1] .
Performance Metrics
The proposed architecture operates at a maximum frequency of 22MHz. The design is the most compact FPGA implementation to date and occupies a minimal core footprint. Full details of all relevant performance metrics can be obtained form Table 1 . 
Comparison with Existing Architectures
Design parameters such as Time Complexity (TC), Area Complexity (AC) and I/O type of the proposed design with existing architectures are presented in Table 2 .
From the table, it can be clearly seen that the proposed architecture is much smaller than other existing implementations. Additionally, this compactness does not come at a performance cost. Through intelligent use of FPGA resources and design principles, high performance has been ensured. Since it is difficult to make meaningful direct comparisons of different platforms for throughput due to different device architectures and generational gaps, a term Throughput/MHz term has been introduced in order to normalise the amount of useful work done per system clock. It is clear from the table that the proposed architecture is the most efficient one in this regard. Additionally, it is worth mentioning that although the architecture in [7] has been implemented on the Virtex-5 which is four generations ahead of the Virtex-E FPGA that has been used in this paper, the work in [7] achieves only a 52% speedup which is primarily due to the increased clock speed achieved by the Virtex-5 due to architectural improvements. However, the inefficiency of this architecture is clearly demonstrated in the last row of Table 2 .
Conclusion
In this paper a high performance FPGA implementation of the MT19937 algorithm has been implemented. Block level parallelism and word level pipelining have been exploited to improve the performance. The proposed architecture has the smallest core footprint of all architectures in print and provides the maximum amount of performance per MHz of the platform used. This architecture is fully modularised and can be directly used as a PRNG for Monte Carlo methods. 14 N/A * The actual number of slices occupied on the Virtex-5 is 1600. However, the Virtex-5 has a 5 LUT per slice architectures unlike all previous generations of XIlinx FPGAs. Hence, a conversion factor is introduced to make the area occupied comparable. Throughput (T) is in Million numbers/second. T/MHz represents the Throughput per million clock cycles.
