High performance FPGA implementation of the mersenne twister by Chandrasekaran, S & Amira, A
High Performance FPGA implementation of the Mersenne Twister
Shrutisagar Chandrasekaran and Abbes Amira ∗
Electronic and Computer Engineering
Brunel University, West London, UK
∗ abbes.amira@brunel.ac.uk
Abstract
Efficient generation of random and pseudorandom se-
quences is of great importance to a number of applica-
tions [4]. In this paper, an efficient implementation of the
Mersenne Twister is presented. The proposed architecture
has the smallest footprint of all published architectures to
date and occupies only 330 FPGA slices. Partial pipelining
and sub-expression simplification has been used to improve
throughput per clock cycle. The proposed architecture is
implemented on an RC1000 FPGA Development platform
equipped with a Xilinx XCV2000E FPGA, and can gener-
ate 20 million 32 bit random numbers per second at a clock
rate of 24.234 MHz. A through performance analysis has
been performed, and it is observed that the proposed archi-
tecture clearly outperforms other existing implementations
in key comparable performance metrics.
1 Introduction
Monte Carlo (MC) methods [8] are very important in
a number of application areas including aerospace engi-
neering, bioinfomatics, molecular modelling and financial
engineering. These methods are useful for obtaining nu-
merical solutions to problems which are too complicated
to solve analytically. Monte Carlo programs, due to their
nature of consuming vast computing resources, have histor-
ically had to be executed upon the fastest computers avail-
able at the time, and employ the most advanced algorithms,
implemented with effort invested in algorithmic optimisa-
tions and program efficiency. Large-scale Monte Carlo sim-
ulation typically require millions or even billions of ran-
dom numbers for accurate simulations. The most expensive
function in a typical Monte Carlo simulation is the Pseudo-
Random Number Generator (PRNG). For example, in a typ-
ical options pricing scenario using Black Scholes in the do-
main of financial engineering takes up about 40% of the re-
sources [5]. Hence it is essential that the PRNG be of long
period, have good statistical properties and be vectorisable.
The nature of MC simulations also place upon us math-
ematical restraints upon the sequence of the random num-
bers used for simulation. Key among these include: un-
correlated sequences, long period, uniformity and most im-
portantly efficiency. The Mersenne twister is a is a twisted
generalised feedback shift register pseudorandom number
generator that address all these requirements. It is based
on a matrix linear recurrence over a finite binary field F2
and provides for fast generation of very high quality pseu-
dorandom numbers. The Mersenne Twister [6] is called
so because it’s maximum period is related to the Mersenne
Prime sequence. Specifically, the MT19937 PRNG has the
same period as the 24th Mersenne Prime (i.e. 219973 − 1).
However, despite its algorithmic simplicity in comparison
to other PRNGs, the sheer number of random numbers re-
quired for MC methods necessitate the use of efficient meth-
ods of generating random vectors.
Reconfigurable hardware in the form of FPGAs is an ex-
tremely powerful implementation approach for several rea-
sons. First and foremost, it allows for truly parallel compu-
tations to take place in a circuit. With soft cores, dedicated
logic, block multipliers and specialised versions available
in modern FPGAs, they are being increasingly deployed
in computationally intensive application areas involving the
processing of massive volumes of data. The regular nature
of the complex computations performed repeatedly within
some application areas are well suited to a hardware based
implementation using FPGAs. It is the aim of this paper to
develop high throughput area efficient FPGA implementa-
tion of the MT19937 PRNG.
The composition of the rest of the paper is as follows.
Related work is described in Section II. Mathematical and
algorithmic details of the MT19937 are provided in Section
III. A description of the proposed architecture is presented
in Section IV. Implementation results and comparison with
existing architectures and software implementations of the
MT19937 are presented in section V. Concluding remarks
are given in section VI.
4th IEEE International Symposium on Electronic Design, Test & Application
0-7695-3110-5/08 $25.00 © 2008 IEEE
DOI 10.1109/DELTA.2008.113
482
4th I  International Sy posiu  on lectronic esign, est  pplications
2 Related Work
The MT was first published in [6]. MT is a variant of
TGFSR modifies in order to allow a Mersenne prime period.
The characteristic polynomial has many terms, and has
good distribution upto v bits of accuracy for 1 ≤ v ≤ 32.
The need for high performance implementations of the MT
has been well recognised and a attempts at speeding up the
MT in software have been made [3]. A limited number of
hardware implementations of the MT have been made. In
[9] an FPGA based implementation of the MT19937 has
been presented using a three staged architecture. The three
stages are seed generator, seed value modulator and out-
put generation. However, there are no additional archi-
tectural details available. Additionally, the hardware plat-
form used has not been specified; which renders it impos-
sible to make full comparison of all performance metrics.
In [7] a Mersenne Twister random number generator has
been implemented using a Xilinx Virtex-5 LX50 device.
This random number generator, described using Impulse
C, uses ten parallel C-language processes that act as in-
dependent Mersenne Twister generators. After the initial
testing had been performed, a series of iterative optimiza-
tions were performed, including an analysis of the gener-
ated pipeline structures. Recoding of the C-language state-
ments to achieve a faster clock rate and more efficient hard-
ware.
3 Mathematical and Algorithmic Back-
ground
In this section, the underpinnings of the MT19937 algo-
rithm are described in mathematical terms, and a suitable
algorithmic presentation in the form of a pseudocode.
3.1 Mathematical Basis of the MT19937
The primitivity test and k-distribution test that are
needed in the parameter search for the MT19937 for a word
x with bit width of w is expressed as the recurrence relation
which is described as follows:
xk+n = xk+m ⊕ (xuk |xlk+1)A k = 0, 1, 2 (1)
where:
n is the degree of recurrence
m is the middle word, or the number of parallel sequences,
1 ≤ m ≤ n
u, l are Mersenne Twister tempering bit shifts
and A is the twist transformation which is defined in rational
normal form:
A =
(
0 In−1
an (an−1,...,a0)
)
(2)
where:
a is a coefficient of the rational normal form twist matrix;
and
where In−1 as the (n − 1) × (n − 1) identity matrix
where the bitwise XOR (⊕) operation replaces addition in
the sum of products while implementing matrix multiplica-
tion. The rational normal form can be expressed in a more
mathematically compact form as follows:
xA =
{
x >> 1;x0 = 0
(x >> 1)⊕ a;x0 = 1 (3)
where
x =
(
xuk |xlk+1
)
k = 0, 1, ..
The maximum theoretical period of 2nw−r − 1 can be
achieved when ΦB(t) is a primitive polynomial and is a
characteristic polynomial of:
B =


0 Iw · · · 0 0
.
.
.
Iw
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 · · · Iw 0
0 0 · · · 0 Iw−1
S 0 · · · 0 0


(4)
and
S =
(
0 Ir
Iw−r 0
)
(5)
where:
r is the separation point of one word, or the number of bits
of the lower bitmask, 0 ≤ r ≤ w − 1
The MT achieves equidistribution in n dimensions. In
order to compensate for reduced dimensionality of equidis-
tribution, tempering is performed as follows:
y = x⊕ (x >> u)
y = y ⊕ ((y << s)&b)
y = y ⊕ ((y << t)&c)
z = y ⊕ (y >> l)
(6)
where:
b, c are tempering bitmasks
s, t are tempering bit shifts
The corresponding coefficients of MT19937 in line with
the above and satisfying the restriction that 2nw−r − 1 is a
Mersenne prime are as follows:
(w, n,m, r) = (32, 624, 397, 31)
a = 9908B0DF16
u = 11
(s, b) = (7, 9D2C568016)
(t, c) = (15, EFC6000016)
l = 18
483
3.2 MT19937 Pseudocode
The pseudocode of the MT19937 PRNG suitable for
software implementation is presented in Codeblock 1. An
implementation of this pseudocode using a .NET develop-
ment platform on a PC equipped with a Core Duo processor
and 1GB of system memory is also carried out to ensure al-
gorithmic validity and to provide a point of reference with
respect to the hardware implementation.
Algorithm 1 Pseudocode for the MT19937 algorithm
declare 32 bit integer MT [624]
declare 32 bit variable y
//The following function initialises the generator given a
32 bit seed
function initialiseGenerator ( 32 bit integer seed )
MT [0] = seed
for i = 1 : 623 do
MT [i]=last 32bits of(1812433253×(MT [i− 1]⊕
right shift by 30 bits(MT [i− 1])) + i)
end for
// The following function generates an array of 624 un-
tempered numbers
function generateNumbers()
for i = 0 : 623 do
y = 32nd bit of(MT [i]) +
last 31 bits of(MT [(i+ 1)%624])
if y is even then
MT [i] = MT [(i+ 397)%624]⊕
(right shift by 1 bit(y))
else
MT [i] = MT [(i+ 397)%624]⊕
(right shift by 1 bit(y))⊕(2567483615)
end if
end for
// The following function extracts a tempered pseudoran-
dom number based on the ith value
function extractNumber(integer i)
y = MT [i]
y = y⊕(right shift by 11 bits(y))
y = y⊕(left shift by 7 bits(y)&(2636928640))
y = y⊕(left shift by 15 bits(y)&(4022730752))
y = y⊕(right shift by 18 bits(y))
return y
4 Proposed Architecture for the MT19937 -
Design and Evaluation
In this section, the high throughput architecture designed
for the implementation of MT19937 is described.
4.1 Architecture Description
The three key functional blocks in the MT19937 archi-
tecture are the Initialise Generator (IG), Generate Numbers
(GN) and Extract Number (EN). All three blocks are exe-
cuted in parallel. To enable parallelisation without increas-
ing hardware overhead, multiported block RAMs are used
rather than arrays built from distributed logic. Addition-
ally, the EN sub-block of the design is fully pipelined, al-
lowing the combinational logic within each iteration of all
three sub-blocks to execute concurrently. Common sub-
expression elimination in the design and the intelligent use
of logic in place of the more expensive modulo operations
enables us to further minimise the hardware complexity of
the design.
The overall functional diagram of the proposed
MT19937 architecture is as shown in Figure 1.
Initialise
Generator
Generate
Numbers
Extract
Numbers
Master Controller
M
u
lti
-
po
rt
e
d 
B
lo
c
k 
R
A
M
 
M
T
[6
2
4]
Init
Seed
Input port
Parallel execution cores
Output
Port
Pipelined
number
extraction
Figure 1. Overall hardware architecture for
MT19937
5 Implementation Details
5.1 Design Methodology and Hardware
Platform
In order to verify the performance of the proposed archi-
tecture for MT19937, the design has beenperformed using
Handel C and prototyped on the Celoxica RC1000 board
containing the Xilinx XCV2000E FPGA. Available on chip
484
logic resource include: 19200 Slices, 80 x 120 CLB Array,
655,360 bits Block RAM (BRAM) and 614,400 bits dis-
tributed RAM [2]. The RC1000 also has 4 memory banks
which communicate with the host by means of DMA trans-
fers [1].
5.2 Performance Metrics
The proposed architecture operates at a maximum fre-
quency of 22MHz. The design is the most compact FPGA
implementation to date and occupies a minimal core foot-
print. Full details of all relevant performance metrics can
be obtained form Table 1.
Table 1. Key performance metrics of the pro-
posed architecture for MT19937
Key Metric Performance
Maximum Frequency 24.234MHz
Number of Slices 330
Total LUTs used 539
Number of RAMB16s 2
Throughput 22M numbers/second
5.3 Comparison with Existing Architec-
tures
Design parameters such as Time Complexity (TC), Area
Complexity (AC) and I/O type of the proposed design with
existing architectures are presented in Table 2.
From the table, it can be clearly seen that the proposed
architecture is much smaller than other existing implemen-
tations. Additionally, this compactness does not come at
a performance cost. Through intelligent use of FPGA re-
sources and design principles, high performance has been
ensured. Since it is difficult to make meaningful direct
comparisons of different platforms for throughput due to
different device architectures and generational gaps, a term
Throughput/MHz term has been introduced in order to nor-
malise the amount of useful work done per system clock.
It is clear from the table that the proposed architecture is
the most efficient one in this regard. Additionally, it is
worth mentioning that although the architecture in [7] has
been implemented on the Virtex-5 which is four generations
ahead of the Virtex-E FPGA that has been used in this pa-
per, the work in [7] achieves only a 52% speedup which is
primarily due to the increased clock speed achieved by the
Virtex-5 due to architectural improvements. However, the
inefficiency of this architecture is clearly demonstrated in
the last row of Table 2.
6 Conclusion
In this paper a high performance FPGA implementation
of the MT19937 algorithm has been implemented. Block
level parallelism and word level pipelining have been ex-
ploited to improve the performance. The proposed archi-
tecture has the smallest core footprint of all architectures
in print and provides the maximum amount of performance
per MHz of the platform used. This architecture is fully
modularised and can be directly used as a PRNG for Monte
Carlo methods.
Table 2. Comparison of Design Parameters
with Existing Architectures
Design Proposed [9] [7] Reference [3]
Platform Virtex-E N/A Virtex-5 S/W S/W
Area (slices) 330 420 2028* N/A N/A
Freq. (MHz) 24.234 N/A 111 2× 1833 N/A
Thru’put (T) 24.16 38.41 37 5.26 10.75
T/MHz 0.998 N/A 0.333 0.14 N/A
* The actual number of slices occupied on the Virtex-5 is 1600.
However, the Virtex-5 has a 5 LUT per slice architectures un-
like all previous generations of XIlinx FPGAs. Hence, a conver-
sion factor is introduced to make the area occupied comparable.
Throughput (T) is in Million numbers/second. T/MHz represents
the Throughput per million clock cycles.
References
[1] RC1000 development platform product brief. Datasheet v1.1,
Celoxica Ltd., August 2002.
[2] Virtex-E 1.8 V Field Programmable Gate Arrays. Datasheet
DS022-1 (v2.3), Xilinx Inc., July 2002.
[3] N. Karaoglu. Mersenne twister - a study on random number
generators.
[4] D. E. Knuth. Seminumerical Algorithms, volume 2. Addison-
Wesley, Reading PA, 2 edition, 1981.
[5] M. Mascagni, S. Cuccaro, D. Pryor, and M. Robinson. A fast,
high quality, reproducible, parallel, lagged-fibonacci pseudo-
random number generator. Technical Report SRC-TR-94-
115, Supercomputing Research Center, 17100 Science Drive,
Bowie, MD 20715, 1994.
[6] M. Matsumoto and T. Nishimura. Mersenne twister: a 623-
dimensionally equidistributed uniform pseudo-random num-
ber generator. ACM Transactions on Modeling and Computer
Simulation, 8(1):3 – 30, January 1998.
[7] D. Pellerin, E. Trexel, and M. Xu. Fpga-based hardware ac-
celeration of c/c++ based applications. Impulse Accelerated
Technologies, August 2007.
[8] C. P. Robert and G. Casella. Monte Carlo Statistical Methods.
Springer-Verlag, 1 edition, 1999.
[9] V. Sriram and D. Kearney. An area time efficient field pro-
grammable mersenne twister uniform random number gener-
ator. In Proceedings of the International Conference on En-
gineering of Reconfigurable Systems and Algorithms, Las Ve-
gas, USA, June 2006. CSREA Press.
485
