Basic uniform pseudo-random number generators are implemented on ATI Graphics Processing Units (GPU). The performance results of the realized generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) on CPU and GPU are discussed. The obtained speed-up factor is hundreds of times in comparison with CPU. RANLUX generator is found to be the most appropriate for using on GPU in Monte Carlo simulations. The brief review of the pseudorandom number generators used in modern software packages for Monte Carlo simulations in high-energy physics is present.
Introduction
The development of the General Purpose computing on Graphics Processing Unit (GPGPU) technology has opened recently a new cheap alternative to supercomputers and cluster systems for researchers. The primary role of the GPGPU technology promotion justly belongs to nVidia company which is the one of the two main GPU hardware developers. nVidia motto is "One Researcher, One Supercomputer". It fully reflects the tendencies of the computational systems present-day market. Moreover, the systems which providing the GPGPU technology become the integral parts of the contemporary supercomputers. In particular, while assembling new supercomputer TH-1 with peak productivity 1.206 petaflops which is equipped with the ATI GPUs HD 4870X2 on every node, China became the third country to build a petaflop supercomputer (after USA and Germany) [1] .
Pseudo-random numbers generation algorithms are key components in most modern packages for researches and result depends on their characteristics. The package should provide the actual investigation of the physics or the other simulated problems, but not to explore the behavior of the generator itself. Thus, while choosing generator, one should consider not only its performance productivity but also its statistical properties. Checking the key results with generator of a different class than used one is also very important [2] . "Pseudo-random" term actually means that for such values always exist an algorithm which may reproduce the whole sequence. The values, produced by some physical process, cannot be considered as random in a general sense, because even if we cannot predict the sequence of such numbers, it does not mean that there is no algorithm to produce them [3] .
The main aim of the present paper is to show the possibility to adapt for GPU the most commonly used pseudo-random number generators (PRNG) as the main components of the software packages for Monte Carlo (MC) simulations. An essential distinction of the GPU architecture from the Central Processing Unit (CPU) architecture causes new difficulties as well as new optimization capabilities. The question about the performances of the different PRNGs on GPU platform is arisen in this connection. The performance results and the difference between PRNGs performances on CPU and GPU will be shown in the paper.
We will restrict our investigation only to uniform PRNGs. In particular some generators were already investigated for nVidia GPUs in [4, 5] . But studied generators are not the most frequently used ones in MC simulations.
In the present paper we will not refer the question of the random-number generator testing (see for example [6] and references therein) and will content only with implementation of existing generators on GPU.
The paper is organized as follows. Section 2 contains the brief list of the software packages for MC simulations in high-energy physics with showing PRNGs which are used in the corresponding package. The details of the GPU implementation and the code listings of the realized PRNGs are described in section 3. The performance results and discussion are collected in section 4. In section 5 we draw our conclusions. Finally, appendix includes some theoretical background for the realized PRNG algorithms.
PRNGs in Monte Carlo simulations
In this section we present the list of PRNGs which are employed in nowadays MC simulations.
To choose a generator which will be used in MC simulations, it is not enough to consider only its period and algorithm output. The generator R250 was widely used due to its simplicity and long period, but it was found that this generator has the essential statistical defects which make impossible to use it in modern simulations (see [9] and reference therein). Apart from general statistical tests like DIEHARD Battery of Tests of Randomness and Crush, a generator has to pass empirical test in real conditions. That is why, while developing MC application, usually only generators with minimal statistical influence for the present MC simulations are used. For example, European Organization for Nuclear Research (CERN) recommends using three PRNGs: RANMAR (V113), RANECU (V114) and RANLUX (V115).
Below is the incomplete list of the software packages for MC simulations in high-energy physics and the PRNG which is used in corresponding package.
• FermiQCD: is an open C++ library for development of parallel Lattice Quantum Field Theory computations [32, 33] . It uses floating-point version of RANMAR generator as default PRNG.
• UKQCD: By indirect information UKQCD Collaboration also uses RANMAR PRNG [34] in its software.
• MILC: is an open code of high performance research software written in C for doing SU (3) lattice gauge theory simulations on several different (MIMD) parallel computers [35] . MILC uses own XOR of a 127-bit feedback shift register and a 32 bit integer congruential generator. Each node or site uses a different multiplier in the congruential generator and different initial state in the shift register. So, all nodes are generating different sequences of numbers. t = (((X n−5 ≫ 7) OR (X n−6 ≪ 17)) XOR
((X n−4 ≫ 1) OR (X n−5 ≪ 23))) AND (2 24 − 1), X n−6 = X n−5 , X n−5 = X n−4 , X n−4 = X n−3 , X n−2 = X n−1 , X n−1 = X n , X n = t, Y n = a j Y n−1 + c, z n = (t XOR ((Y n ≫ 8) AND (2 24 − 1)))/2
24
• CPS: The Columbia Physics System is a large set of codes for lattice QCD simulations [36] . RAN3 is used.
• SZIN: is the open-source software system supports data-parallel programming constructs for lattice field theory and in particular lattice QCD [37] . RANLUX is used in SZIN packet.
• ISAJet: is a Monte Carlo event generator for pp, pp, e + e − interactions [38, 39] . RANLUX is incorporated into ISAJet.
• GEANT4: is a toolkit for the simulation of the passage of particles through matter [40] . GEANT4 uses the HEPRandom module [41] to generate pseudo-random numbers which includes 12 different random engines (RAN-MAR, RANECU, DRAND48, RANLUX, etc) now.
• PYTHIA: is a program for the generation of high-energy physics events [42, 43] . RANMAR is used in PYTHIA as an internal PRNG.
• HERWIG: is a Monte Carlo package for simulating hadron emission reactions with interfering gluons [44, 45] . RANECU is used.
• CompHEP: a package for evaluation of Feynman diagrams, integration over multi-particle phase space and event generation [46, 47] . DRAND48 is used in CompHEP.
• MC@NLO: is a Fortran package to implement the scheme for combining a Monte Carlo event generator with Next-to-Leading-Order calculations of rates for QCD processes [48, 49] . GGL is used.
• SHERPA: is a Monte Carlo event generator for the simulation of highenergy reactions of particles in lepton-lepton, lepton-photon, photon-photon and hadron-hadron collisions [50, 51] . RANMAR is used.
• Chroma: is a software system for lattice QCD calculations [52, 53] . In Chroma slightly modified linear congruential generator RANNYU is implemented -LCG(a = 31167285,m = 2
48
).
• GENIE: is a neutrino event generator for experimental physics community [54, 55] . MT19937 is used.
• ALPGEN: is an event generator for hard multiparton processes in hadronic collisions [56, 57] . RANECU is used.
So, the most commonly used generators are RANMAR, RANLUX, RANECU and several variations of a linear congruential generator (DRAND48, RAN3, GGL, RANYU). Also, most new packages began to include the generator MT19937 as internal PRNG. Hence, it is reasonable to implement the generators on GPU which are used in real MC simulations. 
GPU implementation
In this section we give the base ideas for PRNG implementation on GPU and describe the source codes of the following PRNGs realizations on ATI GPUs: GGL, XOR128, RANECU, RANMAR, RANLUX and MT19937.
We use ATI Stream SDK [61] for the realization of the PRNGs on ATI GPU as software environment. ATI CAL allows using ATI GPU hardware in the most effective way [58] . That is why all PRNGs realizations presented below are made on ATI Intermediate Language (IL) [59] , and not on higher level, for example OpenCL or Brook+.
Three different ATI video cards are used to check the algorithm efficiency -ATI Radeon HD 5850, HD 4870 and HD 4850. The essential for GPGPU-applications parameters about some ATI's video cards are presented in Table 1 . All cards are equipped with the GDDR5 memory except HD 4850 which contains GDDR3 memory (this fact was marked with the asterisk). GDDR5 memory possesses a quadruple effective data transfer rate relative to its physical clock rate, instead of double as with GDDR3 memory. It is necessary to note that HD 4870X2 and HD 5970 are two-core cards. For our purposes at program level they are equivalent to two devices installed in the system.
General implementation scheme
To design the GPU-applications it must be accounted the following main features of hardware architecture:
• each general-purpose register and memory cell has the four 32-bit components that are designated as .x, .y, .z and .w;
• floating point operations are more productive on GPU than integer operations (in compare with CPUs);
• double precision floating point operations are the slowest on GPU;
• ATI GPU can perform up to five operations on each VLIW processor simultaneously.
Computing programs working for GPU are known as kernels. in GPU-memory, for example, for further usage by other procedures, as it is usually made in CPU-simulations. However, it seems to be more effective to "virtualize" produced by PRNG pseudo-random numbers as it allows to eliminate unnecessary additional read-write operations to the GPU-memory for the random numbers as well as to decrease the GPU-memory consumption of the application. In order to accelerate GPU-applications work, we recommend to keep all data needed for kernel operations directly in GPU-memory, because memory operations are a bottleneck of GPU.
For MC simulations on GPU it is convenient to produce four random numbers through one PRNG pass which is closely related to GPU memory architecture. Under such conditions one can use GPU performance in the most efficient way. As a rule more than one random number are used in MC simulations for updating (for example, one for update proposition and one for probability). So, it seems to be natural to generate the corresponding number of the pseudo-random numbers for further purposes.
The common structure scheme of PRNGs is shown in Figure 1 . All implemented PRNGs are divided into 3 phases:
• initialization: loading and preparing the previous state of PRNG and all preparing operations for PRNG;
• random number generation: actually all PRNG operations which generate pseudo-random numbers;
• finalization: storing the final state of PRNG for the next run.
It allows to make the best use of PRNG, as it is needed a few more random numbers. For example, at multihit updating method the several pseudo-random numbers are required per one pass of the updating procedure. The initialization subroutine of the PRNG is called on the start of the updating procedure. During the work the updating procedure may call the PRNG generation subroutine many times. And finally, at the end it must call the finalization subroutine of the PRNG. All PRNG lag tables are stored directly in GPU-memory and it avoids needless transfer the data between CPU and GPU memory. Each instance of the PRNG uses its own lag table to parallelize the process. The universal method of the paralleling is used in all realized PRNGs -using the independent sequences. To be exact, every We use a little bigger number of actual PRNG instances than the number of the stream cores in particular GPU to avoid possible collisions among threads. For example, top one-unit ATI GPU card HD 5870 has 1600 stream cores (see Table 1 ), while 13 lowest bits were used to identify the lag table for PRNG. It corresponds to the 8192 parallel instances of the PRNG.
Almost all generators require the initialization procedure to bootstrap the lag table (three of presented here generators, RANMAR, RANLUX and MT19937). This procedure is realized directly on the GPU unit, not with the help of the CPU with further transfer the lag table into GPU-memory. We do not show lag table initialization procedure here to save the space in the article.
For convenience and performance purposes all the kernels have been precompiled to replace all constants %VariableName% with the corresponding values. It allows to avoid the constant buffer using and put the runtime constant parameters directly in assembler code. Under %VariableName% the value of the respecting variable will be implied in source codes below. Sometimes before the VariableName it will stay the prefix i (like %iVariableName%) which will show the decimal format instead of the default hexadecimal format. Decimal format is needed to specify the relative offsets in global memory operations.
Note that each of the presented generators can be easily modified for the generation of the pseudo-random numbers with double precision. All presented generators are either 24-bit, or 32-bit, while for number representation with double precision 64 bits are required. Thus, for correct generation pseudo-random numbers with double precision one should use several numbers with single precision, but not only covert a number with the single precision into the double precision number by the regular conversion command (it considerably decreases the quantity of possible realization).
GGL
GGL is one of the simplest and computational "light" portable PRNG which could be implemented on GPU. GGL has been studied by Langdon on nVidia Tesla T10P with CUDA SDK [5] . The obtained peak performance of the PRNG is about 2 × 10 9 pseudo-random numbers per second. It is obvious that while period P GGL ≤ 2147483646 and its run in 1024-threads and output 10 9 pseudo-random numbers per second the GGL period could be exhausted in about 0.002 second. So, GGL are realized here just to check the performance of the pseudo-random number production on ATI GPUs on a very simple generator which requires to store only one seed value per PRNG instance.
Park and Miller have published the four Pascal implementations of the GGL [12] , for integer and real arithmetic (one direct scheme and one avoiding the overflow). For implementation we chose the "Integer version 2" of the GGL [12] . The ATI IL source code of GGL PRNG is presented in Figure 2 . There are three subroutines 10, 11 and 12 after the main module. Subroutines 10 and 12 are called only once and perform the initialization and the finalization of the PRNG, correspondingly. Subroutine 11 produces four-component pseudo-random numbers in register R200 per one pass. This program scheme fully corresponds to the basic scheme, described in subsection 3.1, and hereinafter will be used for the rest of generators presented in this work.
The used variables are:
PMSEED_MASK specifies the PRNG instance and iIPMSEED_START is the offset of the lag table in the global buffer which is prepared by the host program.
The parallelization scheme with separate sequences is implemented here -every GPU thread produces four separate pseudo-random sequences (by every slot of general purpose register R200). So, the whole number of the pseudo-random sequences is the quadruple number of the threads to be run. The threads must be initialized carefully to reach the maximal period length of the PRNG and avoid the sequences overlapping.
Generator requires to keep only one seed-value per thread for work which is equal to choosing only one four-component cell per thread in global memory which is read and kept only once per PRNG cycle. Thus, for 4096-thread run (or 16384 subthreads) the size of the lag table will be 64kB. Initial seeds are filled with the host program and transferred to GPU memory before the first run of the GGL. Seed values must be exactly chosen to rich the maximal period of PRNG. But in our case the performance of the generator is the main aim, so all the seeds are selected randomly, because the generator exhausted whole the period for very short time and it is not so important when and where the overlapping of the sequences will began.
The other LCGs could be easily realized on the given example of the PRNG by substituting the corresponding LCG parameters. GGL generator possesses very good performance, as it will be shown in the next section. Nevertheless, in spite of the PRNG performance, it would not be used for practical purposes due to very short period.
XOR128
Next PRNG implemented on ATI GPU is XOR128, a very fast generator with much better statistical properties and considerably longer period than GGL. The distinctive feature of this PRNG is the usage of the four-component 32-bit values to produce the sequence which exactly corresponds to bit-capacity of GPU memory.
The ATI IL source code of XOR128 is shown in Figure 3 . The following variables are used
XRSEED_MASK specifies the PRNG instance, iIXRMSEED_START is the offset of the lag ishl r202,r201,l200.yyyy ixor r202,r202,r201 ushr r203,r202,l200.wwww ixor r203,r202,r203 ushr r201.x,r201.w,l200.z ixor r201.x,r201.x,r201.w ixor r201.x,r201.x,r203.x ushr r201._y,r201.x,l200.z ixor r201._y,r201.y,r201.x ixor r201._y,r201.y,r203.y ushr r201.__z,r201.y,l200.z ixor r201.__z,r201.z,r201.y ixor r201.__z,r201.z,r203.z ushr r201.___w,r201.z,l200.z ixor r201.___w,r201.w,r201.z ixor r201.___w,r201.w,r203.w utof r200,r201 div r200,r200,l201.xxxx ret_dyn endfunc The number of sequences produced by XOR128 corresponds to the number of threads. One thread generates four successive pseudo-random numbers from one sequence in every components of the general purpose register R200. Marsaglia's algorithm [25] is slightly modified to parallelize sequence production into four subthreads -four items of the lag table are composed in one four-component memory cell and are produced in one pass. Unfortunately, it is impossible to avoid recursion without making the algorithm more complicated. Therefore among the four-component operations in the source code there are single-slot operations which are partially parallelized further by the IL compiler.
Except the final integer-to-float conversion operations in the algorithm, there are only bit shift and exclusive-OR operations which are "computationally light" GPU operations in compare with integer operations. Along with few memory operations (one read and one written operations per run), it also increases the performance of PRNG on GPU.
The XOR128 generator has a period large enough to be used on GPU. For 2048-threads run and average performance about 10 10 samples per second it could be exhausted in about 10 17 years only. And only strong criticism of L'Ecuyer [26] does not to allow use XOR128 as standard PRNG on GPU.
Generator requires keeping four 32-bit integers per thread. In the present realization PRNG produces four sequential pseudo-random numbers per thread which allows reserving only one 4-component cell per thread in global memory. This 4-component seed is read and written only once per PRNG cycle. The size of the XOR128 lag table is the same to GGL, i.e. for 4096-thread run it takes the 64kB of the GPU memory.
RANECU
Another high-performed PRNG realized on ATI GPU is RANECU. This generator is recommended by CERN and used in some software packages, listed in the previous section. Relatively long period and quite good statistical properties make RANECU reasonably attractive generator for small tasks.
The scheme like in the case of GGL generator is implemented here: each thread produces the four independent sequences which are composed into the four components of the general purpose register R200. For RANECU run in 1024-threads and output 5 × 10 9 pseudo-random numbers per second the RANECU period could be exhausted in about 31 hours. Despite this fact the generator is still acceptable for iIRESEED_START and iIRESEED_START2 are the offsets of two tables of seeds in global buffer, RESEED_MASK specifies the PRNG instance. In fact, the RANECU lag table is divided into two tables for convenience here. These tables are prepared by the host program. The generator requires keeping two integer seed-values per thread which are grouped into two lag subtables. So for 4096-thread run the size of the lag table is 128kB.
The generator kernel is made on integer arithmetic base which slightly brings down the generator performance while using GPUs. However, simplicity of the algorithm and small amount of the lag table elements completely compensate this "drawback". On the base of the present code, one can easily construct another MRG by substitution of the corresponding parameters.
RANMAR
The next generator realized on ATI GPU is the 24-bit Marsaglia PRNG RANMAR. It was previously implemented on ATI GPU in [58] for Ising model and SU (2) gluodynamics simulations.
In contrast to previously presented PRNGs, RANMAR has a larger lag table which contains 97 elements. All these lag table items must be prepared before the first working pass of the generator. Of course, the lag table may be directly initialized by the user, but it is not convenient in practice. So, the RANMAR lag table is initialized by stand-alone procedure RMARIN, proposed by James [23] . In this procedure, the whole lag table is initialized on the base of only two given 5-digit integers, each set of which causes an independent sequence of the sufficient length for an entire calculation. The seed variables can have values between 0 and 31328 for the first variable and 0 and 30081 for the second variable, respectively. RANMAR can create, therefore, 900 million independent subsequences for different initial seeds with each subsequence having a length of about 10 30 pseudo-random numbers. This approach considerably reduces the number of the possible generator states. Still it brings an important element of the generator features -easy division of sequences produced by generator among PRNG instances without overlapping. Possible sequences quantity (900 millions) at existing or developing hardware is considered to be sufficient even in medium-term perspective.
The generator consists of two parts:
• kernel which produces the seed numbers on initial seed values; in fact this kernel is a replica of the RMARIN subroutine of the James' version [23] ;
• subroutines which directly produce the random numbers.
First part is executing only for initialization. The floating-point version of the RANMAR generator is realized here which produces the pseudo-random directly in the interval [0; 1). So, it is not needed to use slowest integer-to-floating point converting operation.
Apart from 97 elements in the lag table each RANMAR instance must store previous value of arithmetic sequence and two indices which are connected to each other. As in the case of GGL PRNG, every GPU thread produces four independent sequences in the presented implementation of RANMAR. Obviously at such approach, it is necessary to keep only one pair of indices for all four subthreads, because these subthreads are executed out synchronously. So, it requires 2.5 memory cells be read and written for one PRNG pass. The size of RANMAR lag table for 4096-thread run is about 6MB. Initially lag table is prepared by stand-alone procedure RMARIN which is not shown here.
The ATI IL source code of RANMAR PRNG is presented in Figure 5 . The following variables are used
RMSEED_START is the offset of the lag table in global buffer, RMSEED_MASK specifies the PRNG instance.
RANLUX
Nowadays RANLUX PRNG is one of the standard high-performed generators for Monte-Carlo simulations. The statistical properties of the generator are well-known. From the realization point of view, distinctive feature of the generator is the necessity to discard out groups of generated pseudo-random numbers after one generation cycle. Omitted values quantity is determined by the "luxury" parameter. While implementation of the algorithm on the central processing unit such discarding is "virtual", because lag table fits in processor cache very well, as a rule. Still this algorithm phase is very resource-intensive on GPU, because global buffer is not a generally cached object. Thus it is obvious, that the PRNG performance is strongly depend on this phase of algorithm realization.
To implement the RANLUX generator, we use three quite different approaches. First of all, the direct translation of algorithm is performed. In this scheme all the seed values are updated directly in GPU global memory. Every thread in point of fact generates simultaneously the four independent sequences of the pseudorandom numbers. Luxury is performed for all four subthreads at the same time. This approach makes the algorithm considerably simpler, but does not allow getting the best performance.
Next evident approach is to use the indexed temporary array for luxury operation. It allows to make process execution much faster: at luxury level=3 3.5 times faster and 5 times faster at luxury level=4, keeping the complexity of the algorithm meanwhile.
Third approach which really allows to make algorithm faster and in practice minimize the dependence of the execution time on luxury level, is the "planar" scheme. The geometry of the lag table is taken into account as much as possible in this scheme. For RANLUX it consists of the 24 elements which naturally could be grouped into six 4-component 32-bit registers. The new difficulty to vectorize the RANLUX algorithm is arisen due to the necessity of the recursive calculation of the carry bit c n which depends on the preceding states of the generator. Therefore, some serial operations are appeared in planar RANLUX code as well as in presented here XOR128 implementation. Planar RANLUX procedure produces four pseudo-random numbers from one sequence for one pass of the generator.
Base algorithm RANLUX requires discarding 24, 73, 199 and 365 values after one RANLUX cycle for luxury level 1, 2, 3 and 4, respectively. However to perform luxury in the planar implementation of the RANLUX, it is convenient to discard some larger number than ones, proposed by Lüscher [30] . Strictly speaking, it is convenient to discard the number of the values which are multiply by 24. In CPU implementation such approach seems to be redundant (see [30] ), but on GPU it shows better performance. Thus, in planar scheme the following numbers are discarded: 24, 96, 216 and 384 (for luxury levels 1, 2, 3 and 4, respectively).
The ATI IL source code of planar RANLUX PRNG is presented in Figure 6 . The following variables are used
RLSEED_START is the offset of the lag table in global buffer, RLNSKIP is the number of generated values to be discarded (is defined by the luxury level), RLSEED_-MASK specifies the PRNG instance. Due to relatively large lag Figure 6 : Source code of RANLUX PRNG not take much resources, so its listing is not presented here. Also, it may be realized in the host program with further copying the result to the GPU global buffer.
Generator RANLUX, as well as RANMAR PRNG, possesses the important feature in the context of the GPU realization -the generator kernel is build on floatingpoint arithmetic. Each instance of the planar version of the RANLUX requires storing seven 4-component cells only (three indices which are connected to each other, 24 items of lag table and carry bit). So for 4096-thread run the size of the lag table is 448kB (or 1664kB for the case of the global buffer and temporary indexed array RANLUX versions).
Mersenne Twister
The last PRNG which is implemented in this work is MT19937, one of the Mersenne Twister generators family. This generator seems to be very attractive nowadays due to its extremely long period and relatively easy algorithm.
Mersenne Twister is incorporated in nVidia SDK as sample. The realized in nVidia SDK version of the Mersenne Twister contains 19 element lag table and period about 2
607
. It is easy to implement the MT19937 generator on the basis of this example by substituting the relevant parameters.
In the presented realization of MT19937 we use the same scheme as in the planar implementation of RANLUX. Whole 624-element lag table is located into 156 four-component cells. So, for the one pass of the PRNG it is easy to obtain four sequential pseudo-random numbers. But unfortunately, it is impossible to store whole lag table in the general-purpose registers to perform the lag table updating procedure, because the total number of the general-purpose registers allocated for one thread is 128. Thus, all operations with the lag table are realized with the slow direct access to the global buffer, what greatly reduces the total performance of the generator. Nevertheless, the using of the four-component elements somewhat compensates this disadvantage.
The ATI IL source code of MT19937 PRNG is presented in Figure 7 . The following variables are used
MTSEED_START is the offset of the lag table in global buffer, MTSEED_MASK specifies the PRNG instance. The lag table of the MT19937 is the biggest one among the lag tables of the presented generators. For 4096-thread run its size is about 10MB. MT19937 also requires the initialization procedure which prepares the lag table for the first run.
Performance results and discussion
For all PRNGs implemented here we use the MS Visual Studio 2008 Express edition (C++ compiler) [63] . Original codes are presented in corresponding literature (see Section A) and in several cases they are translated into C++.
The CPU implementation of the PRNGs is constructed on the following common scheme: each PRNG is implemented as subroutine which produces only one pseudorandom number per call. Then the main program sums up all produced values and checks the elapsed time. 14 iadd r201.y,r201.x,r202.x fence_memory mov r200,g[r201.y] ushr r215,r200,l202.xxxx ixor r200,r200,r215 ishl r215,r200,l202.yyyy iand r215,r215,l203.xxxx ixor r200,r200,r215 ishl r215,r200,l202.zzzz iand r215,r215,l203.yyyy ixor r200,r200,r215 ushr r215,r200,l202.wwww ixor r200,r200,r215 utof r200,r200 add r200,r200,l203.zzzz div r200,r200,l203. R a n M a r G G L R a n E c u M T 1 9 9 3 7 R a n L u x 3 P R a n L u x 0 P R a n L u x 1 P R a n L u x 2 P R a n L u • Intel Celeron CPU 420 @ 1.60GHz (L1 cache 32KB, L2 cache 512KB), 1.5GB RAM DDR2 333MHz Dual (5-5-5-15) Command rate 1T, which correspond to the middle-level and entry-level computers, respectively. The CPU performance results are presented in Figure 8 and Table 2 . It could be seen that the differences among the performances of the generators are about 5-6 times. XOR128 and RANMAR show the best performance results. Under CPU performance here and after we mean the performance of the one-thread instance of the algorithm. Of course, one-thread run does not provide maximal system utilization, however it allows comparing the potential performances of the systems. To show impartial assessment we can just multiply CPU performance by the quantity of the threads, supported up by peculiar processor, because it is always possible to run several PRNG instances to produce different independent pseudo-random sequences.
All the PRNGs implementations on GPU are carried out in ATI Intermediate Language [59] (ATI Catalyst 10.1 display driver is used [62] ). The host environment is also realized in MS Visual Studio 2008 C++ [63] . All used here GPUs are the main GPU devices installed in the system (i.e. they also provide visualization for the operational system) which lowers down the maximal performance of the system, but reflects more precisely the usual configuration of the GPU computational system. To obtain the performance of each PRNG, we run them up to 1000 times each to produce 4 × 10 7 pseudo-random samples on every pass. Each produced pseudorandom number is stored in global buffer. Elapsed time is averaged over several executions. The time spending to copy the initial input seeds to GPU memory and final mapping the GPU memory into host memory is not taken into account.
The performance results are collected in Figure 9 and Table 2 . The first column of the Table 2 contains the name of the implemented generator. The number pseudorandom numbers which could be produced by corresponding PRNG per one second on corresponding GPU are show in the columns 2-4. Here HD5850, HD4870 and HD4850 are the ATI Radeon video cards. The next two columns contain the same information obtained on two mentioned CPUs. The last column shows the speed-up factor of the using the ATI Radeon HD5850 in compare with the using of the Intel Core 2 Quad CPU Q6600 at 2.40GHz.
Memory access is a bottleneck of the GPU-applications. It is confirmed once again by different PRNG performance results on different ATI GPU hardware. The PRNGs with the greater number of the memory operations demonstrate the worst performance results.
In the present realizations GGL and XOR128 generators require to keep only one 4-component seed-value per thread. This directly influences their work -both generators showed the best productivity. Despite the generator XOR128 in the present code has sequential part which slightly lowers the speed of the generator operation its performance turned out to be the highest. It may be explained by the fact that only bit operations are used in the arithmetic kernel of the XOR128 generator which along with floating-point operations are the fastest implemented on ATI hardware. Generator GGL is realized on the base of the integer scheme of Park and Miller [12] which slightly brings down its output in compare with XOR128, in spite of less quantity of the intermediate operations.
The RANECU generator already requires keeping two seed-values for its operation. Along with using integer arithmetical operations in RANECU kernel it somewhat lowers its performance in compare with GGL and XOR128 generators. The algorithm simplicity, the performance up to 5 × 10 9 samples per second and the considerably better statistical properties allow extensively using RANECU generator on GPU. In order to increase the RANECU period it is reasonable to employ the MRG based on the combinations not two but three MLCGs. However this extension requires the additional study of the new generator statistical properties. It may be a subject of the further research in this field.
The RANLUX generator shows considerably high performance on GPU. To a significant degree it may be explained by fortunate matching of GPU architecture and the generator parameters. It is managed to minimize generator dependence X O R 1 2 8 R a n M a r G G L R a n E c u M T 1 9 9 3 7 R a n L u x 3 P R a n L u Thereby it is possible to use maximal luxury level with the minimal performance penalty in practical applications. In the opposition to CPU realization where the difference between the performances at luxury level=0 and luxury level=4 reaches 6 and more times, on GPU this parameter is only 10-23% (for different hardware).
It is obviously that performance advantage also depends on the way of algorithm realization. In order to demonstrate this fact, we showed in Table 3 the performance values of different realization of the same algorithm for RANLUX generator for a number of GPUs. RANLUX implementation through the global buffer is the slowest one. In fact the present realization of algorithm entirely reproduces the dependence of the GPU realization on luxury level. Acceleration in compare with CPU is achieved only because of such GPU parameters as number of stream cores and engine clock rate. An unexpected result was the fact that realization of for global buffer realization of RANLUX HD4870 GPU shows better performance than HD5850. It is closely related to the fact that memory data access in the kernel is organized not in the best way and a lot of time is spent to synchronize memory access operations. Obtaining better performance is possible by lowering the instructions density which refer to the global buffer.
Second RANLUX realization, through the indexed temporary array, is a bit faster than the first one. But the same situation with HD4870 and HD5850 GPUs performances can be observed here as well. The present realization has sense only for luxury levels 2-4, when time share, which is spent on indexed temporary array preparation, is compensated by luxury operation saving time itself. The main strong feature of this implementation is the complete correspondence to the classic algorithm, published in [31] .
The last presented implementation of RANLUX (which is called planar here) possesses the best performance due to maximal reduction of memory access opera- tions quantity. General advantage of performance on GPU in compare with CPU is up to 322 times (highest luxury level = 4). The modified algorithm is used here (see section 3.6) at which a bit higher quantity of pseudo-random numbers after one PRNG cycle are discarded away for decorrelation of lag table elements, than it is set up in the classic algorithm.
In the first two RANLUX realizations of the algorithm, it requires to read four memory cells (not accounting for the luxury procedure operations) and write three cells (two lag table items, carry bit and indices) to obtain one pseudo-random number (updated lag table item, new carry bit and new indices). In planar scheme the carry bit occupies one of the four-component cells with the indices, so it needs less quantity of reading and writing operations (four read and two writes).
The performance of the shown code of the MT19937 generator is turned out to be rather low. The main factor is the size of the lag table, the update of which is required after each PRNG cycle. Planar scheme applied here allows increasing the generator performance roughly in four times in compare with the direct four-threads realization, but the quantity of the memory access operations remains rather high which does not allow to achieve acceptable results. It is possible to reduce the lag table size by choosing another generator from the Mersenne Twister family. But from one hand, MT19937 generator parameters allow to realize the planar scheme (proved to be a good here), and from another hand there is a task to build the analogue of the actually used generators on GPU while algorithms realization. Although to generate one pseudo-random number it is needed only two read operations and one write only, the lag table update procedure drastically decelerates the generator.
The worst performance is shown by RANMAR generator which differs by the large enough lag table and high memory access operation density (four reads and three writes).
All presented generators realizations allow to produce at slight changes pseudorandom numbers with double precision either by using the pairs of the generated numbers or directly through the double precision conversion. The last method is not fully correct, because it considerable lowers the quantity of the possible realization of the obtained double precision number.
Paying attention to the XOR128 performance one can asserts that LЎEcuyer generator Seven-XORShift [26] is seemed to be very promising for realization on GPU due to the similarity to the XOR128 structure except for the eight-element lag table.
The periods of the XOR128, RANLUX, RANMAR and MT19937 generators are considered to be unachievable in medium-term perspective even for GPU clusters. Meanwhile the period of the GGL generator can be exhausted in a split second even on one relatively old GPU. So, it is necessary to pay special attention to the generator applicability borders while developing applications on GPU.
Undoubtedly the presented algorithms realization on the platform independent level such as OpenCL will be very important, but it goes beyond the present work frame and is a task for future research. The obtained results of the generators performances make it interesting to use the GPU for the investigation of the generators statistical properties.
Conclusions
In the present paper the most popular uniform pseudo-number generators which are used in Monte Carlo simulations in high-energy physics are realized on GPU. The list of the modern software packages with the indication of the generators used is represented. A theoretical background for pseudo-random number generation is described. The source codes of the implemented generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) are shown
The comparative analysis of the PRNG performances on CPU and ATI GPU is presented. The obtained speed-up factor is hundreds of times higher as compare to CPU. Performance analysis of the mentioned generators with taking into account their statistical properties allows to conclude that the most appropriative generator for Monte Carlo simulations on GPU is planar RANLUX with luxury level 4. Offered planar scheme makes it possible to increase significantly the performance of the most generators by the reducing memory access operations.
Offered PRNG program model (generator division into separate subroutines) also allows to obtain additional considerable gain of performance at the multiple calls of the PRNG generating subprogram. This is achieved by means of a substantial reduction in the number of the memory access operations per PRNG cycle.
In the present paper the generator performances in different realizations are analyzed by way of RANLUX PRNG example. It is shown once again that the memory access operations are the bottleneck of GPUs. It is possible to conclude that for GPU implementations of PRNGs it is better to choose algorithms with the minimal lag table size (which to be advisable multiply by 4) perhaps in spite of some complication of algorithms.
Potential of GPU implementation in Monte Carlo simulations is not fully realized nowadays. This is despite of large amount of works dedicated to this problem. GPU sector development dynamics makes it possible to predicate that this trend is one of the most promising nowadays.
A Basic classes of PRNGs
In this section we briefly provide the theoretical background for the main classes of the pseudo-random number generators. There are numerous books and reviews about the subject, so the details could be found in the references.
While pseudo-random numbers generation, there are only two internal sources of the randomness which could be used in the algorithm. They are the sequence itself (previous values in a sequence precisely) and the starting parameters (PRNG parameters and seed values). Actually, the algorithms of the PRNGs are distinguished by the way of the using these sources. So, particular PRNG is a certain function f which produces the next value X n ,
Here and below we will use upper index PRNG to identify the sequence produced by particular generator PRNG. The maximum period of the generator is the length of the cyclic sequences produced by PRNG and is limited by the number of the states that can be represented by PRNG. First simple PRNGs uses only one previous value X n−1 (r = 1) for generation, but in such scheme the PRNG period is limited by the bit capacity of the machine. If one uses the longer tables of the sequence of the previous values, from one hand it makes the generator period longer and its statistics properties better, as well as allows to simplify transformation function f for getting better output of the generator. From the other hand, longer tables require more complicated generator initialization and reduce its portability (the strict control of the architecture is needed, for example, the size of the cache memory), and it is obviously necessary to have much more memory to store such tables. A good generator is always a golden middle between algorithm complexity, the statistical properties of the generator and the size of the seed table.
According to the basic requirement for PRNG -repeatability, any starting state of the generator may cause only one specified sequence. The initialization of the seed table is often made by simpler generator which has lower requirements. Most of all linear congruential generators or their combination is used for initialization. So, it allows to set starting generator states by the limited set of the input seed values (often 1-2 numbers). But Marsaglia [3] possible values are given. Certainly, algorithm contains the possibility of the vector initialization. Nevertheless, only one seed is used to simplify the initialization.
The period of PRNG nearly always depends on its parameters and seed values. Thus, we always show only the upper bound for the value of the generator period.
A.1 Linear congruential generators
Linear congruential generators (LCG) is one of the oldest and most popular class of the PRNGs, which is widely used in computations in particular due to the encyclopedic work of Knuth [11] . It is based on so-called linear congruential integer recursion,
where increment c and modulus m are desired to be positive coprime integers (c < m) to provide a maximum period, multiplier a is an integer in the range [2; (m − 1)].
If increment c = 0 the LCG is often called the multiplicative linear congruential generator (MLCG),
The maximum period of LCG P LCG strongly depends on LCG parameters and
Demonstrative situation with poor choice of LCG parameters happened with infamous generator RANDU -MLCG(a = 2 16 + 3, m = 2
31
),
which suffers from three-point correlations among the sequential elements:
Another well-known LCG is the standard 48-bit generator DRAND48, which is a LCG(a = 25214903917, c = 11, m = 2
Park and Miller propose [12] a portable minimal standard Lehmer generator 
The total period of the GGL is relatively short, P GGL = (2 31 − 1) − 1 = 2147483646. One of the main well-known problems of the LCG is that every LCG produces n-tuples of uniform variates which lie in at most parallel hyperplanes [14, 15] . The other defects of the LCG are:
• the dependence of the generator period on initial seed;
• the influence of the chosen modulus on statistical properties of the pseudorandom sequence;
• lowest bits are not random.
And finally, the minimal integer value produced by MLCG is 1, not 0.
A.2 Feedback shift register generators
In 1965 Tausworthe [18] introduced a new generator based on bit sequence
where a i = {0, 1}, X i = {0, 1}, r ≤ n, is called linear feedback shift register algorithm (LFSR). The period of the LFSR is the smallest positive integer P for which
The next state could be obtained with the following transformation
or equivalently for initial state it is
The characteristic polynomial of r × r transformation matrix A
is
where I is an identity matrix. According to the theory of the finite fields [19] the maximal period P FSRG = 2 r − 1 is achieved if and only if the characteristic polynomial a(x) is an irreducible polynomial over Galois field GF (2). In other words, if and only if the smallest positive integer p: (x p mod a(x)) mod 2 = 1 is p = 2 r − 1 [7] . For computational efficiency, most of the a i in (10) should be zero. In GF (2) there is only one irreducible binomial, x + 1, which would yield an unacceptable period [8] . Consequently, the trinomials are usually used to express the recurrence sequence (10), X n = (X n−r + X n−s ) mod 2, s < r < n.
Addition modulo 2 for one-bit variables is ordinary binary exclusive-or operation XOR, so (16) may be rewritten as
LFSR recursion (10) produces pseudo-random bit sequence. To obtain k-bit pseudo-random integers Y n from such recursion, one can group up k sequential bits,
Such method is called the digital multistep method of Tausworthe [20] . Another method proposed by Lewis and Payne [21] is the generalized feedback shift register (GFSR). In GFSR scheme bits in the positions j of the pseudo-random integer are filled with the copy of initial one-bit recursion (17) which has a period 2 r − 1 with some nonnegative offsets d j ,
Clearly, LFSR is the particular case of GFSR. For example of GFSR it could be mentioned the R250 generator [22] . It is another infamous PRNG which causes the severe problems in the Monte-Carlo simulations of the two-dimensional Ising model using the single-cluster Wolff update algorithm (see [9] and references therein). For R250 the GFSR parameters are r = 250 and s = 103,
The R250 period is P R250 = 2 250 − 1 ≈ 1.81 × 10
75
.
A.3 Lagged Fibonacci generators
Lagged Fibonacci generators (LFG) is another class of the PRNGs. It is based on the well-known Fibonacci recurrence sequence
Because the simple Fibonacci generator is not very good [23] one always uses the generalized relation (21) with respect to any given binary arithmetic operation ⊙ and prehistory
where r and s are called "lags", r ≤ n and 1 < s < r. The LFG period P LFG for different operations ⊙ is
The main attractive features of the LFGs are long period, potential absence of conversion integer into float operations and simple recursive scheme which not requires the heavy mathematical operations. However, for generation LFGs it is needed to store r previous pseudo-random values.
There are numerous possible pairs of the LFG lags [11] . The larger lags lead to the decreasing of the correlations between the numbers in the sequence. But even for relatively short lag table r 20 it passes many statistical tests.
RAN3 generator [11] could be mentioned as an example of the LFG. It was proposed by Mitchell and Moore (unpublished), with lags r = 55 and s = 24, m = 10 9 and operation subtraction,
The RAN3 period is P RAN3 = (2 55 − 1)10 9 /2 ≈ 1.8 × 10 25 .
A.4 Combined generators
Combined generators are a special class of the PRNGs, which contains the features of the different PRNG classes. There are two main motivations to use combined generators:
• the period increasing of the generator,
• the improving of the generator statistical properties.
A.4.1 Multiple recursive generators
The obvious extensions of the LCG is the multiple recursive generator (MRG) [16, 17] which is determined as the combination of the MLCGs
When k > 1, MRG is usually called MRG of the k order. The maximal period of the MRG is P MRG ≤ m k − 1. In fact LFG is the special case of the MRG (for multipliers a i = 1 and c = 0).
By decomposition of the modulus of the MLCG into two terms m = aq + r and eqn.(4) may be written as following [16] X n = aX n−1 mod m = (a (X n−1 mod q) − ⌊X n−1 /q⌋r) mod m,
where ⌊a/b⌋ denotes the integer part of the (a/b) and
To provide the uniform distribution the combinations of l generators (26) may be combined as [16] 
Here the new index j in X j,n means the n-th value (26) of j-th generator. One of the possible MRGs is the RANECU generator [16] . It is the combination of two MLCGs (l = 2) with a 1 = 40014, m 1 = 2147483563 and a 2 = 40692, m 2 = 2147483399. The RANECU period is P RANECU = (m 1 −1)(m 2 −1)/2 ≈ 2.30584×10
18
. MRGs have good statistics and pass most the tests.
A.4.2 XORShift
XORShift PRNG, proposed by Marsaglia [25] , is another member of the GFSR generators class. Let X 0 be a some initial k-bit row-state of XORShift and T is k × k nonsingular binary matrix which sets linear transformation. The n-th PRNG state may be derived through the following equation
To ensure the performance requirements Marsaglia proposed the special form of matrix T ,
where matrices L and R are k ×k binary matrices which effect shift of one to the left and right, correspondingly.
In [25] Marsaglia lists all possible full-period triplets (a, b, c) for 32-bit (648 combinations) and 64-bit (2200 combinations) XORShift PRNG.
The maximal period of XORShift is
In spite of XORShift PRNG passes the DIEHARD Battery of Tests of Randomness [25] L'Ecuyer appoints that it "spectacular failed" the SmallCrush and Crush tests [26] . L'Ecuyer does not recommend to use this class of the generators, but proposes the own version of the XORShift implementation -Seven-XORShift.
Marsaglia gives an example of the XORShift generator for 128-bit vector with four 32-bit components -XOR128 PRNG [25] , (X n−3 , X n−2 , X n−1 , X n ) XOR128 = (X n−4 , X n−3 , X n−2 , X n−1 ) · 
X n−3 = X n−2 , X n−2 = X n−1 , X n−1 = X n , X n = (X n−1 XOR (X n−1 ≫ 19)) XOR (t XOR (t ≫ 8)).
The XOR128 period is P XOR128 ≤ 2 128 − 1.
A.4.3 Mersenne twister
One of the most "fashionable" modern PRNGs is the Twisted GFSR generator (TGFSR) or Mersenne twister generator [27, 28] . TGFSR is the modernization of the GFSR and its algorithm is based on the following recurrence for w-bit vectors X n X TGFSR n+r = X n+s XOR X upper n OR X lower n+1
A,
where superscript indices "upper" and "lower" denote the w − u highest and u lowest bits of a corresponding binary vector X i , respectively, 
So, only one w-bit vector a = (a w−1 , a w−2 , · · · , a 0 ) defines the product X i A,
if (X i AND 1) = 0 (X i ≫ 1) XOR a if (X i AND 1) = 1 .
For improving the statistical properties of the sequence so-called tempering procedure is applied for the output sequence X n . This procedure is defined with the t = X n XOR (X n ≫ m), By appropriate choosing of r, u parameters and binary vector a, it might reach the maximal period of the TGFSR,
The most famous implementation of the TGFSR PRNG is MT19937 [28] . The TGFSR parameters of the MT19937 are the following: w = 32, r = 624, s = 397, u = 31, 
Or equivalently the producing recurrence is,
X n = X n + m for X n < 0.
Here Y n and C n are (23)) and the period of the arithmetic sequence C n is P LCG(1,2 24 ) ≤ (2 24 − 1) (see eqn. (5)). Therefore, the total period of the RANMAR is P RANMAR 2 144 ≃ 2.23 × 10 43 .
A.4.5 Add-With-Carry and Subtract-With-Borrow generators
In 1991 Marsaglia and Zaman introduced a new class of PRNGs: add-with-carry (AWC) and subtract-with-borrow (SWB) [29] , which are small modifications of the LFG with respect to supplementing an extra carry or borrow bit. Due to the branching it became the first class of nonlinear PRNGs [8] . The AWC generator is described by the sequence X AWC n = (X n−r + X n−s + c n−1 ) mod m,
c n = 1 if (X n−r + X n−s + c n−1 ) ≥ m 0 if (X n−r + X n−s + c n−1 ) < m .
In addition to r seed values (X 1 , . . . , X r ) generator must be initialized with the carry bit c r . The maximal period of the AWC generator is
In the SWB case the subsequent values of the sequence are obtained by X SWB n = (X n−r − X n−s − c n−1 ) mod m,
c n = 1 if (X n−r − X n−s − c n−1 ) < 0 0 if (X n−r − X n−s − c n−1 ) ≥ 0 .
L'Ecuyer noted [10] that SWB has a second variant, in which indices r and s in (49) are swapped. The maximal period of the SWB generator is
The well-known example of the SWB is RCARRY [23] which underlies the RAN-LUX PRNG. The RCARRY parameters in (49) 
c n = 1 if (X n−24 − X n−10 − c n−1 ) < 0 0 if (X n−24 − X n−10 − c n−1 ) ≥ 0 .
The RCARRY period is about [29] P RCARRY ≤ (2 24 ) 24 − (2 24 ) 10 − 2 /48 ≃ 1/3 × 2 572 ≃ 5.15 × 10 171 .
It is less than the maximal period of the SWB (50) because modulus m = 2
24
is not a prime number.
A.4.6 RANLUX
Despite all advantages (extremely long period, portability and good productivity) AWC and SWB generators suffer from some statistical defects (a bad lattice structure), showed by Lüscher [30] . To eliminate these lacks James [31] implements the Lüscher's [30] idea to modify the SWB generator RCARRY. In the new generator which was called RANLUX (for LUXury RANdom numbers [31] ) after producing r = 24 pseudo-random values the p − r following sequential values are discarded. It might be used any values of p, but there are five generally accepted levels of the luxury every of which has its own value p [30, 31] ,
• level 0 (p = 24): complete equivalent to original RCARRY generator, there are no discarding values
• level 1 (p = 48): throws out 24 values after one generation cycle; considerable improvement in quality over RCARRY, passes the gap test, but still fails spectral test
• level 2 (p = 97): passes all known tests, but theoretically still defective
• level 3 (p = 223): default level, any theoretically possible correlations have a very small chance of being observed
• level 4 (p = 389): highest possible luxury, all 24 bits of the mantissa are chaotic.
Lüscher recommends [30] to use a default value p = 223 and notes that employment of values p > 389 is pointless.
