CPUs and operating systems are moving from 32 to 64 bits, and hence it is important to have good pseudorandom number generators designed to fully exploit these word lengths. However, existing 64-bit very long period generators based on linear recurrences modulo 2 are not completely optimized in terms of the equidistribution properties. Here, we develop 64-bit maximally equidistributed pseudorandom number generators that are optimal in this respect and have speeds equivalent to 64-bit Mersenne Twisters. We provide a table of specific parameters with period lengths from 2 607 − 1 to 2 44497 − 1.
INTRODUCTION
Monte Carlo simulations are a basic tool in financial engineering, computational physics, statistics, and other fields. To obtain precise simulation results, the quality of pseudorandom number generators is important. At present, the 32-bit Mersenne Twister (MT) generator MT19937 (with period 2 19937 − 1) (Matsumoto and Nishimura 1998 ) is one of the most widely used pseudorandom number generators. However, modern CPUs and operating systems are moving from 32 to 64 bits, and hence it is important to have high-quality generators designed to fully exploit 64-bit words.
Many pseudorandom number generators, including Mersenne Twisters, are based on linear recurrences modulo 2; these are called F 2 -linear generators. One advantage of these generators is that they can be assessed by means of the dimension of equidistribution with v-bit accuracy, which is a most informative criterion for high dimensional uniformity of the output sequences. In fact, 30:2 S. Harase and T. Kimoto MT19937 is not completely optimized in this respect. Panneton et al. (2006) developed the Well Equidistributed Long-period Linear (WELL) generators with periods from 2 512 − 1 to 2 44497 − 1, which are completely optimized for this criterion (called maximally equidistributed), but the parameter sets were only searched for the case of 32-bit generators. Conversely, there exist several 64-bit F 2 -linear generators. Nishimura (2000) developed 64-bit Mersenne Twisters, and the SIMDoriented Fast Mersenne Twister (SFMT) generator (Saito and Matsumoto 2009 ) has a function to generate 64-bit unsigned integers. For graphics processing units, Mersenne Twister for Graphic Processors (MTGP) (Saito and Matsumoto 2013) is also a good candidate. However, these generators are not maximally equidistributed. In earlier work, L' Ecuyer (1999) searched for 64-bit maximally equidistributed combined Tausworthe generators (with some additional properties). At present, though, to the best of our knowledge, there exists no 64-bit maximally equidistributed MT-type F 2 -linear generator with period 2 19937 − 1, such as a 64-bit variant of the WELL generators.
The aim of this article is to develop 64-bit maximally equidistributed F 2 -linear generators with similar speed as the 64-bit Mersenne Twisters (Nishimura 2000) . The key techniques are (i) state transitions with double feedbacks (Panneton et al. 2006; Saito and Matsumoto 2009 ) and (ii) linear output transformations with several memory references (Harase 2009 ). We provide a table of specific parameters with periods from 2 607 − 1 to 2 44497 − 1. The design of our generators is based on a combination of existing techniques, such as the WELL and dSFMT generators (Panneton et al. 2006; Saito and Matsumoto 2009 ), but we select state transitions differently from those of the original WELL generators to maintain the generation speed. We refer to these generators as 64-bit Maximally Equidistributed F 2 -Linear Generators (MELGs) with Mersenne prime period.
In practice, we often convert unsigned integers into 53-bit double-precision floating-point numbers in [0, 1) in IEEE 754 format. Our 64-bit generators are useful for this. To generate 64-bit output values, one can either use a pseudorandom number generator whose linear recurrence is implemented with 32-bit integers and then take two successive 32-bit blocks or, instead, use a pseudorandom number generator whose recurrence is implemented directly over 64-bit integers. As described later, the former method may degrade the dimension of equidistribution with v-bit accuracy, compared with simply using 32-bit output values. We consider the case of the 32-bit MT19937 generator in Section 4. For this reason, we develop 64-bit MELGs to directly generate 64-bit unsigned integers.
The article is organized as follows. In the next section, we review F 2 -linear generators and their theoretical criteria. We also summarize the framework of Mersenne Twisters. Section 3 is devoted to our main result: 64-bit MELGs. In Section 4, we compare our generators with others in terms of speeds, theoretical criteria, and empirical statistical tests. Section 5 concludes.
PRELIMINARIES

F 2 -Linear Generators
We recall the notation of F 2 -linear generators; see L' Ecuyer and Panneton (2009); for details. Let F 2 := {0, 1} be the two-element field (i.e., addition and multiplication are performed modulo 2). We consider the following class of generators.
Definition 2.1 ( F 2 -linear generator). Let S := F p 2 be a p-dimensional state space (of the possible states of the memory assigned for generators). Let f : S → S be an F 2 -linear state transition function. Let O := F w 2 be the set of outputs, where w is the word size of the intended machine, and let o : S → O an F 2 -linear output function. For an initial state s 0 ∈ S, at every time step, the state is changed by the recursion
and the output sequence is given by
We identify O as a set of unsigned w-bit binary integers. A generator with these properties is called an F 2 -linear generator.
Let P (z) be the characteristic polynomial of f . The recurrence (1) has a period of length 2 p − 1 (its maximal possible value) if and only if P (z) is a primitive polynomial modulo 2 (Niederreiter 1992; Knuth 1997) . When this value is reached, we say that the F 2 -linear generator has maximal period. Unless otherwise noted, we assume throughout that this condition holds.
Quality Criteria
Following (L' Ecuyer and Panneton 2009; , we recall two quality criteria for F 2 -linear generators. A most informative criterion for high-dimensional uniformity is the dimension of equidistribution with v-bit accuracy. Assume that an F 2 -linear generator has the maximal period 2 p − 1. We identify the output set O := F w 2 as a set of unsigned w-bit binary integers. We focus on the v most significant bits of the output and regard these bits as the output with v-bit accuracy. This amounts to considering the composition o
where the latter mapping denotes taking the v most significant bits. We define the k-tuple output function as
is the vector formed by the v most significant bits of k consecutive output values of the pseudorandom number generators from a state s 0 . Because o (k ) v is F 2 -linear, k-dimensional equidistribution with v-bit accuracy means that every element in (F v 2 ) k occurs with the same probability when the initial state s 0 is uniformly distributed over the state space S. As a criterion of uniformity, larger values of k (v) for each 1 ≤ v ≤ w are desirable (Tootill et al. 1973) . We have a trivial upper bound k
is called the dimension defect at v, and the sum of the gaps Δ := w v=1 d (v) is called the total dimension defect. If Δ = 0, the generator is said to be maximally equidistributed. For F 2 -linear generators, one can quickly compute k (v) for v = 1, . . . ,w by using lattice reduction algorithms over formal power series fields Harase 2011) . These are closely related to the lattice reduction algorithm originally proposed by Couture and L'Ecuyer (2000) .
As another criterion, the number N 1 of nonzero coefficients for P (z) should be close to p/2 (Compagner 1991; Wang and Compagner 1993) . For example, generators for which P (z) is a trinomial or pentanomial fail in statistical tests (Lindholm 1968; Matsumoto and Kurita 1996; Matsumoto and Nishimura 2002) . If N 1 is not large enough, the generator suffers a long-lasting impact for poor initialization known as a 0-excess state s 0 ∈ S, which contains only a few bits set to 1 (Panneton et al. 2006) . Thus, N 1 should be in the vicinity of p/2.
Mersenne Twisters
In general, to ensure a maximal period, testing the primitivity of P (z) is the bottleneck in searching for long-period generators (because the factorization of 2 p − 1 is required). If p is a Mersenne exponent (i.e., 2 p − 1 is a Mersenne prime), one can instead use an irreducibility test that is easier and equivalent. Matsumoto and Nishimura (1998) developed a pseudorandom number generator with a Mersenne prime period 2 p − 1 known as the Mersenne Twister. We briefly review their method. The state space S and state transition f : S → S are expressed as
where
denotes the w − r most significant bits of w i , N := p/w , and r is a non-negative integer such that p = Nw − r , so that the p bits of S are stored in an array of Nw bits in which there are r unused bits. The state transition of Equation (3) is implemented as
where w r i+1 ∈ F r 2 represents the r least significant bits of w i+1 , ⊕ is the bitwise exclusive-OR (i.e., addition in F w 2 ), and
This technique is called tempering. From this, we obtain an output sequence w N T , w N +1 T , w N +2 T , . . . ∈ O by multiplying a matrix T by the sequence from Equation (4). Matsumoto and Nishimura (1998) and Nishimura (2000) searched for 32-and 64-bit Mersenne Twisters with period length 2 19937 − 1, respectively. In Appendix B of Matsumoto and Nishimura (1998) , it is proved that these generators cannot attain the maximal equidistribution (e.g., Δ = 6750 for MT19937 in Matsumoto and Nishimura (1998) ). In fact, the state transition in Equations (3) and (4) is very simple, but the linear output transformation T in Equation (5) is rather complicated (see Matsumoto and Nishimura (1998) and Nishimura (2000) for details).
MAIN RESULT: 64-BIT MAXIMALLY EQUIDISTRIBUTED F 2 -LINEAR GENERATORS 3.1 Design
To obtain maximally equidistributed generators without loss of speed, we try to shift the balance of costs in f and o. The key technique is a suitable choice of (i) state transitions with double feedbacks proposed in Panneton et al. (2006) and Saito and Matsumoto (2009) and (ii) linear output transformations with several memory references from Harase (2009) 
where w i+N −1 and v i+1 are determined by the recursions of Equations (7) and (8), as described later. The first p − w bits are stored in an array in which r bits are unused, as for the original MTs. Note that the number of words is N − 1, not N . The remaining word v i is expected to be stored in a register of the CPU and updated as v i+1 at the next step, so that the implementation requires only a single word (see Figure 1 and Algorithm 1 in this subsection). The use of the extra state variable v i was originally proposed by Panneton et al. (2006) and refined by Saito and Matsumoto (2009) . (Note that v i,0 in Figure 1 of Panneton et al. (2006) corresponds to this variable v i .) This approach is a key technique for drastically improving N 1 and Δ. By refining the recursion formulas proposed in Saito and Matsumoto (2009) , we implement the state transition f with the following recursions:
A, B, and C are (w × w )-matrices defined indirectly as follows:
wB := w ⊕ (w σ 1 ),
where w = (w 0 , . . . ,w w −1 ) ∈ F w 2 and a ∈ F w 2 are w-bit vectors, σ 1 and σ 2 are integers with 0 < σ 1 , σ 2 < w, and "w l" and "w l" denote left and right logical (i.e., zero-padded) shifts by l bits, respectively. Note that v i is one component in a state s i ∈ S and is not the output.
To attain the maximal equidistribution exactly, we design a linear output transformation using another word in the state array, which comes from Harase (2009) . More precisely, for the righthand side of Equation (6), we consider the following linear output transformation o : S → O with one more memory reference:
Here, T 1 and T 2 are (w × w )-matrices defined by
where L is an integer with 0 < L < N − 2, σ 3 is an integer with 0 < σ 3 < w, & denotes bitwise AND, and b ∈ F w 2 is a w-bit vector. A circuit-like description of the proposed generators is shown in Figure 1 .
An equivalent formal algorithm can be described as Algorithm 1, in which, instead of shifting, we use a round-robin technique (i.e., a pointer technique) to improve the efficiency of the generation. Let w[0..N − 2] be an array of N − 1 unsigned integers with w bits, and v be a w-bit unsigned integer that corresponds to the extra state variable v i . Let x be a temporary variable and y be an output variable that are w-bit unsigned integers, respectively. Set m w −r ← (1, . . . , 1 w −r , 0, . . . , 0 r ) and m r ← (0, . . . , 0 w −r , 1, . . . , 1 r ). Here, m w −r is a bit mask that retains the first w − r bits and sets the other r bits to zero, whereasm r is its bitwise complement. Before Algorithm 1 begins, we initialize w[1], . . . , w[N − 2] , v ← initial values, not all zero, and set the pointer i ← 0. For more detailed descriptions of the initialization, see Remark 3.1. Input: w[0], w[1] , . . . , w[N − 2], v, and the pointer i
ALGORITHM 1: The Algorithm of the Proposed Generators
Output: w-bit output y Set x ← (w[i] & m w −r ) ⊕ (w[(i + 1) mod (N − 1)] &m r ). // compute (w w −r i | w r i+1 ). Set v ← xA ⊕ w[(i + M ) mod (N − 1)] ⊕ vB. // compute Equation (7). Set w[i] ← x ⊕ vC. // compute Equation (8). Set y ← w[i]T 1 ⊕ w[(i + L) mod (N − 1)]T 2 . // compute Equation (12). Increment i ← i + 1. If i ≥ N − 1, then i ← i mod (N − 1). // increment the pointer i. Return y
Specific Parameters
We search for specific parameters in the following way. First, we look for M, σ 1 , σ 2 , and a ∈ F w 2 in Equations (7)-(11) at random such that f attains the maximal period 2 p − 1. In general, because we can obtain several parameters, we choose a parameter set whose N 1 is large enough and whose output has large k (v) for v = 1, . . . ,w as far as possible in the case where we set y ← w[i] in Algorithm 1 (i.e., T 1 and T 2 are the identity and zero matrices, respectively). In the next step, we search for L, σ 3 , and b ∈ F w 2 in Equations (12)-(14) at random such that the generator is "almost" maximally equidistributed (i.e., Δ is almost 0). Finally, to obtain Δ = 0 strictly, we apply a slight modification to the bit mask b ∈ F w 2 by using the backtracking algorithm in Harase (2009) (with some trial and error). Table 1 lists specific parameters for 64-bit maximally equidistributed generators with periods ranging from 2 607 − 1 to 2 44497 − 1. We use the acronym "64-bit MELGs" for 64-bit "Maximally Equidistributed F 2 -Linear Generators" with Mersenne prime period, to the proposed generators. The code in C is available at https://github.com/sharase/melg-64.
In parallel computing, an important requirement is the availability of pseudorandom number generators with disjoint streams. These are usually implemented by partitioning the output sequences of a long-period generator into long disjoint subsequences whose starting points are found by making large jumps in the original sequences. For this purpose, Haramoto et al. (2008) proposed a fast jumping-ahead algorithm for F 2 -linear generators. We also implemented this algorithm for our 64-bit MELGs. The code is also available at the website just mentioned. The default skip size is 2 256 .
For our generators, we implemented a function to produce double-precision floating-point numbers u 0 , u 1 , u 2 , . . . in [0, 1) in IEEE 754 format by using the method found in section 2 of Saito and Matsumoto (2009) (i.e., the 12 bits for sign and exponent are kept constant, and the 52 bits of the significand are taken from the generator output). We note that this method is preferable from the viewpoint of k (v) because it does not introduce approximation errors by division.
Remark 3.1. Matsumoto et al. (2007) reported that many pseudorandom number generators have some nonrandom bit patterns when initial seeds are systematically chosen, especially when the initialization scheme is based on a linear congruential generator. To avoid such phenomena, the 2002 version of MT19937 adopted a nonlinear initializer described in equation (30) of Matsumoto et al. (2007) . For our proposed generators, we implemented a similar initialization scheme. Let w 0 ∈ F w 2 be an initial seed. To obtain an initial state s 0 = (w w −r 0 , w 1 , w 2 , . . . , w N −2 , v 0 ) ∈ S, we set w w −r 0 as the w − r most significant bits of w 0 and compute
using the usual integer arithmetic +, −, and ×, where a = 6364136223846793005 (in decimal notation) is a multiplier recommended in Knuth [1997, pp. 104] , σ 4 = 62, and w = 64. These parameters are from the 2004 version of 64-bit MT (Nishimura 2000) mentioned in Section 4. In a similar manner, we implemented an initializer whose seed is an array of integers of arbitrary length.
Remark 3.2. As pointed out in Section 3.8 of Saito and Matsumoto (2013) , it might be difficult to run the proposed generators in a multithreaded environment, such as GPUs. This is because our generators have heavy dependencies on the partial computation of the extra state variable v i in the recursion in Equation (7). Thus, in this case, it seems that the original MTs (Matsumoto and 1998; Nishimura 2000) are suitable because of the simplicity of recursion. We note that MTGP is designed specifically for this purpose (see Saito and Matsumoto (2013) for details).
PERFORMANCE
We compare the following F 2 -linear generators corresponding to 64-bit integer output sequences:
• MELG19937-64: the 64-bit integer output of our proposed generator; The first three generators have period length 2 19937 − 1. SFMT19937-64 has the period of a multiple of 2 19937 − 1 (see proposition 1 in Saito and Matsumoto (2008) for details). Table 2 summarizes the figures of merit N 1 , Δ, and timings. In Table 2 , we report the CPU time (in seconds) taken to generate 10 9 64-bit unsigned integers for each generator. The timings were obtained using two 64-bit CPUs: (i) a 3.40GHz Intel Core i7-3770 and (ii) a 2.70GHz Phenom II X6 1045T. The code was written in C and compiled with GCC using the -O3 optimization flag on 64-bit Linux operating systems. For SFMT19937-64, we measured the CPU time for the case of sequential generation (see Saito and Matsumoto (2008) for details).
MELG19937-64 is maximally equidistributed and also has N 1 ≈ p/2. These values are the best in this table. In terms of generation speed, MELG19937-64 is comparable to or even slightly faster than the MT19937-64 generators on the preceding two platforms. SFMT19937-64 without SIMD is comparable to or faster than MELG19937-64, and SFMT19937-64 with SIMD is more than twice as fast as MELG19937-64. However, Δ for SFMT19937-64 is rather large. In fact, the SFMT generators are optimized under the assumption that one will mainly be using 32-bit output sequences, so that the dimensions of equidistribution with v-bit accuracy for 64-bit output sequences are worse than those for 32-bit cases (Δ = 4188) . For this, we analyze the structure of SFMT19937 in the online Appendix.
Finally, we convert 64-bit integers into double-precision floating-point numbers u 0 , u 1 , u 2 , . . . in [0, 1) and submit them to the statistical tests included in the SmallCrush, Crush, and BigCrush batteries of TestU01 (L' Ecuyer and Simard 2007) . Note that these batteries have 32-bit resolution and have not yet been tailored to 64-bit integers. In our C-implementation, for MELG19937-64 and MT19937-64, we generate the preceding uniform real numbers by (x >> 11) * Remark 4.1. We occasionally see the use of the least significant bits of pseudorandom numbers in applications. An example is the case in which one generates uniform integers from 0 to 15 by taking the bit mask of the 4 least significant bits or modulo 16. For this, we invert the order of the bits (i.e., the ith bit is exchanged with the (w − i)-th bit) in each integer and compute the dimension of equidistribution with v-bit accuracy, dimension defect at v, and total dimension defect for inversion, which are denoted by k (v), d (v), and Δ , respectively. In this case, MELG19937-64 is not maximally equidistributed, but Δ = 4047 and d (v) is 0 or 1 for each v ≤ 11. Note that Δ takes values 9022, 8984, 21341 for MT19937-64, MT19937-64 (ID3), and SFMT19937-64, respectively. Δ of MELG19937-64 is still smaller than Δ of the other generators in Table 2 . However, as far as possible, we recommend using the most significant bits (e.g., by taking the right-shift in the preceding example) because our generators are optimized preferentially from the most significant bits.
Remark 4.2. For 32-bit generators, there have been some implementations that produce 64-bit unsigned integers or 53-bit double-precision floating-point numbers (in IEEE 754 format) by concatenating two consecutive 32-bit unsigned integers. We note that such conversions might not be preferable from the viewpoint of k (v). As an example, consider the 32-bit MT19937 generator in the header <random> of the C++11 STL in GCC (see (ISO 2012, Chapter 26.5) ). Let z 0 , z 1 , z 2 , . . . ∈ F 32 2 be a 32-bit unsigned integer sequence from 32-bit MT19937. To obtain 64-bit unsigned integers, the GCC implements a random engine adaptor independent_bit_engine to produce 64-bit unsigned integers from the concatenations as (z 0 , z 1 ), (z 2 , z 3 ), (z 4 , z 5 ), (z 6 , z 7 ), . . . ∈ F 64 2 .
To generate 53-bit double-precision floating-point numbers in [0, 1) (i.e., uniform_real_ distribution(0,1) for MT19937), the GCC implementation generates 64-bit unsigned integers (z 1 , z 0 ), (z 3 , z 2 ), (z 5 , z 4 ), (z 7 , z 6 ), . . . ∈ F 64 2 (16) by concatenating two consecutive 32-bit integer outputs and divides them by the maximum value 2 64 . The sequences in Equations (15) and (16) can be viewed as F 2 -linear generators with the state transition f 2 in Equation (3), so that we can compute k (v). Note that the sequences in Equations (15) and (16) are different: Equation (16) is obtained by exchanging the 32 most significant bits for the 32 least significant bits in each 64-bit word in Equation (15); that is, the F 2 -linear output functions are described as s i ∈ S → (o(s i ), o( f (s i ))) ∈ F 64 2 in Equation (15) and s i ∈ S → (o( f (s i )), o(s i )) ∈ F 64 2 in Equation (16). In fact, their k (v)'s are different for v = 33, . . . , 64. As a result, we have Δ = 13543 for v = 1, . . . , 64 in Equation (15) and Δ = 13161 for v = 1, . . . , 52 in Equation (16), which are worse than Δ of MT19937-64. In particular, k (12) = 623 < 19937/12 = 1661 for each case. For this reason, we feel that there is a need to design 64-bit high-quality pseudorandom number generators.
CONCLUSION
In this article, we have designed 64-bit maximally equidistributed F 2 -linear generators with Mersenne prime period and searched for specific parameters with period lengths from 2 607 − 1 to 2 44497 − 1. The key techniques are (i) state transitions with double feedbacks and (ii) linear output transformations with several memory references. As a result, the generation speed is still competitive with 64-bit Mersenne Twisters on some platforms. The code in C is available at https://github.com/sharase/melg-64. Pseudorandom number generation is a tradeoff between speed and quality. Our generators offer both high performance and computational efficiency.
