Low-discrepancy sequences, also known as " quasirandom" sequences, are 
Introduction
In general, Monte Carlo (MC) methods are algorithms which use random sampling to stochastically model systems. A specific MC method is Monte Carlo integration-a numerical integration technique to approximately evaluate a definite integral. While conventional numerical integration evaluates integrands at regularly spaced points in the integrand domain, Monte Carlo integration samples (evaluates) the integrand at multiple random points. Monte Carlo integration is used extensively to compute multi-dimensional integrals that occur frequently in physics, finance, etc. since it can be as accurate as and much faster than conventional integration. The random inputs for MC integrations and MC simulations are usually uniform random variates (RVs) provided by a pseudo-random number generator (PRNG). MC simulations are often inherently parallel and, consequently, are increasingly implemented on reconfigurable hardware such as FPGAs.
Real-world MC integration uses a finite quantity of pseudo-random numbers. A critical issue with these numbers is that they may not be perfectly 'equidistributed,' i.e. this finite quantity of pseudorandom numbers actually used are not spread uniformly throughout the domain of the integrand, which leads to poorer results. While equidistribution as well as accuracy improve as more random numbers are used, this requires longer run-times. To solve this problem, low-discrepancy sequences that are well-distributed even for small quantities were introduced [1] . Lowdiscrepancy sequences (LDS) are also called quasirandom numbers; therefore, MC methods utilizing lowdiscrepancy sequences are referred to as Quasi-Monte Carlo (QMC) methods.
It is simplest to illustrate the distinction between pseudo-random numbers and low-discrepancy sequences with an example. Figure 1a shows 100 random points from the well-known Mersenne Twister PRNG.
Each square subregion does not contain a similar number of points and clumping can be seen. Figure 1b shows 100 points from the Sobol [2] low-discrepancy sequence. Each square subregion contains roughly the same number of points, making the points more even and resulting in a higher degree of equidistribution.
Quasi-Monte Carlo integration is common in software but, to our knowledge, no hardware implementations exist in the literature in spite of the recent trend of performing MC methods in hardware. We attribute this to a corresponding lack of hardware low-discrepancy sequence generators. In this paper, we remedy this issue by presenting the first FPGA-optimized, scalable designs for three common low-discrepancy sequences: Sobol [2] , Niederreiter [3] and Halton [4] .
We begin with mathematical preliminaries on lowdiscrepancy sequences, followed by algorithms for generating three low-discrepancy sequences in binary arithmetic. FPGA-optimizations and scalability are discussed next, followed by results of our implementations with various levels of parallelization on the Xilinx Virtex-4 FPGA. We conclude by interfacing a real-world Monte Carlo integration problem to our generators and analyzing performance.
Low-Discrepancy Sequences

Defining Discrepancy
Before discussing methods for generating lowdiscrepancy sequences, let us define discrepancy [5] . 
Intuitively, the discrepancy is the difference between the proportion of points in J compared to the full unit cube I s and the volume of the 'box' J compared to I s . Figure 2 shows a 2-dimensional hypercube I s D OE0; 1/ 2 (a square) with six points. Three sub-intervals (or boxes) A, B and C are shaded. The discrepancy of box A, which contains 3 points, is calculated from eq. Similarly, the discrepancies of boxes B and C are 0:1267 and 0:8 respectively.
The worst-case discrepancy, i.e. the worst-case distribution of a set/sequence of points fx 1 ; x 2 ; : : : ; x N g 2 I s is called the star-discrepancy [5] and is defined as:
The goal of a low-discrepancy sequence is to minimize this star discrepancy.
The Halton Sequence
Historically, low-discrepancy sequences were not designed with digital arithmetic in mind. The Van der Corput sequence [4] was the first such sequence; it takes the natural numbers f1; 2; : : : g and reverses their representation in a particular base b. Figure 3 shows the first seven terms of the base-2 (binary) Van der Corput sequence; the fractional values are the reversed binary representations of the term's index in the progression.
The Halton sequence [4] generalizes the Van der Corput sequence to higher dimensions. Each dimension is represented in a different prime base b (e.g., 2; 3; 5; 7; : : : ). To generate the n-th point in a sequence, consider the base b-ary expansion of a n: 
The Sobol Low-Discrepancy Sequence
Modern low-discrepancy sequences belong to the general class of digital sequences [6] . Essentially, digital sequences are constructed using binary operations on binary expansions and are therefore well-suited to efficient implementations on computers. Further discussion of the mathematical theory behind digital sequences is beyond our scope; the interested reader may consult [6] and [7] .
The Sobol was the first digital sequence [2] . It operates in base-2 and is still well-regarded for use in quasi Monte-Carlo. We discuss the algorithm for generating a Sobol sequence based on [8] . To generate one sequence (i.e., one dimension) of N -bit lowdiscrepancy Sobol numbers, we choose odd integers m i .0 Ä i < N /, and define N direction vectors c i :
where c ij denote the binary expansion of c i . Now, choose a primitive polynomial P .x/ of degree d with co-efficients a i from the two-element finite (or Galois) field GF .2/ (i.e., binary):
These coefficients a i are used to calculate each direction vector c i as:
where˚is an exclusive-or (XOR), and the last term is c i d right-shifted by d bits. A one-dimensional N -bit wide low-discrepancy Sobol sequence x 1 ; x 2 ; : : : can be generated based on this set of direction vectors. Take the n-th term of this sequence, x n , with n D b N b N 1 : : : b 2 b 1 in binary. Then,
If the direction vectors c i are pre-computed, generating one number requires at most N lookups and N 1 XORs. This effort can be drastically reduced by considering a gray-coded representation of n [9]-a gray-coded n C 1 differs from gray-coded n in only one bit. The gray-code representation for n can be obtained by and the bit g r that flips going from n ! n C 1 is simply the position r of the least-significant zero-bit (LSZ) in n D b N : : : b 1 . Now, since n C 1 differs from n by only one bit, x nC1 'differs' from x n by only one direction vector c r . x nC1 can therefore be computed based on x n as
with only one lookup and one XOR; the complexity of finding the least-significant zero-bit r of n can also be decreased from the standard O.log n/ in hardware by using a priority encoder.
Example: Generating a Sobol Sequence
We start with the degree d D 3 primitive polynomial 
Once we have computed the direction vectors, the sequence can be generated. Setting the initial conditions for the zeroth term: x 0 D 0, n D 0, and the position of the least-significant-zero in n D 0 is r D 1. The first three numbers of this sequence are then calculated as shown in Table 3 .
Niederreiter Sequences
Niederreiter sequences are digital sequences that can be thought of as generalizing the base-2 Sobol to other bases. Asymptotically (i,e., in the limit of an infinite number of points), Niederreiter sequences have the lowest star discrepancy among all other low-discrepancy sequences. They are generated in exactly the same manner as Sobol (section 2.4), although the direction vectors are calculated differently. Just as with Sobol, in practice a multi-dimensional Niederreiter sequence consists of multiple base-2 generators with unique sets of direction vectors.
Generating Low-discrepancy Sequences in Hardware
For the Sobol/Niederreiter sequences (and for any digital sequence in general), we pre-compute the N direction vectors and store them in RAM. Generating each term of the actual sequence involves the leastsignificant zero (LSZ) calculation, a memory lookup and an XOR as in eq. (7). While this process can be made fairly efficient with techniques such as pipelining, we can exploit the LSZ, binary arithmetic, memory structures and even an application's requirements to parallelize and optimize the generation.
Computing and Exploiting the LeastSignificant-Zero (LSZ) Position
From the gray-code recursion in eq. (7), finding the position r of least-significant zero bit (LSZ) in n is critical in choosing the direction vector to compute x nC1 . For generating N -bit numbers, finding the LSZ for an index n incurs O.log n/ complexity with a sequential 'shift-and-count' algorithm. However, this is reduced to constant time through the flexibility of programmable logic: an inverted N -line-to-log.N /-line priority encoder. The least-significant zero bit in an input asserts priority over others and its decimal position is returned by the encoder. For the 32-bit numbers generated by our implementations, an input of 110100101, for example, would produce an output of 2.
Applications usually require multiple random inputs or multi-dimensional sequences. Since the LSZ for each set of terms is the same, one common LSZ-detector circuit can be used. LSZ positions r for certain n can also be pre-coded; for p 2 f 1; 2; : : : ; log 2 N g, and
the LSZ position r Á p and the required direction vector c p is a 'constant' that can be pre-coded instead of computing the LSZ and looking the c r up. This works in practice because, for example, every even n has LSBs : : : xx0 and therefore LSZ r D 1, every fourth n D 1; 5; : : : has LSBs : : : xx01 and LSZ r D 2, etc. The sequence terms x n for these n can be computed with their pre-coded direction vectors in parallel with the x n 's that require an LSZ and memory lookup for their direction vector. We call this scheme p-way parallelization. While the user can control the degree of parallelization by choosing the c p 's to pre-code, we note that asymptotically as p increases, the n satisfying (9) decrease geometrically, giving a maximum possible degree of parallelization of 2 C 1 C P N pD1 1 2 p 4. Figure 4 shows a 2 parallel generator where the direction vector for every even term c 1 is pre-coded (i.e., p D 1).
Architectural Optimizations for Further Parallelization
Memory (RAM) in contemporary FPGAs, whether constructed from logic/LUTs ('distributed' RAM) or SRAM blocks (BRAM), has dual asynchronous I/O ports. In that case, throughput can be doubled by duplicating the LSZ circuit and computing future terms of the sequence. Since generation only requires reads, the direction vectors c i can be duplicated across multiple dual-port RAMs for a higher effective number of ports; we call this o-port parallelization. Figure 5 illustrates a 2-port, 2-way parallelization that generates 4 terms in one cycle. If the index n is generated by a count-by-4 counter, i.e. n D 0; 4; : : : , the four terms are generated as: The optimized circuits outlined so far address the generation of one-dimensional sequences. To generate multiple dimensions, these circuits are replicated. Depending on the balance between throughput and area-efficiency that is desired, especially if physical block RAMs are being used to store the direction vectors instead of distributed RAM, the memories can contain two different sets of direction vectors and simultaneously generate two sequences because of the availability of the dual-ports. Figure 6 shows a 2-dimensional, 2-way parallelized generator.
Virtex-4 Implementations
We implemented a 2-D Halton generator (in bases 2 and 3) as well as 2 /4 parallelized 2-D and 6-D Sobol/Niederreiter generators on the Xilinx Virtex 4-SX FPGA (XC4VSX35), synthesized with Synplify Pro 9.1. 32-bit numbers were generated, with distributed RAM (LUTs) used as memory for the direction vectors (32 32 D 1024-bits per dimension). The results, including resource usage, maximum frequency and throughput are shown in Table. 4; the throughput/area-efficiency increases from 2 to 4 parallelization.
Comparison with Pseudo-RNGs. Since no hardware LDS implementations exist, we compare relative complexities with the state-of-the-art parallelized pseudo- random number generators (Mersenne Twister) [10] ; both the resource usage (area) and throughput of our low-discrepancy sequence generators are comparable to parallelized hardware PRNGs.
Application: Quasi-Monte Carlo for 3-D IC Partial Inductance Extraction
Inductances between interconnects (traces) in integrated circuits (IC) are an important limiting factor as designers seek to scale these chips to ever-higher clock frequencies. Modeling these inductances is an indispensable part of high-frequency IC design. Analytically deriving the mutual inductance between a set of interconnects of arbitrary geometry involves a double-volume (i.e., 6-dimensional integral) and precise knowledge of the instantaneous current densities through each of the interconnects. Instead, the mutual inductance can be numerically approximated through Monte Carlo integration [11] .
For example, consider deriving the mutual inductance M 6 for two 3-D interconnects placed parallel to each other, with dimensions and separations as shown in Figure 7 [11] . The analytical solution (M 6 ) for this problem is known, which allows us to benchmark Monte Carlo approximations. 
where is a constant and we are effectively computing the inverse of the Euclidean distance (i.e., 1= p .r i r j / 2 ) between each point pair. We implemented the M 6 computation on the Virtex-4, using DSP slices for squaring the distances and a pipelined CORDIC for the square root. QuasiMonte Carlo integration was performed using 1,000 low-discrepancy Sobol and Niederreiter points (16-bit fractions). Table 5 shows the M 6 values computed and compares them to the analytical results as well as results from a standard MC integration using 1,000 pseudorandom numbers; the Quasi-Monte Carlo results are over an order of magnitude more accurate than those from conventional Monte Carlo.
Conclusion
Low-discrepancy sequences are essential for QuasiMonte Carlo (QMC) methods such as multi-dimensional integration; quasi-Monte Carlo delivers more accurate results in a shorter time than conventional Monte Carlo. While MC simulations are increasingly parallelized and implemented on reconfigurable hardware, this is not true for QMC due to the lack of hardware low-discrepancy sequence generators.
We present the first (to our knowledge) designs for low-discrepancy sequence generation in hardware. Our techniques for optimization and parallelization can be applied to the entire class of digital low-discrepancy sequences; we implemented three specific sequences (Halton, Sobol and Niederreiter) with varying degrees of parallelization on the Virtex-4 FPGA. We also demonstrated the supremacy of QMC over MC with a real-world example involving the extraction of mutual inductances in integrated circuit interconnects. Future work includes an automated GUI-based program that creates synthesizable VHDL for a generator given the parameters for a specific digital sequence.
