Abstract-We present a hardware Gaussian noise generator based on the Box-Muller method that provides highly accurate noise samples. The noise generator can be used as a key component in a hardware-based simulation system, such as for exploring channel code behavior at very low bit error rates, as low as 10 À12 to 10 À13 . The main novelties of this work are accurate analytical error analysis and bit-width optimization for the elementary functions involved in the Box-Muller method. Two 16-bit noise samples are generated every clock cycle and, due to the accurate error analysis, every sample is analytically guaranteed to be accurate to one unit in the last place. An implementation on a Xilinx Virtex-4 XC4VLX100-12 FPGA occupies 1,452 slices, three block RAMs, and 12 DSP slices, and is capable of generating 750 million samples per second at a clock speed of 375 MHz. The performance can be improved by exploiting concurrent execution: 37 parallel instances of the noise generator at 95 MHz on a Xilinx Virtex-II Pro XC2VP100-7 FPGA generate seven billion samples per second and can run over 200 times faster than the output produced by software running on an Intel Pentium-4 3 GHz PC. The noise generator is currently being used at the Jet Propulsion Laboratory, NASA to evaluate the performance of low-density parity-check codes for deep-space communications.
D
UE to recent advances in field-programmable technology, hardware-based simulations are getting increasing attention due to their huge performance advantages over traditional software-based methods. Naive hardware implementations can be slow and can generate misleading results. Hence, care must be taken when mapping algorithms into hardware. In particular, the resulting hardware design should meet the performance targets while making efficient use of the available resources and properly managing errors due to, for instance, finite precision effects.
The availability of normally distributed random samples is essential in a large number of computationally intensive modeling and simulation applications, including channel code evaluation [1] , molecular dynamics simulation [2] , and financial modeling [3] . Our work is originally motivated by ongoing advances in communications systems involving channel codes. Notably, turbo codes [4] and low-density parity-check (LDPC) codes [1] are currently the focus of intensive research in the coding community due to their ability to approach the Shannon bound very closely. Software-based simulations can take several days to several weeks when the behavior at very low bit error rates (BERs) of such codes is being examined. Hardware-based simulations equipped with a fast and accurate noise generator offer the potential to speed up simulation by several orders of magnitude. Transferring software-generated noise samples to the hardware device is highly inefficient and can be a performance bottleneck, hence it is desirable to have the noise generator on the hardware device itself.
For simulations involving large numbers of samples, the quality of the noise samples plays a key factor. Deviations from the ideal Gaussian probability density function (PDF) can degrade simulation results and lead to incorrect conclusions. Hence, we believe the presence of a rigorously designed and characterized hardware Gaussian noise generator is crucial. Attention needs to be paid to the samples that lie at the tails of the Gaussian PDF, i.e., samples that lie multiples of (standard deviations) away. These samples are rare in a relative sense, but they are important because they can cause events of high interest.
The principal contribution of this paper is a hardware Gaussian noise generator based on the Box-Muller method and the error analysis of its elementary functions. It generates 16-bit noise samples accurate to one unit in the last place (ulp) up to 8:2, which models the true Gaussian PDF accurately for a simulation size of over 10 15 samples. Generally, when evaluating channel codes, one needs 100 to 1,000 bits in error to draw conclusions on a simulation with enough confidence. Hence, with 10 15 samples, one can examine channel code behavior for bit error rates as low as 10 À12 to 10 À13 . The noise generator is relatively small, while producing 750 million samples per second at a clock speed of 375 MHz on a Xilinx Virtex-4 XC4VLX100-12 FPGA. The highlights of this paper include:
. a hardware architecture for the Box-Muller method, . piecewise polynomial based function approximation units with range reduction, . accurate error analysis and bit-width optimization leading to a guaranteed maximum absolute error bound of 1 ulp, . exploration of hardware implementation of the proposed architecture targeting both advanced high-speed FPGAs and low-cost FPGAs, and . the only reported Gaussian noise generator with a formal error analysis. The rest of this paper is organized as follows: Section 2 covers background material and previous work. Section 3 provides an overview of the proposed design flow and the Box-Muller hardware architecture. Section 4 describes how we evaluate the elementary functions associated with the Box-Muller method. Section 5 introduces the MiniBit bitwidth optimization approach. Section 6 describes the error analysis and bit-width optimization procedures for our BoxMuller architecture. Section 7 describes technology-specific implementation of the hardware architecture on Xilinx FPGAs. Section 8 discusses evaluation and results and Section 9 offers conclusions.
BACKGROUND
In simulation environments, digital methods for generating Gaussian random variables are preferred over analog methods. Analog components allow truly random numbers, but are highly sensitive to environmental changes such as temperature and provide low throughputs of tens of kilobits per second. Such methods are often used for generating random seeds in cryptographic applications [5] . In contrast, digital methods are more desirable due to their robustness, flexibility, and speed. Although the resulting number sequences are pseudorandom as opposed to truly random, the period can be made sufficiently large such that the sequences never repeat themselves, even in the largest practical simulations.
The majority of the digital methods for generating Gaussian random variables are based on transformations on uniform random variables [6] . Popular methods include the Ziggurat method [7] , the inversion method [8] , the Wallace method [9] , and the Box-Muller method [10] . Ziggurat is a class of rejection-acceptance methods, meaning that the output rate is not constant, making it less desirable to a hardware simulation environment. The inversion method involves the approximation of the inverse Gaussian cumulative distribution function (CDF), which is highly nonlinear, making it less suitable for a fixed-point hardware implementation. In contrast to all other methods, Wallace does not require the evaluation of elementary functions; new noise samples are generated by applying linear transformations to the previous pool of noise samples. Unfortunately, due to the feedback nature of the Wallace method, correlations can occur between successive transformations. These correlations can be made insignificant for any given simulation environment through proper parameter choice, but the need to manage them represents an additional complication inherent in the Wallace method. Our choice for hardware implementation is the Box-Muller method, which transforms two uniformly distributed variables into two normally distributed variables through a series elementary function evaluations.
Recently, there have been notable research contributions on the hardware implementations of the methods discussed above. Boutillon et al. [11] were the first to realize a hardware Gaussian noise generator based on the BoxMuller algorithm and the central limit theorem. The central limit theorem is employed to overcome approximation errors of the mathematical functions of the Box-Muller method. Their design occupies 437 logic cells on an Altera Flex 10K1000EQC240-1 FPGA and has a throughput of 24.5 million samples per second. Xilinx [12] have released an IP core and Fung et al. [13] have implemented an ASIC chip based on Boutillon et al.'s architecture. The former has a throughput of 245 million samples per second on a Xilinx Virtex-II XC2V1000-6 FPGA, whereas the latter has a throughput of 182 million samples per second on a six metal layer 0.18m ASIC. Unfortunately, there are two drawbacks of this architecture: 1) It is limited to noise samples with magnitude less than 4 and 2) statistical tests reveal that the quality of the noise samples is poor [14] . Unlike the two Box-Muller architectures above, the design presented in this paper employs highly accurate elementary function evaluation techniques, eliminating the need for the central limit theorem altogether.
In [15] , we presented an architecture also based on the Box-Muller method and central limit theorem, but with more sophisticated function approximation techniques, resulting in significantly higher quality noise samples. The key differences between our previous work [15] and the work presented here are:
1. the way the mathematical functions are evaluated, 2. the accuracy of the noise samples, 3. the noise quality in the tails, as expressed by the maximum attainable multiple, and 4. the hardware efficiency, as illustrated in Table 2 . In [15] , we evaluated the nonlinear functions of the BoxMuller method using degree one piecewise polynomials with nonuniform segmentation. The approximation and quantization errors were found to be high, forcing us to use the central limit theorem to improve noise quality. This resulted in an output rate of one sample per clock cycle. In addition, the maximum attainable multiple was 6.7. By contrast, in this work, we evaluate the elementary functions one by one and rigorously analyze the errors involved and, in doing so, are able to obtain a maximum multiple of 8.2. This added performance is critical for many large communications simulations. The central limit theorem is not required, resulting in an output rate of two samples per clock. Both samples are guaranteed to be accurate to 1 ulp, while using minimal bit-widths for the various signals in the data paths. In terms of hardware efficiency on an FPGA, this work achieves more than five times the throughput and occupies 40 percent less logic than [15] .
In [14] and [16] , we also describe hardware architectures of the Wallace and the Ziggurat methods. We provide detailed comparisons between different architectures in Section 8.
We choose a Xilinx Virtex-4 XC4VLX100-12 FPGA to realize our noise generator hardware architecture. Xilinx Virtex-4 FPGAs have the following three main types of resources: 1) user configurable elements known as "slices," 2) storage elements known as "block RAMs," and 3) multiply-and-add units known as "DSP slices." A single slice is composed of 2.25 logic cells, which is the fundamental building block of Xilinx FPGAs. A logic cell is comprised of a 4-input lookup table, which can also be used as a 16 Â 1 RAM or a 16-bit shift register, a multiplexor, and a register. A slice contains additional resources, such as multiplexors and carry logic, and, therefore, a slice is counted as being equivalent to 2.25 logic cells. Each block RAM can store 18Kb of data and a DSP slice can perform 18-bit by 18-bit multiplication followed by 48-bit addition. The Xilinx Virtex-4 XC4VLX100-12 device contains 49,152 slices, 240 block RAMs, and 96 DSP slices.
DESIGN FLOW AND ARCHITECTURE
This section provides an overview of our design methodology and introduces the operations involved in the BoxMuller method. We also discuss the specifications given for our noise generator and their implications for the BoxMuller method.
Design Flow
The design flow of our Box-Muller implementation is illustrated in Fig. 1 . We first start by devising the specifications for the noise generator, which includes the periodicity of the samples, noise precision requirements, and throughput requirements. Although software implementations often use floating-point arithmetic, fixed-point arithmetic is often preferred for hardware implementations due to its area efficiency.
In order to meet the accuracy requirements of the noise samples, careful error analysis needs to be performed on the fixed-point data paths to determine the minimal bit-widths required. We generate the polynomial coefficient tables using MATLAB and the MATLAB Symbolic Toolbox, which contains the kernel of the MAPLE linear algebra package. After the bit-widths have been determined and the polynomial coefficient tables have been generated, we implement MATLAB and C models of the noise generator. These software models are programmed to be bit-accurate to the actual hardware realization, by emulating the quantization effects of the arithmetic operations.
Through comparisons with IEEE double-precision floating-point arithmetic, tests are conducted to check if the accuracy requirements of the noise samples are met. After the software implementations are finalized, we implement the hardware design using Xilinx System Generator [17] , which is a MATLAB Simulink library and can generate Verilog or VHDL. The hardware design is verified carefully to ensure that it behaves the same way as the software models.
Architecture for the Box-Muller Method
The Box-Muller method starts with two independent uniform random variables, u 0 and u 1 , over the interval ½0; 1Þ. The following mathematical operations are performed to generate two samples, x 0 and x 1 , of a Gaussian distribution Nð0; 1Þ.
The above equations lead to an architecture depicted in Fig. 2 . Although traditional linear feedback shift registers (LFSRs) are often sufficient as a uniform random number generator (URNG), Tausworthe URNGs [18] are fast and occupy less area. Furthermore, they provide superior randomness when evaluated using the Diehard random number test suite [19] as demonstrated in [16] . Our Tausworthe URNG follows the algorithm presented by L'Ecuyer [20] , which combines three LFSR-based URNGs to obtain improved statistical properties. It generates a 32-bit uniform random number per clock and has a large period of 2 88 ð% 10 25 Þ. Its implementation in C code is illustrated in Fig. 3 .
The implementation of the three function evaluation units, logarithm, square root, and the sine and cosine (sin/ cos) unit, will all be analyzed in Section 4 and Section 6. The two numbers shown in brackets for each signal in Fig. 2 indicate the total bit-width and the fraction bit-width of the signal. The derivation of these bit-widths will be discussed in detail in Section 6.
Specifications
Two key specifications are set for our noise generator: periodicity of 10 15 and 16-bit noise samples. In order to meet the periodicity requirement, the URNGs should have a period of at least 10 15 . But, at the same time, the noise samples should follow the ideal Gaussian distribution as closely as possible over the period. By examining the normal distribution Nð0; 1Þ using MAPLE, we observe that we need to be able to represent up to 8:1 for a population of 10 15 samples. In other words, the probability of the absolute value of a single sample from that population being larger than 8:1 is less than 0.5 (0.444 to be exact). By examining (1)- (6), the maximum value is determined by the smallest value of f, which in turn is determined by the smallest value of u 0 , i.e.,
We use 48 bits for u 0 , which gives a minimum value of
, meeting the requirement in (7). This also means that the maximum value we can attain is 8.2. There is no logical way to determine the number of bits required for u 1 , except that it should have a good resolution. We use 16 bits for u 1 , which is the same bit-width as the noise samples. Hence, conveniently, two 32-bit Tausworthe URNGs are utilized to provide the 48 bits and 16 bits required for u 0 and u 1 .
We use two's complement fixed-point representation for the noise samples. The maximum absolute value of 8.2 means that 5 bits are sufficient for the integer bit-width (IB), leaving us with 11 bits for the fraction bit-width (FB). We would like to represent the noise samples as accurately as possible within the given 16 bits. Our criterion for evaluating the accuracy is ulp. The ulp of a fixed-point number with 11 bits fraction is 2
À11 . There are two main types of rounding commonly used in computer arithmetic: faithful rounding and exact rounding.
Faithful rounding means that results are accurate to 1 ulp (rounded to the nearest or next nearest) and exact rounding means that results are accurate to 1=2 ulp (rounded to the nearest). Exact rounding is difficult to achieve due to a problem known as the table maker's dilemma [21] and has a rather large area penalty [22] , hence we opt for faithful rounding in this work. Therefore, for every noise sample, the maximum absolute error compared against infinite precision should be less than or equal to 2 À11 . For the purposes of this analysis, we regard IEEE double-precision floating-point to be "infinitely precise" since it is many orders of magnitude more accurate than the precision we are aiming for.
FUNCTION EVALUATION
Throughout this paper, we shall denote the bit-width of a signal x as B x , the integer bit-width as IB x , the fraction bitwidth as F B x , and its associated error as E x .
Consider an elementary function fðxÞ, where x and fðxÞ have a given range ½a; b and precision requirement. The evaluation fðxÞ typically consists of three steps [23] :
1. range reduction: reducing x over the interval ½a; b to a more convenient y over a smaller interval ½a 0 ; b 0 , 2. function approximation on the reduced interval, and 3. range reconstruction: expansion of the result back to the original result range. Our function evaluation steps for À2 lnðu 0 Þ, ffiffi ffi e p , sinð2u 1 Þ, and cosð2u 1 Þ are based on the methods presented in [22] and [24] . If a variable x is separated into a sign bit S x , a mantissa M x , where M x ¼ ½1; 2Þ, and an exponent E x , i.e., x ¼ ðÀ1Þ
Sx Â M x Â 2 E x , the following mathematical identities are used for the evaluation of the logarithm and the square root when S x ¼ 0:
We observe from the equations above that the range reduction steps of the logarithm and the square root are essentially fixed-point to floating-point conversions. For the evaluation of sin/cos, we exploit the symmetric and periodic behavior of the two functions. As illustrated in Fig. 4 , only cosðxÞ over x ¼ ½0; =2Þ needs to be approximated. To approximate the functions over the linear range, we use piecewise polynomials with uniform segments. Polynomials are evaluated using Horner's rule:
where x is the input, d is the polynomial degree, and C i are the polynomial coefficients. The hardware architecture for a degree two piecewise polynomial is shown in Fig. 5 . A degree one architecture is similar but without the first multiply-and-add unit. The input interval is split into 2
equally sized segments. The B x A leftmost bits of the argument x serve as the index into the table, which holds the polynomial coefficients for that particular interval. Since x A is implicitly known for a given interval, we use x B instead of x for the polynomial arithmetic to reduce the size of the operators. We scale x B to be over ½0; 1Þ to simplify the error analysis phase in Section 6. If x ¼ ½0; 1Þ, this would involve masking out the bits corresponding to x A and shifting x by B x A bits to the left. The polynomial coefficients are found in a minimax sense that minimize the maximum absolute error [23] using MAPLE. However, the coefficients are generated with the assumption that x will be used for the polynomial arithmetic. If we want to use x B instead, the coefficients need to be transformed. We consider a degree one polynomial to illustrate the transformation process, but the same principle can be applied to other degrees. For a given segment, we obtain x B by
Rearranging the equation, we obtain
A degree one polynomial is represented by the equation
and, by substituting (12) into (13), we get
By examining the first and zeroth order terms, the new transformed polynomial coefficients
With the proposed architecture in Fig. 5 
The main challenge is to find the minimal bit-widths for each signal while meeting the output error constraints. We discuss how we achieve this in the next two sections.
THE MINIBIT BIT-WIDTH OPTIMIZATION APPROACH
In this section, we briefly describe the fundamental principles behind the MiniBit bit-width optimization approach [25] , a technique for optimizing fixed-point signals using analytical error expressions with a guaranteed maximum error bound. In digital systems, signals need to be quantized to finite precisions. In order to minimize area and meet the accuracy requirement, each signal should use the minimal bit-width, while the final output signals should obey the accuracy requirements. There are two main ways to quantize a signal: truncation and round-to-nearest. Truncation and round-to-nearest can cause a maximum error of 2 ÀF B (1 ulp) and 2 ÀF BÀ1 (1=2 ulp), respectively. Truncation chops bits off the least significant parts and requires no extra hardware resources. Although round-to-nearest requires a small adder, we opt for round-to-nearest since it allows for smaller bit-widths than truncation.
Letã a be the quantized version and Eã a be the error of the signal a. Thus,ã
For addition/subtraction operations y ¼ a AE b, the error Eỹ y at the output y is given by 
where 2 ÀF Bỹ yÀ1 is the rounding error ofỹ y. For multiplication, we getỹ
Eỹ y would be at its maximum when a and b are at their maximum absolute values. Consider a slightly more complex example: y ¼ a Â b þ c. We want to quantize signals after each operation, hence, we compute the example in two steps:
The error Eỹ y is given by
For faithful rounding, the maximum output error maxðEỹ y Þ needs to be less than or equal to 1 ulp, i.e.,
Suppose we wantỹ y's fraction to be accurate to 16 bits, with a, b, and c being constants rounded to the nearest. Assume that the maximum absolute values are maxðjajÞ ¼ 2, maxðjbjÞ ¼ 4. Note that maxðjcjÞ is irrelevant to this error analysis. Then, using (21) and (22), we get
The error terms are functions of the fraction bit-widths of the signals and the maximum value of the signals, hence the optimization problem is to find the minimal fraction bitwidths for the signalsã a,b b,c c, andt t while satisfying the inequality. For small problems like this, enumeration may be appropriate, but, for complex equations involving a large number of signals, methods such as simulated annealing can be used [25] . 
ERROR ANALYSIS AND BIT-WIDTH OPTIMIZATION FOR NOISE GENERATOR
In this section, we describe how the MiniBit error expressions are used to optimize the bit-widths of the signals in our Box-Muller architecture using a bottom-up approach. We shall discuss the fraction bit-width (FB) optimization problem only, which we have found to be significantly more challenging than integer bit-width (IB) optimization. In order to find the optimal IBs, we analyze the signals carefully and examine their dynamic ranges. This manual optimization process is found to be feasible for the noise generator since the dynamic ranges of the signals are straightforward and predictable. We make the assumption that all function evaluations are faithfully rounded throughout. The evaluation steps for the Box-Muller architecture are described in the pseudocode in Fig. 6 . 
Error Analysis at the Output
The accuracy requirement of the noise samples x 0 and x 1 is faithful rounding, i.e., they should be accurate to 1 ulp. Knowing that the samples have 11 fractional bits, the requirements are
We shall consider the data path to x 0 only since the error analysis is identical for x 1 . Assuming we faithfully round f and g 0 , we get the following error expression:
We are interested in the worst-case error, which occurs when g 0 and f are at their maximum of 1 and 8.157. Hence, we get
Given that f has a deeper computation chain than g 0 , we would prefer that its computation requirements are less precise so that fewer bits are needed for its implementation.
Through an exhaustive search, we find that F B f ¼ 13 and F B g0 ¼ 15 are the minimal bit-widths that meet the inequality in (26) . Now that we have determined the bit-widths for f and g 0 , we can move on to the square root unit and the sin/cos unit. We shall first perform analysis for the sin/cos unit since it is easier to analyze due to the shorter computation chain.
Error Analysis for the Sin/Cos Unit
The sin/cos unit corresponds to lines 37-56 in Fig. 6 . The range reduction and range reconstruction steps of the sin/ cos unit are exact, i.e., there are no quantization steps involved. Hence, we only need to worry about the approximation steps. Because of the periodic and symmetric nature of sinðxÞ and cosðxÞ, as shown in Fig. 4 , we only approximate cosðxÞ over ½0; =2Þ. From the 16-bit input u 1 , the most significant two bits are used to select a random quadrant from one of the four quadrants of sinðxÞ and cosðxÞ and the remaining 14 bits are used for the polynomial approximation. Two variables, x g a and x g b , are obtained from the 14 bits, which are used to compute g 0 and g 1 .
The approximations required are cosðx g a =2Þ and cosðx g b =2Þ. Since the two approximations share the same data path characteristics, we shall discuss cosðx g a =2Þ only. In the following analysis, x g a will be referred to as x g for simplicity. We first need to decide what degree polynomial to use for the approximation: A low degree polynomial will require fewer computations at the expense of a larger table. In addition, shallower computation chains will accumulate fewer quantization errors. Hence, we would like to use the lowest degree possible as long as the table size is reasonable. This will, of course, depend on the function and the precision we are aiming for.
In Section 6.1, we derived that F B g0 ¼ 15. Knowing that the range reconstruction step only involves sign changes, we also need F B yg ¼ 15. The signal y g needs to be faithfully rounded to 15 bits fraction, i.e., E y g 2 À15 . When approximating a function with piecewise polynomials, we would like to know the minimal number of segments required for a given input range, polynomial degree, and output accuracy. A MATLAB program is written which uses MAPLE to compute the minimax polynomial coefficients and maximum approximation error for a given segment. It incrementally increases the number of segments by a power of two until all segments meet the user-specified output accuracy.
When approximating y g ¼ cosðx g =2Þ with an accuracy of 2 À15 , our MATLAB program reports that we need 128 segments for a degree one polynomial and 16 segments for a degree two polynomial. As a rough estimate of the coefficient table size, we assume that the coefficients have the same bit-width as the output y g , i.e., 16 bits. From (15), we can infer that table sizes will be roughly 4,096 and 768 bits for degree one and two polynomials, respectively. Given that a block RAM on Virtex-II and Virtex-4 FPGAs is 18Kb, enough to fit the 4,096 bits for degree one polynomials, we opt for a degree one polynomial. Based on the above information and our experience with piecewise polynomials, a good rule of thumb is to use degree one polynomials for target precisions lower than 20 bits and degree two polynomials for above 20 bits. Fig. 7 shows the data path for a degree one polynomial approximation to cosðx g =2Þ. It corresponds to line 48 and line 49 in Fig. 6 . Knowing that E yg 2 À15 and using the MiniBit techniques from Section 5, we get the following error expression:
Note that the 2 À16 term is the rounding error of y g . Because the function is approximated by a polynomial, there is an inherent approximation error E approx g , regardless of quantization effects. For this approximation, the maximum polynomial approximation error is reported to be 9:26458 Â 10 À6 by MAPLE. The polynomial coefficients are rounded to the nearest, hence
The error at D0 g is expressed as
The maximum value of x g B is one and, since x g B has no errors, we get and, by substituting (30) into (28), we obtain
We find that F B C1 g ¼ 18, F B C0 g ¼ 18, and F B D0 g ¼ 18 are the minimal bit-widths that satisfy the inequality. In the actual ROM, we store the two coefficients C1 g and C0 g as integers. The coefficients for C1 g all contain six leading zeros in the fraction part and some values of C0 g turn out to be slightly larger than one. Moreover, all values of C1 g and C0 g are found to have the same sign. Hence, the bit-width required for C1 g is 12 bits (with the six redundant leading six bits eliminated) and C0 g is 19 bits (with one extra bit to cover the integer part). As mentioned earlier, 128 segments are required for a degree one approximation to cosðx g =2Þ and, hence, the total table size needed for the sin/cos unit is ð12 þ 19Þ Â 128 ¼ 3; 968 bits.
Error Analysis for the Square Root Unit
The square root unit corresponds to line 22 to line 35 in Fig. 6 and is derived from (9) . It is perhaps the most challenging error analysis step since it suffers from propagation errors at its input produced by the logarithm unit. We shall discuss why the propagation errors make the error analysis difficult and how we overcome this problem.
Since u 0 is over ½0; 1 À 2 À48 , we know that the output of the logarithm unit e ¼ À2 lnðu 0 Þ will be over ½0; 66:54. u ¼ 0 is treated as a special case in which we simply set e ¼ 0. Since e is unsigned, the number of its integer bits will be seven. The leading zero detector (LZD) returns the number of leading zeros from the most significant bit. Since e has seven integer bits and we want x 0 f to be over ½2; 4Þ (i.e., x 0 f has two integer bits), we subtract the output of the LZD from five in order to obtain the number of bits to shift e.
The minimum value of exp f occurs when e has just a one in its least significant bit. Hence,
Looking at line 34 in the range reconstruction step, we have the following relationship between exp f and exp
The maximum value of exp f is five (i.e., when there are no leading zeros in e), hence the maximum value of exp 0 f will be
where exp f ½0 denotes the least significant bit of exp f . This shows how much the result at y f can be shifted to the left, which would amplify its error. By examining line 35 of the range reconstruction step, we get the following error relationship between f and y f :
and, from Section 6.1, we know that f should be faithfully rounded to 13 fraction bits. Hence, we get the following inequality:
From the expression above, we can see that the accuracy requirement of y f will depend on the value of exp 0 f . Based on (34), we know that maxðexp 0 f Þ ¼ 3, therefore y f should be at least accurate to 2 À16 . We shall make the assumption that e is faithfully rounded, i.e., maxðE e Þ ¼ 2 ÀF Be . Looking at line 26 of the range reduction step, we get the following error relationship between e and x 0 f :
and, therefore,
As discussed in Section 4, E x f B is a masked and leftshifted version of E x f by B x f A . The left-shifting by B x f A will amplify the error by 2 B x f A . Hence, we get the following error expression at x f B :
Based on the derivations above, we can now consider the polynomial approximation part. Since the accuracy requirement of y f is 16 bits at least, we opt for a degree one polynomial. For both cases, when exp f ½0 ¼ 0 and exp f ½0 ¼ 1, which correspond to the intervals ½2; 4Þ and ½1; 2Þ, 64 segments are required, meaning that B x f A ¼ 6.
Although we use two independent coefficient tables for the two cases, a single multiply-and-add unit for the degree one polynomial arithmetic can be shared between them. Using a similar analysis to the sin/cos approximation unit with the exception that the input x fB contains a propagated error E x f B and noting that maxðx f B Þ ¼ 1, we get the following error expression at the output y f :
Note that 2 À17 is the rounding error at y f . Recalling that minðexp f Þ ¼ ÀðF B e þ 1Þ a n d B x f A ¼ 6, t h e p r o d u c t C1 f E x f B is most likely to be the dominating error factor. Substituting (36) and (39) into (40) and considering only the dominating error factor, we get
Let us consider the case when exp f ½0 ¼ 0 first, i.e., when F B e is odd. It is found that maxðC1 f Þ ¼ 0:01101 when exp f ½0 ¼ 0. E y f will be at its maximum when exp f is at its minimum of ÀðF B e þ 1Þ. Using (41) 
Based on the two derivations above, we choose the case when F B e is even, which is 24 bits. Now that we have derived F B e , we need to compute the minimal bit-widths of the signals C1 f , C0 f , and D0 f in the polynomial arithmetic. E approx f is reported to be 5:32722 Â 10 À6 and 3:76332 Â 10 À6 for the cases when exp f ½0 ¼ 0 and exp f ½0 ¼ 1, respectively. Since the error requirement at y f can vary depending on the value of exp f , we shall consider the two extreme cases when exp f is at its minimum and maximum. At minðexp f Þ ¼ À25, we get
and, at maxðexp f Þ ¼ 5, we get
Given that F B C1 f is likely to be around 16 bits or larger, (50) places more stringent bit-width requirements on the signals than (49). Through enumeration on (50), we find that F B C1 f ¼ 18, F B C0 f ¼ 19, and F B D0 f ¼ 19 are the minimal bitwidths. As in the sin/cos units, some coefficients have leading zeros and some are larger than one. Hence, when we store the coefficients as integers, we need 12 bits and 20 bits for C1 f and C0 f , respectively. We need to store two tables for the square root unit: one each for the intervals ½2; 4Þ and ½1; 2Þ. Recalling that 64 entries are required for each table, the total table size is ð12 þ 20Þ Â 64 Â 2 ¼ 4; 096 bits.
Error Analysis for the Logarithm Unit
The logarithm unit corresponds to line 6 to line 20 of Fig. 6 . Examining the range reconstruction steps in line 18 to line 20, we find that there are the two intermediate signals: ln2 (which stores the constant lnð2Þ) and e 0 . We first need to determine the fraction bit-widths of these two signals and the output of the polynomial arithmetic y e . Noting that maxðexp e Þ ¼ B u 0 ¼ 48 and constant ln2 is rounded to the nearest, the following error relationship exists between the signals:
In Section 6.3, we derived the fraction bit-width of e to be 24 bits. Assuming y e is faithfully rounded, using (51), we get
Through enumeration, the minimal fraction bit-width are F B ln2 ¼ 32, F B e 0 ¼ 28, and F B ye ¼ 27.
Since y e needs to be accurate to 27 bits, degree one polynomials will likely require a large number of segments. Hence, we choose degree two polynomials for this approximation. The error analysis for the degree two polynomial arithmetic is performed in the same manner as the sin/cos unit case, where the input to the polynomial arithmetic contains no errors. After analysis, the minimal bit-widths for the three polynomial coefficients C2 e , C1 e , C0 e are found to be 13, 22, and 30 bits, respectively. The approximation requires 256 segments, hence the total table size for the logarithm unit is ð13 þ 22 þ 30Þ Â 256 ¼ 16; 640 bits.
IMPLEMENTATION
This section discusses how our Box-Muller hardware architecture is mapped into FPGA technology.
As can be seen in Fig. 3 , the operations involved in the Tausworthe URNG are rather straightforward and a series of XOR and constant shifts are needed. We also need multiplexors to reload the original seeds when the reset signal in Fig. 2 is set high. The two Tausworthe URNGs occupy just 150 slices and are fast enough not to require any pipeline stages.
To implement the leading zero detectors (LZDs) of the logarithm and the square root units, we choose the methodology proposed by Oklobdzija [27] . It allows us to implement any size LZDs using a 2-bit LZD as a basic building block in a hierarchical and modular manner. Fig. 8 shows a 2-bit and a 4-bit LZD. This LZD architecture occupies little area on the device, for instance, the 48-bit LZD used in the logarithm unit occupies just 46 slices. The LZD is fast and, hence, it is used combinatorially. Another important component of our design is the barrel shifter, which is required at the range reduction step of the logarithm unit and at the range reduction and reconstruction steps of the square root unit. We employ the logical barrel shifter described in Pillmeier et al.'s survey paper [28] . An example of an 8-bit logical left barrel shifter is shown in Fig. 9 . As an example, the 48-bit left barrel shifter used in the logarithm unit occupies 130 slices and has two pipeline stages.
In Section 6, we determined the table sizes of the three function evaluation units to be 3,968, 4,096, and 16,640 bits for sin/cos, square root, and logarithm, respectively, resulting in a total memory requirement of 24,704 bits. A Virtex-4 block RAM is reported to be 18Kb, but 18Kb can only be utilized for certain memory width and depth combinations. Due to this constraint, we are unable to fit the logarithm table into a single block RAM. We adopt a three block RAM structure to store the tables, as illustrated in Fig. 10 . In block RAM 0, the C2 e and C1 e coefficients of the logarithm unit are stored. They are concatenated to form a single long word. In block RAM 1, the upper 256 locations store the C0 e coefficients of the logarithm unit and the lower 128 locations store the two square root unit tables. This RAM is dual-ported since the logarithm and square root unit need to access their coefficients every cycle. Block RAM 2 stores the coefficients for the sin/cos unit. It is dual-ported to allow simultaneous reads for sin and cos evaluations.
Our Box-Muller architecture is mapped into FPGA implementations using Xilinx System Generator 7.1. The implementations are heavily pipelined to maximize the throughput/area ratio. Synplicity Synplify Pro 8.1 is used for synthesis and Xilinx ISE 7.1.03i is used for place-androute with maximum effort level. An implementation on a Xilinx Virtex-4 XC4VLX100-12 FPGA occupies 1,452 slices, three block RAMs, and 12 DSP slices. Fig. 11 illustrates hardware area comparisons of the various units in the architecture. It is capable of a maximum clock speed of 375 MHz, one of the rounding circuitry in the logarithm unit being the critical path. Since we can generate two samples per clock, a throughput of 750 million noise samples per second is attainable. There are 53 pipeline stages, hence valid noise samples will appear 53 clock cycles after the reset signal is set from high to low. In addition, we have used the noise generator for channel code simulations on an FPGA platform, where the hardware utilization, quality, and performance were confirmed.
Higher throughputs can be obtained by exploiting parallelism. We are able to fit 37 instances of the noise generator on a Xilinx Virtex-II Pro XC2VP100-7 device, which, in fact, has more area resources than the largest Virtex-4 device. They occupy 45,077 slices, 84 block RAMs, and 336 MULT18X18s (embedded multipliers). Due to routing congestion on the chip, we are only able to achieve a clock speed of 95 MHz. Despite the significantly lower clock speed, with the parallelism of the multiple instances, we are able to achieve a throughput of seven billion noise samples per second.
EVALUATION AND RESULTS
This section describes how we verify the accuracy of our noise samples and compare the performance of our design against other hardware and software implementations.
In order to test the accuracy of the noise samples, we compare 10 billion noise samples from our noise generator against the ones generated from IEEE double-precision floating-point arithmetic. As anticipated, the ulp error of all samples are found to be less than 1 ulp for all samples. To verify the accuracy of the samples in the high regions, we test 10 billion noise samples over the range of sigma multiples ½À7; À4 and ½4; 7 and, again, all samples are found to be accurate to 1 ulp. Throughout the tests, 95 percent of the samples are observed to be accurate to 1=2 ulp (i.e., exactly rounded), demonstrating the sheer quality of our noise samples. Fig. 12 shows a ulp error plot of 10,000 samples. We see that all samples are accurate to 1 ulp and most are accurate to 1=2 ulp.
Statistical tests, such as the 2 test or the AndersonDarling test [6] are not necessary since 1) we know that the derivation of the original Box-Muller algorithm itself is correct and 2) we generate the samples accurately within the 16 bits resolution. Fig. 13 shows the PDF of our noise samples for a population of 10 million, while Fig. 14 shows the PDF between 7 and 8:2 for a population of 10,000. In both cases, our noise samples closely follow the true Gaussian PDF.
Various hardware implementations of our Box-Muller architecture are compared against several software implementations based on the Wallace [9] , Ziggurat [7] , polar, and Box-Muller methods [6] , which are known to be the fastest methods for generating Gaussian noise on instruction processors. For the Wallace and Ziggurat methods, FastNorm3, available in [29] , with maximum quality setting and rnorrexp available in [7] are used. The software implementations are run on an Intel Pentium-4 3 GHz PC and an AMD Athlon-64 3000+ 1.8 GHz PC, both equipped with 2 GB DDR-SDRAM. They are written in ANSI C and compiled with the GNU gcc 3.3.3 compiler with -O3 optimization, generating IEEE double-precision floating-point numbers. In order to make a fair comparison, we use the Tausworthe URNG for all implementations. The Tausworthe URNG can generate 110 million 32-bit uniform random numbers per second on an Intel Pentium-4 3 GHz PC. The comparisons are shown in Table 1 . We see that our hardware designs are faster than software implementations by 11-5,408 times, depending on the device used and the resource utilization. It is also important to note that the software implementations could certainly be made somewhat faster by utilizing assemblerlevel optimizations. That said, the FPGA implementations are approximately two to three orders of magnitude faster in terms of throughput than the software implementations Fig. 12 . Error plot for 10,000 randomly generated samples from our noise generator compared against IEEE double-precision floating-point arithmetic. The noise samples have 11 fraction bits, hence the ulp is 2 À11 . Over 95 percent of the samples are exactly rounded. Fig. 13 . PDF of the generated noise from our design for a population of 10 million samples. The black solid line indicates the ideal Gaussian PDF.
Fig. 14. PDF of the generated noise from our design for a population of 10,000 samples between 7 and 8:2. The black solid line indicates the ideal Gaussian PDF.
TABLE 1 Throughput Comparisons of Various Hardware and Software Gaussian Noise Generators
The XC2VP100-7 FPGA belongs to the Xilinx Virtex-II Pro family, XC4VLX100-12 FPGA belongs to the Xilinx Virtex-4 LX family, the XC2V4000-6 belongs to the Xilinx Virtex-II family, while the XC3S5000-5 belongs to the low-cost Xilinx Spartan-III family.
and even the best assembler-level coding would leave a large performance gap. Thus, the message of Table 1 is that, as one would expect, the dedicated hardware implementation is significantly faster than what is achievable with programmable software platforms. Table 2 shows comparisons of our design against four other designs: the Gaussian noise generator block available in Xilinx System Generator 7.1 [17] , the Ziggurat design in [16] , our previous Box-Muller design [15] , and our Wallace design [14] . Note that the Xilinx block is based on Xilinx's AWGN core 1.0 architecture [12] . In order to make the comparisons fair, all designs are placed-and-routed on a Xilinx Virtex-II XC2V4000-6 FPGA and hand placementand-routing is not performed. We observe that our design has the best noise quality, the maximum obtainable multiple, and the best throughput/area ratio. The clock speed of our design is higher than others due to more aggressive pipelining.
CONCLUSIONS
We have presented a hardware Gaussian noise generator using the Box-Muller method to aid simulations involving large numbers of samples. The architecture involves a series of elementary function evaluations, which are computed using fixed-point arithmetic. In order to obtain minimal signal bit-widths while respecting the accuracy requirements, we perform error analysis based on the MiniBit framework [25] . Two 16-bit noise samples are generated every clock and, due to the accurate error analysis, every sample is analytically guaranteed to be accurate to one unit in the last place. The noise generator accurately models the true Gaussian PDF out to 8:2.
The design has been realized in FPGA technology. An implementation on a Xilinx Virtex-4 XC4VLX100-12 FPGA occupies 1,452 slices, three block RAMs, and 12 DSP slices and is capable of generating 750 million samples per second at a clock speed of 375 MHz. Further performance improvements are obtained through exploiting parallelism: 37 instances of the noise generator on a Xilinx Virtex-II Pro XC2VP100-7 FPGA can generate seven billion noise samples per second, which is over 200 times faster than an Intel Pentium-4 3 GHz PC. The noise generator is currently being used at the Jet Propulsion Laboratory, NASA to evaluate the performance of low-density parity-check codes for deepspace communications.
Future work includes applying our design methodology to other distributions, including Cauchy, exponential, and Weibull. These can be generated by evaluating the inverse cumulative distribution function (CDF) of the distribution [8] , which can be approximated via polynomials. . For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
