A module generator is described that allow.7 for the generation of synthesizahle VHDL modules which implement arbitrary functions in f u e d point precision using the Symmetric Table Addition Method ( S T M ) . This module generator was interfaced to a high level synthesis tool "fly" which automatically generatesfully-pipelined circuits from a Perl-like language. The resulting system was applied to the N-body problem and results an! presented. It was found that afunction generator module is a very useful addition to a hardware description language.
Introduction
The field of reconfigurable computing has benefited greatly from the development of efficient hardware description languages as well as improved field programmable gate m a y (FPGA) device capacity in recent years. These two developments have greatly improved our ability to develop FPGA based hardware accelerators for computing applications. A large number of computing applications require the use of real arithmetic, and FPGA designers use both fixed and floating point representations in such systems.
In software based systems, libraries of commonly used mathematical functions are available, and programmers need not be familiar with the details of their implementation. In hardware design, although there are libraries available to perform specialized functions, they are normally restricted to a small hxed set of functions e.g. implementations of the CORDlC algorithm.
In this paper, an efficient table lookup generation system for supplementing a hardware description language (HDL) is proposed. In particular, an implementation of the Symmetric Table Addition Method (STAM) which acts as a module generator for an arbitrary twice differentiable function is described. This module generator was integrated with OUT high level synthesis tool "Py" [3] to produce a very flexible design environment which allows the specification of arbitrary functions in a high level manner. The resulting environment was used to develop a coprocessor for the computation of the N-body problem.
The N-body problem is computationally intensive and involves a large number of floating point operations. This together with the fact that relatively low precision is required makes it a good candidate for hardware acceleration using FPGAs. Using a module generator provides further advantages by allowing the designed system to he adapted for different application or different algorithms with very little effort. The design productivity of the module generator approach was much higher than that which could be expected from a VHDL designer.
The rest of the paper is organized as follows. In Section 2, an introduction to the N-body problem is given and the need for a function approximation justified. In Section 3, the STAM table lookup algorithm is described and extensions made to VHDL to allow for the simple interfacing of the STAM code detailed. An example which describes the application of tbe resulting STAM system to tbe implementation of the floating point function f(z) = z 2 is also given. Seclion 4 describes tbe implementation of the N-body problem. Results obtained from this implementation are described in Section 5 and finally, conclusions are drawn in Section 6 .
The N-body Problem
A wide range ofphysical systems can he studied by modeling them as an N-Body problem and are used in fields of science such as astrophysics and molecular biology. In the N-body problem, particles are modeled as points in space.
The potential of the system can be expressed as a function of the properties and positions of all particles in the system and the force exerted on a particle is the first derivative of this potential with respect to the position of the particle. Different applications of the N-body problem share the same basic structure but differ in the physical law that govems the force between particles and the exact equation for calculating the potential and force depends on the application. By integrating the force acting on aparticle, its position can he computed as a function of time.
There is no known analytic solution for the N-body problem for N 2 3 so N-Body problems are solved numerically in practice via simulation. At each time step, forces exerted on each particle are computed and the positions of the particles are updated at the end of the time step by integrating all forces acting on all particles. In N-body simulations, most computation time is spent on force calculation and the number of interactions between particles grows as O(nZ), where n is the number of panicles. For large n, this calculation becomes very expensive and poses a limit on the size of system that can be realistically studied.
Since the force calculation part consumes most of the execution time, and at the same time has a rather simple algorithm, it is agood candidateforhardware acceleration. In fact, this has been done in many systems, usually employing a heterogeneous architecture consisting of a general purpose host computer and a special purpose hardware. The special purpose hardware handles the force calculation while the host computerhandles all other computations. Most notable of those is the GRAPE (Gravitational Pipeline) computer for the gravitational N-body problem [5J.
In this work, an FPGA based co-processor for evaluating gravitational forces in N-body simulations was built using a module generator approach. The architecture of the co-processor is similar to the GRAPE-1 system. GRAPE-1 is the first in a series of specialized processors evaluat- 
The equations are the same as that implemented in the GRAPE-1 system. a; is the gravitational acceleration at the position of particle i, x; is the position of vector particle i, T ;~ is the distance between particles i and j and t is the artificial potential softening used to suppress the divergence of the force at ~i j + 0. Implementation of the above equations is reasonably straightforward using multipliers and adders, with the exception of a function to compute x -s / 2 . The implementation of arbitrary functions is a tedious task and hardware description languages such as VHDL do not offer support to aid their implementation. In the following sections, we describe the STAM table lookup algorithm which can approximate any twice differentiable functions to a desired accuracy and then describe its integration into VHDL.
Table Lookup Approximations

Taylor Expansion
If a function f (x) has continuous derivatives up to (n f I)'h order, it can be expanded using a Taylor series:
To reduce the required hardware resources and/or computation power, only the first few terms in the Taylor series can be used to approximate the function.
Symmetric Bipartite Table Method (SBTM)
The SBTM uses the first two terms of the Taylor series to approximate a function f(x) as f(z) [6] . In the SBTM, two lookup tables are constlucted and the precision of the output is maximized. 
I
The ranges of x 1 are: Two lookup tables which r e m the values a0 and al are then constructed. The sum of these two values will be an approximation to the function. We &st select mid points in the ranges of x1 and x2:
Let a = xo + X I + 62 and use the first two terms of the Taylor Expansion: 
1x2 -621 5 -
XmCU
2
The upper bound of al is
where
Note that even though the domain of f(z) is 10, l), the range of a0 = f'(x)(x -a) may include negative numbers.
From equation (8), one can calculate the number of leading Os (or leading Is for a negative number) to be
To maximize the precision of the algorithm and optimize the resource utilization, errors in approximation must be controlled. In the SBTM algorithm [6], the following inequality ensures that the error will be less than the LSB of lhe result:
(1 +2-"1)+2-pf-g-1
The first term is the error contributed by the Taylor ap- 
Symmetric Table Addition Method (STAM)
The logic in the SBTM is very simple and only two tables are required. The STAM algorithm uses more tables of smaller size to significantly reduce the overall memory required [71.
As shown in Fig 3, where The error analysis of the STAM is similar to the SBTM algorithm. The constraint? for the parameter configuration are:
Input Range Scaling
The above analysis was based on the input range x E [0,1). Both SBTM and STAM can be adapted to other input ranges by first making a linear transformation of the actual input range to this range without changing the value of f (4.
For an n-hit input x i , let E he the transformed value and assume the decimal point is right of the LSB. If the input range is [ $ , , , i n , xmaz), then
X i =+ 2i = 2;;(5""" -%") +%in
VHDL Extension
To allow easy usage of the STAM algorithm in VHDL designs, some simple extensions were introduced by m a k ing use of the comment section inside VHDL code segments. The user includes the name and the body of the target function as well as some configuration parameters.
The following example demonseates the instantiation and usage of a sin function in the VHDL source.
architecture . Four tables are generated for the 16-bit input. The decimalpoinr statement indicates that decimal point is located at the left side of the most significant hit. The output will he ready after the next tising clock edge and will be valid as long as the input x is valid. The clock signal is required since synchronous RAMS are used to store the tables.
__ -
The VHDL codes are first passed to a preprocessor before going to the synthesis stage. A flow chart of the preprocessor is shown in Fig 4. First, the function extractor
Range reduction and result correction are necessary in the floating point implementation. Consider the BEE 754 binary floating point number representation: extracts the function body in the extended VHDL block and passes it to YACAS (YACAS is public domain software which perfoms symbolic arithmetic operations [2] ). YACAS accepts the input function, finds the symbolic expression first and second derivatives and passes the results to the table generation program. The table generation program transforms the input strings to a sequence of arithmetic operations and generates the content of the lookup tables. These contents will be used in the VHDL generator to generate synthesizable VHDL code using Xilinx Block-RAM [E] as the lookup tables.
With this extension, an arbitrary twice differentiable function can be used in VHDL without knowledge of the detailed implementation. The default evaluation time is 1 clock cycle but this can be eacily modified in the generated stluctural VHDL codes. This preprocessing method can be easily adapted for other HDL languages such as Verilog with minor modifications.
Floating Point Extension to Fly
The extendedfly compiler supports basic floating-point operations, such as addition, subtraction and multiplication with different precisions. Transcendental functions such as square root and exponential are frequently required to evduate the force or acceleration in the N-body problem. Such functions can be implemented using the modified STAM approach. This section will describe the development of a floating point coprocessor for the N-body problem which computes U-'/'.
In the original STAM algorithm, the input value is considered to be a fixed point number within a predefined range. It is possible to modify the logic so that it can handle specific knctious for floating-point arithmetic.
When e is even, let e = 2 N , equation 19 becomes:
Similarly, if e is odd, let e = 2 N + 1, equation 19 becomes:
In both cases, the fractional part can be calculated using STAM with the input range [1,2), and the exponent part implemented using shift and add operations. If e is odd, the final result should be multiplied by 2-"'.
Normalization of the STAM output is required to convert it back into a floating point format. Since for 1 5 U < 2 the output of STAM has a range 0.354 < U-~I' < 1, the location of the leading one must be one of the two most significant bits. The datapath of the calculation is shown in fig 5. Since it supports arbitrary size floating-point numbers, the precision can be modified to suit the application.
To implement the circuit on an FPGA, tbefly [3] compiler was used to generate synthesizable VHDL code and a Pilchard board [4] was used as the reconfigurable platform. Pilchard uses a DlMM memory bus interface which provides high U 0 performance compared to the PCI bus. Fly can take a Perl-like description of an algorithm and fully supports floating-point arithmetic of arbitrag precision. In addition, the mechanism of thefly compiler allows for easy integration of a block such as STAM. Thefly compiler was modified such that it can handle the _power15 ( ) function using the built-in function mechanism.
Due to the limited amount of memory available on the FPGA chip, the STAM used 16-bit integers as input and the table size is (8,2,2,2,2). STAh4 was used to compute the function f(z) = c 3 l 2 where 1 5 z < 2. Note that since STAM requires an input 0 5 5 < 1, the transformation 2 = z -1 was applied as described in equation 17. Thus Figure 6 shows the datapath of the design which implements Equations I, 2 and 3. A program written in the& language was used to implement the equation 2, the inner loop of an N-body simulation.
Implementation
The input of the program is x i , x j and t while the output is the acceleration ( s j ) for a particular value of x j . Most of the constructs are parallel in nature so that the vector manipulation can be processed simultaneously. For example, xj -x, can be done at the same time for each scalar in x j and xi. By appropriately inserting assignment statements in the program for pipeliuing, a fully-pipelined core can be generated using thejly compiler. The resulting VHDL code can produce a new value of a,j every cycle. In addition, thepy code can used for simulation and verification by.directly executing it under the Per1 environment [3] , saving time and reducing errors compared with manually translating the algorithm description into VHDL. The summation in equation I was done outside thefly core and a floating point accumulator was used to compute and store the value ai.
The floating point module supplied by thejly compiler is readily parameterized so the tradeoff between accuracy and resources is adjustable. Different sizes of fraction and magnitude can be implemented from the same description.
As t h e j y core used a register interface, in order to allow the design to run at maximum speed, a dedicated host interface, as shown in Fig 7, was developed so that it can provide a new xj every clock cycle. "here are two input buffers each for x j and x i . Dual-port BlockSelect RAMS are used to consmct these buffers in order to reduce area usage and isolate the clock domains of interface and core units. The host program first fills the x j buffer. Tbe'design will loop through the x j buffer for each xi received from the host. By doing so, an N-body problem requires N x 2 write operations instead of N 2 as long as the input buffers are large enough to store all N data. Also, the core design can start a new computation every clock cycle if the xi huffer is not empty. Thus the pipelined design can be fully utilized to generate outputs at throughput of I result per clock cycle. If N is too huge then not all xj's can be stored in the input buffer and the host program is required to partition the problem and perform a final correction.
Results
The N-body simulation using the STAM for function approximation was implemented on a Xilinx XCVIOOOE-6 FPGA with a total of 10260 slices for the 32-bit configuration. The number of BlockSelect RAM used in the design was 32 for all configurations.
In order to test our design. the NEMO toolkit [I] In figure 8 , the average quantization error is plotted against the fraction length. As expected, for fraction lengths below 15-bits. the quantization emor decreases linearly with the fraction length. The graph tails off at longer word length due to the limited precision of the STAM table used in this implementation. The precision required for an N-body simulation depends on the underlying problem being solved. Using our implementation, the user can choose the fraction length which best suits their requirements.
In figure 9 , the clock frequency reported by the design tools is plotted against the precision. Ail configurations run successfully at 50 MHz. In figure 10 , the area in slices is plotted against the precision. It can be seen that the area increases linearly with fraction size. Since the design is parameterized, designs with different fraction size can be easily generated from the same description. If the application calls for a configuration with fraction width less than or equal to 15-bits, it is possible to achieve better performance by implementing 2 pipelines on the chip. With 2 pipelines running in parallel, the design can potentially achieve a peak speed of 3 GFLOPs. Since the magnitude of force between particle pairs varies greatly, quantization error introduced by the accumulator in the long chain of summation operation can significantly degrade the accuracy of the result. It was found that using a longer accumulator fraction size greatly reduces quantization error without significantly increasing the area.
Thus, 32-bit floating point accumulators were used for all configurations. Table I 
.Conclusions
A flexible framework for applying the STAM method has been presented. A preprocessor was developed which can be used to generate syuthesizable VHDL modules which implement arbitraty user defined functions from cnmments inside VHDL source code. This function generator was integrated into the& environment to extend its flexibility and efficiency. An N-body problem simulation was implemented using this framework to demonstrate the utility of such an approach. Without detailed knowledge of the STAM implementation, a fully pipelined N-body core was generated from 246 lines o f f l source code. Furthermore, implemeutations with different floating point precision can be implemented from the same source code, facilitating the analysis of floating point precision, area utilization and performance. This framework can be used to solve real world problems with minimum design effort and with sacrificing performance.
