Abstract-This paper describes TANOR, an automated framework for designing hardware accelerators for numerical computation on reconfigurable platforms. Applications utilizing numerical algorithms on large-size data sets require high-throughput computation platforms. The focus is on N-body interaction problems which have a wide range of applications spanning from astrophysics to molecular dynamics. The TANOR design flow starts with a MATLAB description of a particular interaction function, its parameters, and certain architectural constraints specified through a graphical user interface. Subsequently, TANOR automatically generates a configuration bitstream for a target FPGA along with associated drivers and control software necessary to direct the application from a host PC. Architectural exploration is facilitated through support for fully custom fixed-point and floating-point representations in addition to standard number representations such as single-precision floating point. Moreover, TANOR enables joint exploration of algorithmic and architectural variations in realizing efficient hardware accelerators. TANOR's capabilities have been demonstrated for three different N-body interaction applications: the calculation of gravitational potential in astrophysics, the diffusion or convolution with Gaussian kernel common in image processing applications, and the force calculation with vector-valued kernel function in molecular dynamics simulation. Experimental results show that TANOR-generated hardware accelerators achieve lower resource utilization without compromising numerical accuracy, in comparison to other existing custom accelerators.
hardware, the high-level or behavioral description has to be translated into a representation at the register transfer level (RTL). The first and rudimentary step in the process is the conversion of double-precision floating-point (FP) MATLAB code into a fixed-point (FX) or customized floating-point version. Trade-offs, such as data precision, rounding modes, and signed/unsigned representations, are performed at the MATLAB description level. Next, hardware designers take the specifications and requirements from the developer to create a physical implementation in a hardware description language (HDL) such as VHDL or Verilog. Finally, the RTL HDL code is synthesized, placed, and routed (P&R) onto an FPGA platform. The manual realization of the traditional RTL design flow tends to be tedious and error-prone.
We describe in this paper, TANOR, a tool for generating accelerators for numerical computation on reconfigurable platforms [5] . It is intended for computation applications with very large-sized data input, and with high accuracy and high-throughput requirements. We illustrate the tool with algorithms for simulating N-body interaction problems, which have applications in molecular dynamics, celestial mechanics, plasma physics, fluid mechanics, and semiconductor device characterization. TANOR is equally applicable to a wide variety of traditional DSP applications, such as digital filters and transform computations, including Discrete Cosine Transform (DCT) and Fast Fourier Transform (FFT). The design of TANOR is aimed at the following objectives:
. Support for design specification of certain numerical algorithms in a high-level language (MATLAB), making joint exploration of algorithmic and architectural variations feasible and accessible to application designers. . Support for fully pipelined implementations in both floating-point and fixed-point formats, standard or customized. . Optimizations with respect to area, power, accuracy, and throughput in a fully automated design flow from MATLAB specification to HDL synthesis. . Short design cycle due to development of accurate estimation models and evaluation mechanisms. We demonstrate the effectiveness and adaptability of TANOR using three different target applications, namely, the calculation of gravitational potential in astrophysics, the diffusion or convolution with Gaussian kernel common in image processing, and the force calculation with vectorvalued kernel function in molecular dynamics simulation.
The rest of the paper is organized as follows. Section 2 provides the relevant background and a brief summary of related work. Section 3 presents an overview of TANOR. Sections 4 and 5 elaborate on key hardware and software modules. Section 6 illustrates the capabilities of TANOR through design examples. Section 7 concludes the paper.
BACKGROUND AND RELATED WORK

High-Level Synthesis Flow
There are a number of existing design flows that bridge the gap between algorithm design and architecture development. An important component of these works is a seamless translation from a high-level domain, where algorithm designers work, to hardware (HDL based) and software (C based) specifications for a target architecture. This is essential for avoiding time-consuming and error-prone manual intervention steps. Furthermore, it aids in algorithm exploration by allowing one to rapidly analyze the effects of different algorithmic parameters on system performance.
Existing automated design tools are either based on IP block or language translation. IP-block-based tools provide parameterizable IP libraries for common DSP functions and corresponding hardware library to which these behavioral functions can be directly mapped. For example, in [6] , [7] , the authors use an IP-block-based design environment to generate SystemC-based architecture specifications, starting from a netlist of functional modules described in MATLAB. A similar methodology is presented in [1] for multiprocessor system-on-chips (MPSoC) target architectures.
There exist several commercial IP block solutions as well. FPGA vendors provide MATLAB/SIMULINK parameterizable IP libraries (known as blockset) and corresponding hardware IP libraries targeting specific FPGA devices, such as Xilinx System Generator [8] . In these design flows, a graphical design description in SIMULINK is automatically synthesized to a HDL description. However, these flows have two primary limitations. First, the hardware libraries are usually vendor specific, limiting the portability to other target architectures. Second, they are library specific, requiring substantial manual intervention if one wishes to explore different algorithmic alternatives.
Language-translation-based tools, on the other hand, specify the input in high-level language. Examples of such tools include "ImpulseC" [9] , "CatapultC" [10] , and "Cameron" [11] , which map C-like script applications to FPGAs, "Calypto" [12] , and "Sequential Equivalence Checking Tool" [13] , which map SystemC scripts into RTL implementations by checking sequential equivalence of a system-level model and its RTL version. The most prominent is perhaps AccelDSP by Xilinx, which takes input description in MATLAB. AccelDSP [14] (previously MATCH project [15] in Northwestern University, then AccelChip [16] ) provides a GUI-facilitated design environment to translate MATLAB input specifications into HDL. It features an automatic floating-point to fixed-point conversion module and integrates seamlessly with MATLAB to facilitate quick fixedpoint simulation. Several implementation choices are provided by the integrated IP library for nontrivial DSP functions, such as FFT. Despite these features, AccelDSP has some limitations. The bit-width analysis method used is based on dynamic simulation and requires the user to provide accurate and comprehensive input test vectors in order to determine accurate bit widths of each variable. Furthermore, it may not be suitable for high-performance scientific applications due to limited I/O precision (32 bits only) and lack of floating-point-based data path support. In addition, AccelDSP only generates fully flattened designs which, in many cases, renders the HDL code difficult to be read and changed. TANOR overcomes these limitations by automatically determining the bit width of each variable, supporting customzed fixed and floating-point architectures, and finally, maintaining hierarchy in the generated HDL that is consistent with the initial MATLAB code.
Scientific Application: N-Body Problems
While TANOR supports scalarized MATLAB code for a variety of DSP algorithms, its true capabilities are manifested in large input designs with high accuracy and throughput requirements. We use N-body simulation problems to illustrate the potential.
At any time step in an N-body simulation, one deals with the interactions between a set of M source particles and a set of N target particles. We may assume M ¼ OðNÞ. The aggregation of the interactions from all source particles to a target particle is governed by a physical law, namely, an interaction function [17] . A naive calculation of the interactions at all target particles takes OðN 2 Þ arithmetic operations. Existing accelerators include GRAvity PipE (GRAPE) project [18] , MD-GRAPE [19] , [20] , MD Engine [21] , [22] . All these hardware systems are specialized ASIC-based solutions for specific N-body applications. Reconfigurable systems utilizing FPGAs have been developed in [23] , [24] , [25] , [26] , [27] . Most of them are also customized for a particular interaction function.
There is an increasing need for an automated system design and generation to support various N-body problems, as recently noted in [18] . The GRAPE PGR system [28] is one of the recent efforts in developing a design framework, but currently it does not support optimizations in the algorithm-architecture design space. Also, some early efforts made toward modeling molecular dynamics using SRC Computers in [29] , [30] , involve specialized language and manual intervention in many key steps, such as extracting data dependency information (used in timing analysis). In contrast, TANOR provides a fully automated framework for different N-body interaction problems. It takes algorithmic description in MATLAB as input and employs both algorithmic and architectural means to explore and optimize system design without manual intervention. Fig. 1 shows a block diagram of the design flow that is fully automated by TANOR. The input consists of an algorithm description in MATLAB (for an N-body problem, for instance, it is the kernel function describing the interactions), a description of the data sets, optimization objectives, and constraints on accuracy, latency, area, and power. It is provided through a graphical user interface (GUI). Since MATLAB is used as the specification language, we do not require a separate software emulator for functional verification of the design in the early stages of the design flow. The output is an efficient system in which the host computer interacts with the FPGA accelerator for enhancing the performance of the numerical algorithm under consideration.
TANOR DESIGN FLOW
The first two stages of the automated design flow are the code parser and analyzer (CPA), and the algorithmic optimizer. After these two stages, two different flows are adopted to generate hardware modules for FPGA boards and software modules for the host computer. Hardware modules are purely logic blocks composed of three different modules including kernel function, data-flow control, and data interface. Hardware modules are accompanied with Xilinx scripts to run the Xilinx synthesis and P&R tool flow without user intervention. The software module generates the host program to run the FPGA platform.
The input interface of GUI is composed of two tabs: algorithmic selection and architectural selection, as shown in Fig. 2 . In the algorithm selection tab, a user can write the kernel function in an editor or select from a set of predefined kernel files. Information, such as dimension, input data files, output directory, and matrix tiling method (optional), can be chosen. In the architecture tab, the precision representation can be selected. The user can also specify the optimization techniques used in the algorithm optimizer stage using the GUI. Based on the settings, TANOR generates the required design automatically.
Code Parser and Analyzer
The first step in the TANOR design flow is the code parser and analyzer (CPA) module which processes the kernel specification along with the constraint information provided by the user. The MATLAB operations that are currently supported by TANOR are listed in Table 1 . This set is compact and sufficient for supporting a variety of interaction kernels.
The code parser and analyzer phase transforms the input specifications into an intermediate representation known as Abstract Syntax Graph (ASG). Each node of this graph represents either an input or a computation, and edges capture the dependencies between them. More importantly, this graph reveals the opportunities for optimizations, such as common subexpression elimination (CSE). For functions with multiple outputs, the common computations among the different outputs can be extracted and reused resulting in optimal resource usage. Fig. 3 shows detailed operations of the code parser and analyzer.
Algorithm Optimizer
The algorithm optimizer operates on and transforms the ASG using one or more of the following schemes, namely, function evaluation using Taylor polynomials, interaction matrix tiling or clustering, and data traversal within a cluster and across the clusters.
Interpolation schemes. TANOR employs a lookup table (LUT) approach for function evaluation. It considers the trade-off between the memory space and logic resources while respecting numerical accuracy requirement. In its present version, the LUT approach is based on the truncated expansion of the Taylor series of a smooth function f
where x 0 is the reference or sample point closest to the evaluation point x. The Taylor polynomial coefficients at x 0 up to the dth degree are precomputed and stored in the LUT. The same numerical accuracy can be achieved by using more sample points, and hence, a larger LUT with lower degree approximation, or a small table with higher degree approximation. Thus, there is a trade-off between the memory usage and the logic resource consumption and the scope of the trade-off is subject to the design constraints and conditions. This LUT approach has been used for evaluating trigonometric functions, square-root extraction, logarithmic and exponential functions, and more complex functions such as the Bessel functions. Details can be found in [31] .
Interaction matrix tiling. A tile in an interaction matrix represents the interaction between a source cluster and a target cluster. The tiling and the traversal ordering affect the accuracy, latency, as well as area and power of a hardware implementation. Currently, TANOR supports two types of tiling schemes: Plain tiling (PT) and Geometric tiling (GT).
PT is a conventional matrix partitioning technique for enhanced cache performance, where the source and target data sets are partitioned into small blocks of predefined lengths. GT is a new technique which partitions the matrices based on their geometric and numeric structures. The partition process involves "binning" of the input samples in to different tiles based on their positions or coordinates. One can think of binning as a process in which a few significant bits of the coordinates are read and the samples are placed into the corresponding bins. Thus, partitioning is an OðNÞ operation, where N is the number of samples. For general N-body simulations, the change in particle coordinates with time is incremental, and repartitioning is not done until the particles travel beyond the tile boundaries. Details for GT have been provided in our earlier paper [31] . Fig. 4 shows the numerical ranges of the input samples in each tile after applying PT and GT. It shows that GT has the same effect in tuning cache performance as PT. In addition, it also aims at reducing the dynamic range of numerical values per tile and across tiles. This reduces the required bit width without compromising numerical accuracy and saving power and hardware resource consumption. A consequence of this is that higher accuracy can be achieved with reduced resources, as described later in Section 6. Data traversal. This is important in large-sized N-body problems, where a summation of the force exerted on a certain target by all possible sources is performed. This requires a tile sequence to be generated in which every target tile is followed by all possible source tiles. The ordering of these tiles has a significant impact on numerical accuracy in accumulation of the computed results. When tiled intelligently, this can result in good performance with respect to accuracy while reducing the hardware resource requirement. Additional details are included in [31] , [32] .
Examples of Algorithm-Architecture Coexploration
As a first example, we consider the calculation of the gravitational potential. We let TANOR use GT, along with a data traversal scheme in which the source tiles are first arranged in decreasing order of their distance from the target bins. This potentially allows the accumulation of data in nondecreasing order of magnitude, resulting in higher accuracy. In contrast, if PT is used, this data traversal scheme is not beneficial because the dynamic range within a tile and across tiles is roughly the same. Thus for the same accuracy, the GT-based scheme requires lower precision hardware than PT-based scheme, thereby saving both in terms of area and power, as will be demonstrated in Section 6.6.1.
For the second example, consider a case where TANOR uses an LUT-based implementation for realizing computationally intensive interaction functions. Typically, LUT schemes allow limited trade-offs between logic resources and table size, which are achieved by varying the degree of the Taylor polynomial or by using a different interpolation or approximation scheme. If both GT and PT are considered on top of this, the design space becomes much larger. With GT, the dynamic range of the numerical values in each tile is restricted, and thus, only one segment of the table needs to be loaded for processing a tile. In contrast, if PT is used, each tile exhibits roughly the entire dynamic range, and thus, the complete table would have to be available [31] .
After the algorithmic optimization phase, the system is partitioned into software and hardware modules, as shown in Fig. 1 . The task of the hardware module (see Section 4) is to generate the configuration file for programming the FPGAs. The software module (see Section 5) generates the host computer program (in C++) that interfaces the FPGA with the host computer.
TANOR Hardware MODULES
The primary objective of the hardware module is to generate an efficient accelerator that can be mapped onto an FPGA. The generated architecture is composed of three main blocks (Fig. 5) : Kernel function pipeline block, Data flow control block, and Host-FPGA interface block. We describe them next.
Our system is currently targeted to support Synplify Pro 8.6.2 and Xilinx ISE 8.2i as back-end tools to generate the bit file to be configured on the FPGA. Perl and Tcl scripts that are required by the back-end FPGA compiler are also automatically generated.
Kernel Function Pipeline Generator
The Kernel Function Pipeline Generator (KFPG) block generates efficient, synthesizable HDL code for the interaction function under consideration. It supports IEEE-754 Floating-Point Standard and custom floating-point formats where the exponent and mantissa widths are user defined, as well as fixed point, where the precision of the internal variables is automatically generated. The data path operators are listed in Table 2 .
The input to the KFPG block comes from the CPA block described in Section 3.1. It is an optimized description of the interaction function in single operation form (SOF), as shown in part (a) of Fig. 6 . The SOF file is processed by the KFPG module to generate an HDL description that can be mapped to hardware. The key steps are depicted in Fig. 7 . Among these steps, bit-width analysis and binary point control are only applicable for fixed-point implementation, while the other steps are common for both fixed-and floating-point implementations.
Design compilation. The design compilation step transforms the behavioral specification of the design into an internal graph-based representation, known as data-flow graph (DFG) [33] . This process is essentially a one-to-one transformation of every operation specified in each operand of the SOF into a node of a DFG. The DFG is represented by nodes, where each v i 2 V represents either an operation or an input and a set of edges E ¼ fðv i ; v j Þ; i; j ¼ 0; 1; :::; ng. A directed edge e ij from v i 2 V to v j 2 V exists in E if v j is generated as a result of operating upon v i . An example of this SOF to DFG transform is shown in Fig. 6 .
Bit-width analysis. The goal of Bit-width Analysis is to determine the data-type information for each internal variable, including the bit width, the number type, the truncation mode, and the overflow mode. This is an important optimization step because the bit-width information is used by downstream modules to determine the latency and size of hardware components for implementing the corresponding operations on these variables.
In case of floating-point (FP) implementation, the user has the option of selecting the widths of mantissa and exponent fields. This choice is typically based on area and accuracy constraints of the application.
In the case of fixed-point (FX) implementations, the user only specifies the dynamic range of the inputs and the maximum allowed bit width. The Bit-Width Analysis extracts fixed-point precision parameters (integer and fraction widths) for each variable. Specifically, it traverses the DFG and calculates the dynamic range of each node using one of the two range estimation techniques, namely, Interval Arithmetic (IA) and Affine Arithmetic (AA) [34] , [35] .
In IA analysis, each quantity x is represented by an interval "
x ¼ ½x lo ; x hi ]. These intervals are added, subtracted, multiplied, etc., in such a way that each computed interval " x is guaranteed to contain the unknown value of the corresponding real quantity x. Some rules used in TANOR for dynamic range calculation are shown in Table 2 .
For large DFGs, IA is vulnerable to the range explosion problem. In that case, TANOR applies Affine Arithmetic (AA) [34] , [35] , which solves the range explosion problem by keeping track of correlations among intervals. In AA, the uncertainty of a signal x is represented by an affine formx, which is a first-degree polynomial
Each i is an independent uncertainty source that contributes to the total uncertainty of the signal x. Some basic arithmetic operations in AA form provided in TANOR are as follows [34] :
Once the dynamic range for each variable is decided, TANOR applies the bit-width analysis algorithm [36] shown in Fig. 8 to compute the fixed-point bit-width parameters of each node in the DFG. 
Binary point tracking.
Àn . In other words, the fractional parts of y and z have to be matched before the integer parts can be added. This may require zeros to be appended after the LSB of the operand. Some binary point tracking rules used in TANOR are shown in Table 3 when the operand has bit width ½I; F , where I and F are the integer and fractional bit widths, respectively.
Scheduling. Scheduling puts a time stamp to each task in the behavioral specification, where time is measured in number of clock cycles in the case of synchronous systems. The scheduling is done in three phases. In the first phase, the latency of each operation is identified. The TANOR hardware library contains primitive hardware modules for which latency can be estimated as a function of the bit widths of input and output operands, as shown in Table 4 .
In the second phase, TANOR utilizes an "As Soon As Possible" (ASAP) scheduling scheme to generate fully pipelined implementation of the behavioral description. ASAP is a minimum latency schedule obtained by topologically sorting the vertices of the sequencing graph in depthfirst order. In the third and final phase, the number of delay elements that must be introduced to synchronize the operations, are identified. This is done using lifetime analysis, where the lifetime of each variable is obtained from the scheduled DFG. This helps in the generation of a fully pipelined implementation of the behavioral description. Fig. 9 presents an example design during the scheduling step. The input DFG is obtained after bit-width analysis of the DFG described in Fig. 6 . Delay stages are inserted to fully pipeline the design, as shown in Fig. 9b . TANOR uses custom-length shift registers to facilitate variable delay insertion.
Mapping. In this step, each operation specified by a node (except for inputs) of a DFG is mapped to a functional unit from the list of available library components. The current list of available library components is shown in Table 5 . There exists a one-to-one correspondence between each operation type and library component. For example, an add(þ) operation is mapped to LogiCORE Adder/Subtracter v7.0 [37] provided by Xilinx. Lookup-table-based operations are mapped to BRAM's by inferring behavioral code. The delay elements that are required for synchronization are mapped to custom-length shift registers. The depth and width of the shift-register stages are determined by the lifetime and bit width of the variable, respectively.
Output generation. The output of the design is created in a format that is easily processed by downstream synthesis, and P&R tools. We use VHDL as the output format.
Error tester generation. The final step in KFPG module is generation of MATLAB functional models both for fixedpoint and floating-point designs. These models accurately model the behavior of the hardware system. The user can directly substitute these models in the verification environment to quickly analyze the mathematical errors that are introduced due to reduced precision of the implemented hardware kernels.
Area and Power Estimation Block
TANOR provides an area/power estimation block with the KFPG block to facilitate early design space exploration. It is used to determine whether a particular implementation will fit in the target FPGA without waiting for the time-consuming synthesis and P&R steps. Accurate area and power estimation models for implementations using Xilinx Virtex-2Pro FPGA family have been described in [38] . Area models have been derived for the library components listed in Table 5 for both fixed-and floatingpoint data. The inputs are the bit widths of the operands and the outputs are given in terms of number of slices, block RAMs, and 18 Â 18-bit multipliers. Power models have been derived which take the area estimation results as inputs and provide estimates for logic power, signal power, clock power, and I/O power. In all cases, the model coefficients have been derived by using curve fitting and nonlinear regression analysis [39] . Validation of these models is shown in Section 6.6.3.
Multiply and Accumulate Block
In many matrix computations that involve large data sets, accumulation with high accuracy needs to be supported. The accumulation can be done either off-chip in the host computer or on-chip using a customized MAC block. Fig. 10 shows the block diagram of an on-chip MAC block. The bit widths of the mantissa and exponent can be varied according to the application requirement. The default precision of the MAC block is set to be single-precision floating point. The intermediate result is stored in a local buffer utilizing an FIFO due to the latency associated with floating-point multipliers and adders. The state machine for scheduling timing for the MAC block takes into account the latency of the variable precision multipliers and adders. This MAC block is duplicated for each output from the kernel function unit to achieve full throughput.
Data Flow Control Generator
The Data Flow Control Generator shown in Fig. 11 contains instantiations of memory blocks and state machines that control the data flow for the N-body problem. There are several types of memory components in an FPGA including registers, FIFOs, RAMs, and ROMs. The specific components are chosen based on an analysis of access patterns and size of each input type. In general, constants are stored in registers, and input data are stored in one FIFO and two single-port RAMs. In addition, FIFOs are used to store results; the number of these FIFOs depends on the number of kernel function pipelines used.
There are three state machines responsible for controlling data flow to provide high-throughput performance. The first state machine handles storing of input data. The second state machine controls hardware execution and generates signals for reading data from input FIFO and RAM, and writing the data to the output FIFOs. The third state machine decides the access order of output FIFO to avoid collisions. The data-flow sequence is summarized below.
1. The target information for one tile is stored. 2. The source information is fed in the sequence specified by the tiling method. The hardware execution starts as soon as the first source tile is available. 3. Once the entire source information for the current tile is read and the hardware execution has started, the next target tile is fetched.
4.
Steps (2) and (3) are repeated until all target information is processed and the result is calculated. In this architecture, spatial and temporal parallelism are exploited for improving performance. Specifically, spatial parallelism is exploited by using multiple kernel function blocks and sharing the source information that is common to all the target tiles. Temporal parallelism is exploited by overlapping the kernel function evaluation and accumulation phases.
The automated HDL generation for this block uses HDL templates that contain various parameter and generate sentences supported by Verilog-2001 standard. The parameter value setting and instantiation mapping are automatically done by our tool from information generated by CPA and kernel function code generator.
Host-FPGA Interface Block
The Host-FPGA interface block is composed of PCI interface controller that generates the control signals for the PCI Express interface core 1 and input/output buffer, as shown in Fig. 12 . The PCI interface controller block supports DMA operations to maximize the communication speed using burst-mode transactions. Burst-mode transactions enable minimum latency communication, and therefore, help to achieve the theoretical maximum throughput of the PCI Express interface.
TANOR Software Modules
The key task of the software module is to generate the host computer program (in C++) that facilitates the data communication between the FPGA accelerator and the host computer. The structure of the software system is composed of four hierarchical stages: Algorithm Optimizer, Packetizer/Depacketizer, Direct Memory Access Controller, and Device Drive for PCI. In the Algorithm Optimizer stage, matrix tiling and data traversal is implemented in MATLAB. The Packetizer generates the packets that combine input data and control commands necessary to utilize the hardware system. The Direct Memory Access (DMA) controller is the primary method of sending/ receiving data from/to the FPGA accelerator. Since the initiation of a burst-mode transaction from the host is not supported by the operating system, the transfer information is written to the FPGA accelerator. The accelerator then initiates send/receive operations utilizing burst-mode transactions. Jungo WinDriver is used as a device driver to initialize and set up PCI device registers.
The simplified procedure to operate the FPGA accelerator from the host computer is shown in Fig. 13 . The communication of data to and from the FPGA board is synchronized using interrupts. The input data are structured by the host interface by inserting commands that aid the hardware in identifying the tile boundaries. The structured input sequence is then written into an input buffer, which is, subsequently, read by the PCI Interface in bursts. The Host-FPGA Interface block generates an interrupt after processing a target tile, and the partial results are written back to the output buffer. The results are, in turn, processed and reordered by the host to fit the output format.
CASE STUDY
System Configuration and Experimental Setup
We first describe the experimental setup. The synthesis environment consists of Synplify Pro 8.6.2 and Xilinx ISE 8.2i.
The target platform is the Xilinx Virtex2Pro-100 device. More specifically, we have used DN6000k10PCIe-4 [41] logic emulation system as the target hardware platform. It consists of six Xilinx Virtex2Pro-100 devices out of which we have used two. The PCI Express block and data interface FIFO block are configured to provide PCI Express 4 Lane DMA mode of operation on one of the FPGAs, and the other FPGA is used to implement the blocks that are specific to the numerical algorithm. The test bed system is shown in Fig. 14 .
The user provides a MATLAB description of the kernel function, data sets, optimization objectives, and constraints through a GUI interface on the host computer as shown in Fig. 2 . TANOR takes about 3 seconds to translate the MATLAB specifications to HDL code on a Pentium-4 2.4 GHz processor. The hardware synthesis part takes time proportional to the size of the design that needs to be mapped. For instance, the synthesis time ranged from 1 to 7 hours for the following experiments. In the end, the result window shows area, timing, and accuracy.
In the rest of the section, we demonstrate the capability of our automated tool for three different kernel functions: the Gaussian kernel, the Gravitational kernel, and a force calculation kernel applied in molecular dynamics.
Interaction Kernel Function Examples
The Gaussian Kernel
The Gaussian kernel is frequently used in image processing and reconstruction [42] . A 2D image is used for the test data set. The Gaussian kernel is represented by
where x i ; x j are spatial positions of the ith and jth pixels, respectively. A truncated Taylor series expansion is used for hardware implementation of the function evaluation. Specifically, the Gaussian kernel is approximated by Taylor polynomial of degree 2.
The Gravitational Kernel
The Gravitational kernel is used in astrophysical N-body simulations [18] . In every time step of the simulation, the gravitational force, its time derivative, and the gravitational field potential exerted on a target set of particles, T, due to mass at a source set of particles, S, is computed. There are seven kernel functions for calculating the magnitude of these parameters at every target particle. The input test set consists of the source and target particle locations in a 3D space R
Here, t i and s j denote the spatial position of the target and source particles in T and S, respectively, kt À sk denotes the euclidean distance between t and s; mðs j Þ is the mass of the particle at location s j , and c is a constant. Equations (4) and (6) compute the acceleration and potential of a target particle t i , and (5) computes the time derivative of the particle acceleration.
Force Calculation in Molecular Dynamics
Force calculation in molecular dynamics simulation is computationally very intensive and is implemented as follows [30] :
wherer ij is the distance vector between atoms i and j in 3D space, and r ij is the magnitude ofr ij . A and B are constants and q i and q j are the charges on atoms i and j, respectively.
Evaluation Metrics
The following metrics have been used to evaluate the different configurations in our experiments:
. Area is reported by the number of occupied slices, and the maximum number of pipelines (or parallel versions) that can fit onto a single FPGA. . Power consumption is measured using XPower [43] for a 125 MHz clock. . Performance is obtained through actual time measurements on our target platform for a problem size of 5,000 particles averaged over 20 different executions. . Accuracy is computed in MATLAB. It is represented by the euclidean distance of the deviation vector between the FP (or FX) implementations and its double-precision counterpart, divided by the euclidean distance of the double-precision counterpart ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
6.4 Trade-Off Analysis Table 6 shows the performance of the three kernels with respect to power, performance, latency, and accuracy. We chose two different configurations of fixed-point and floating-point implementations to demonstrate the effect of precision on the performance metrics. The notation FX-[x y] stands for fixed-point implementation with x and y representing the maximum bit width and maximum fractional bit width, respectively. The notation FL-e m, stands for floating-point implementation with e and m representing the exponent and mantissa bit widths, respectively.
We first compare the kernels with respect to the number of operations, defined as the number of primitive operations used to implement each kernel function, and the number of pipelines per FPGA, defined as the maximum number of data-path pipelines that can be included in one FPGA. The gravitational kernel has the highest number of operations per data-path pipeline since there are seven kernel functions that have to be implemented. Consequently, it has the smallest number of pipelines per FPGA. In the Gaussian kernel, FL-e8m23 has the largest resource requirement due to its large precision, and hence, the smallest number of pipelines. In contrast, FX-[64 15] has the smallest resource requirement and supports the largest number of pipelines. Note that the number of pipelines is proportional to the performance, so the configuration with the highest number of pipelines also shows the best performance. Thus, configuration FX-[64 15] for the Gaussian kernel supports maximum number of pipelines and has the highest performance. It also has the lowest accuracy. In general, there is an inverse relationship between accuracy and performance. TANOR is flexible enough to use FL configurations for high accuracy and FX for high performance.
The latency refers to the number of clock cycles required to compute each kernel. The performance metric is latency insensitive because the implementation uses a fully pipelined architecture for the highest performance. The power numbers correspond to maximal utilization of resources in the FPGA. Since maximum possible number of pipelines are mapped to each FPGA, there is very little difference in the power numbers between different configurations.
Comparison with GRAPE-6
Next, we compare the hardware generated automatically by TANOR with the GRAPE-6 architecture that was custom designed for numerical simulation of gravitational interactions. The results are shown in Table 7 . We see that the hardware generated by TANOR achieves a performance comparable to the GRAPE-6 chip in terms of FLOPS and power consumption. A direct comparison is difficult because of the differences in underlying technology (0:25 versus 0:13 m) and implementation style (ASIC versus FPGA) of the two systems. However, these results serve as an indicator of the quality of the output generated by TANOR.
Algorithm-Architecture Codesign
In this section, we illustrate the use of TANOR for joint exploration of algorithmic and architectural options. The results presented in the next two sections are from actual synthesis, P&R steps.
Example 1
The area, power, and accuracy trade-offs for different configurations for the Gravitational kernel are shown in Fig. 15 . The results are for a single pipelined structure. We see that the accuracy, area, and power consumed increase with increase in the number of mantissa bits. We also show the effect on accuracy for two types of data traversal schemes, namely, PT and GT, in Fig. 15c . The use of GT with mantissa bit width of 16 can achieve accuracy comparable to that of PT with mantissa bit width of 20. Note that this decrease in the bit width translates into a 20 percent reduction in utilized area on the FPGA, and a 19 percent reduction in power consumption, as shown in Figs. 15a and 15b. 
Example 2
Next, we show how TANOR facilitates design space exploration with the computation of the Bessel function J 0 ðxÞ using LUT-based interpolation schemes. The inputs to J 0 ðxÞ are the distances between particles in the target set T and the source set S. Each set contains 5,000 particles randomly generated in the range ð0; 100Þ. The accuracy requirement is of the order of 10 À5 . Table 8 shows the number of additions and multiplications used for address calculation and interpolation. It also lists the memory size, which is expressed in terms of number of table entries to store the precomputed derivatives used in Taylor series expansion. Each individual configuration is labeled PT n (for plain tiling) and GT n (for geometric tiling), where n stands for degree of the Taylor polynomial.
The simplest way to satisfy the accuracy requirement is to apply a direct lookup scheme, PT 0 , which uses a large lookup table because of the high sampling density. Such a table cannot fit into a single XC2VP100 device and so we make use of PT 1 which uses a first degree Taylor polynomial to significantly reduce the table size, at the cost of additional multipliers and adders. PT 2 and PT 8 show that the table size can be further reduced if we continue to increase the degree of the Taylor polynomial.
Next, assume that the target architecture imposes the constraints of 3 KB for table size and 1,000 for number of occupied slices. From Table 8 , we see that PT 1 achieves the accuracy and slice constraints but violates the memory constraint. The Taylor polynomial degree is increased and the memory constraint is satisfied for degree eight corresponding to PT 8 . However, this requires a large number of slices that far exceeds the area constraint. Since it is impossible to find a balance between area, accuracy, and memory only by changing the degree of the Taylor polynomial in conventional PT schemes, a GT-based scheme GT 2 is applied. It uses a lower degree of the Taylor polynomial (degree two in this example), and thus, requires fewer slices and multipliers, fewer number of lookup table entries, and much smaller memory.
Model Validation
The results presented in Sections 6.6.1 and 6.6.2 were generated from actual synthesis, P&R steps. For larger designs, this may easily take hours. In comparison, the area and power estimates using the models described in Section 4.2 take a few seconds or less. In this section, we compare the estimation results with those obtained by actual synthesis followed by P&R for a representative set of examples. Table 9 shows the results for Gaussian kernel (Gauss) and Bessel function (J 0 ðxÞ ) [38] . Both functions are implemented using LUT-based interpolation schemes in fixed point; the notation ðn; X; Y Þ means n is the degree of Taylor polynomial, X is the maximum number of total bits, and Y is the maximum number of fraction bits. For the eight configurations, the average error is 6.84 percent for the number of slices and 5.90 percent for the total power. The results of block RAM and 18 Â 18-bit Multipliers are not shown because there is no mismatch between the estimated and synthesized results. The accuracy of our model is quite high even for large designs. For instance, for 8-point FFT implemented in floating-point format, the average error is 1.86 percent for the number of slices and 3.49 percent for the total power [38] .
CONCLUSION
We have presented TANOR, a framework for automatic generation of efficient system architectures for accelerating numerical computations on a reconfigurable platform. TANOR generates the desired hardware modules and a software data communication interface, starting from a high-level MATLAB description. It incorporates a highlevel synthesis flow and supports custom fixed-point and floating-point configurations. It is also capable of supporting many transcendental functions used in scientific simulation and signal and image processing through LUT generation. Finally, TANOR enables the joint exploration of algorithmic and architectural options.
We have demonstrated TANOR's capabilities with three different N-body applications though TANOR can be applied to a wide spectrum of DSP applications. TANORgenerated accelerators are shown to be competitive with other existing custom-designed hardware accelerators such as GRAPE-6. Further, by coexploration of variable precision architectures and data traversal schemes, TANOR automation flow is able to achieve a 20 percent reduction in resource utilization, and a 19 percent reduction in power consumption while maintaining comparable accuracy. Xiaobai Sun is a professor of computer science at Duke University. Her research interests and efforts focus on numerical algorithm design and analysis, especially, in bridging and blending mathematical models and computer architectures for scientific simulation and signal processing.
. For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
