Abstract: Because of the increasing need to develop efficient high-speed computational kernels, researchers have been looking at various acceleration technologies. One approach is to use field programmable gate arrays (FPGAs) in conjunction with general purpose processors to form what are known as high performance reconfigurable computers (HPRCs). HPRCs have already been shown to work well for both fixed-point and integer calculations. Floating-point calculations are a different matter; obtaining speedups has been somewhat elusive. This article, after introducing the three primary HPRC development flows, takes a detailed look at "the three p's," which addresses the crucial relationship among performance, pipelining, and parallelism. It also examines "the FPGA design boundary," which addresses some of the heuristics that allow developers to determine which application modules can be mapped onto the FPGAs. These ideas are illustrated by way of a simple floating-point application that is mapped onto a contemporary HPRC. This article expands upon earlier work by including details on how to map customized intellectual property cores into an HPRC environment via a hybrid development flow.
I. INTRODUCTION
The idea of a "fixed plus variable structure" reconfigurable computer (RC) has been around since 1960 and is attributed to Estrin [10]. However, the technological limitations of that era, such as bulky motherboards and manual wiring harnesses, thwarted the development of the RC. In addition, most of the research during the following decades was focused on the general purpose processor (GPP). The revival of the RC was precipitated by Freeman's invention of the field programmable gate array (FPGA) in 1984 [36]. Less than a decade later, in 1991, Algotronix introduced the CHS2x4, which is considered to be the first commercially available RC [18]. In 1996, Seymour R. Cray's start-up company, SRC Computers, Inc., developed the SRC-6, which is arguably the first commercially successful high performance reconfigurable computer (HPRC) [29]. Commercial HPRCs from vendors such as SRC Computers, SGI, and Cray have ushered in a new era in the field of HPRC research.
This article, which is an expanded version of [28] , describes "the three p's," which addresses the crucial relationship among performance, pipelining, and parallelism, and it examines "the FPGA design boundary," which considers heuristics that allow developers to determine which application modules can be mapped onto the FPGAs. This expanded version also includes details on how to map intellectual property (IP) cores into an HPRC development environment via the hybrid design flow.
The rest of this article is organized as follows: Section II is a brief look at related research. Section III describes the principal HPRC application development approaches. Section IV gives a simple example that illustrates why mapping floating-point applications onto HPRCs can be challenging. Section V, which is really the focus of the article, takes a look at various HPRC application design considerations. Section VI expands upon the earlier work by mapping IP cores into the HPRC development environment via the hybrid design approach. Section VII identifies potential future work and provides a conclusion.
II. RELATED RESEARCH
Researchers have had success in mapping some kernels to FPGAs, especially in the area of fixed-point and integer calculations. Taher et al. describe an implementation of a generic wavelet filter on the SRC-6 HPRC [32]. This filter facilitates the implementation of a wide range of discrete wavelet transform filters and outperforms conventional software by a factor of at least 10.5. El-Ghazawi et al. implement an HPRC version of an automatic wavelet-based dimension reduction algorithm for preprocessing of hyperspectral imagery, which garnered an order of magnitude speedup over software [9].
Buell et al. successfully port the Defense Advanced Research Projects Agency Benchmark 5, which matches short bit strings against very long bitstreams, onto a contemporary HPRC. This was significant in that the benchmark itself measures the performance of high productivity computing systems, and the HPRC used in that research achieved maximum performance [4]. Using a simple application that computes the ratio of two fifth-degree polynomials, Kindratenko et al. provide an excellent tutorial on the use of the SRC-6 HPRC [20].
Catanzaro and Nelson show that using higher radices for floating-point representations can result in up to a 30 percent smaller area-time product while "delivering equal worst-case and better average-case numerical accuracy" [5]. Wang et al. develop a library of variable-precision floating-point cores that allow a developer to use something other than the 32-bit or 64-bit precision afforded by GPPs [34]. Van Court and Herbordt recommend considering appropriate arithmetic precision, latency hiding, and using all of the FPGA chip resources (it does not make sense to throw away available computational capacity) [33].
For applications that require rigorous adherence to IEEE 754 floating-point arithmetic, Govindu et al. present a library of deeply pipelined, parameterizable floating-point cores that are able to achieve a frequency of up to 170 MHz on a Xilinx Virtex-II Pro [11]. During a panel session at Supercomputing 2006, El-Ghazawi et al. raise the question "Is high-performance reconfigurable computing the next supercomputing paradigm?" [8]. HPRCs have even entered into the embedded systems world. In April 2007, Lockheed Martin chose SRC Computers, Inc. to provide embedded HPRC systems for the U.S. Army's TRACER program [17].
Despite these advances, application development for HPRCs still poses a number of challenges, especially for floating-point scientific applications, where obtaining a speedup can sometimes be elusive. The application described in this article is one example, and there are others. In their research on HLL-HDL transformation, Böhm and Hammes did not achieve a speedup for a Gauss-Seidel iterative solver because of the loop-carried dependence associated with floating-point cores [2]. They state that "this and other floating-point macros have not been integrated in the MAP compiler yet." Akella et al. did not speed up their floating-point sparse-matrix vector multiply (SMVM) kernel [1] because the floating-point accumulators in the Carte 2.1 compiler do not allow for fully pipelined inner-loop accumulation. They conclude that "performance is still about 2-2.55 times slower than software."
Despite the complexity, some researchers have mapped floating-point algorithms onto HPRCs and shown promising results. DeLorimier et al. describe a scalable SMVM implementation on modern FPGAs and show that it can sustain high throughput and near-peak floating-point performance [7] . Morris and Prasanna achieve better than a twofold speedup for an IEEE 754 double-precision floating-point sparse matrix iterative solver and estimate that the same design on a next-generation HPRC could achieve a six-fold speedup [24] . Kindratenko et al. obtain a tenfold speedup for a double-precision floating-point implementation of a two-point correlation function [19] . Clearly, there are circumstances under which speedups can be obtained, even for floating-point applications.
III. APPLICATION DEVELOPMENT FLOWS
An application targeted for an HPRC can be divided into two sets of modules: software (SW) modules and FPGA modules. SW modules, which are written in a traditional high-level language (HLL), are used to produce binary executables targeted for GPPs. Developers employ standard software development tools such as editors, compilers, debuggers, linkers, etc., to design, debug, and produce the executable. FPGA modules are used to create the configuration bitstreams that are loaded onto the FPGAs to produce the reconfigurable application-specific processors. These modules can be written using a standard hardware description language (HDL) such as Verilog [16] or VHDL [15], an enhanced HLL such as Carte C [29] or JHDL [3], or a combination of HDL and HLL. Developers employ FPGA development tools and, in the enhanced HLL-based cases, specialized HLL-HDL compilers to produce the FPGA configuration bitstream. Each design flow introduces certain challenges, which are described below. Development of software modules is not the focus of this article, so only the development of FPGA modules is covered in the sections that follow.
A. HDL Development Flow
The HDL-based flow is illustrated in Fig. 1. The FPGA module design entry is accomplished using a standard HDL. Vendor-specific IP cores are used to implement features such as GPP and memory interfaces. The design is submitted to the synthesis tool (SYNTH), which produces a netlist. The IP core and FPGA module netlists are submitted to the place and route (PAR) tool, which produces a target-specific circuit description. Finally, a bitstream is produced by the bit generation tool (BITGEN) and loaded onto the FPGA. The resulting application is executed in a cooperative manner by the GPP and FPGA. There has been little success using the HDL-based design flow to accelerate scientific applications. There is significant coupling between the software and FPGA modules, i.e., the software module must make application programmer interface (API) calls to access and control the FPGA module. There is also the complexity of hardware design and its impact on developer productivity. The net result is that this primitive development flow is not suitable for mainstream HPRC application development.
B. HLL Development Flow
Estrin acknowledges the need for "higher level languages for man-machine communication" in his seminal work on RCs [10]. Both private industry and the research community have responded to this need by developing enhanced HLLs to program HPRCs. In addition to Carte C and JHDL, which were mentioned above, other popular HLL-based FPGA module development languages include Handel-C [6] and Mitrion-C [22]. These enhanced HLLs and the associated HLL-HDL compilers typically support pipelined loops, parallel code sections, synchronization primitives, communication channels (or streams), and other features that allow development of high performance FPGA kernels. As Fig. 2 indicates, the HLL-based flow starts with a design entry that is coded in an enhanced HLL. The intellectual property interface (IPI) allows HLL access to the vendor-supplied IP. The HLL is fed into an HLL-HDL compiler, which emits an HDL that is processed by the standard FPGA tool chain to produce the configuration bitstream. The HLL-based approach is certainly better than the HDL-based flow because there is minimal coupling between the software and FPGA modules, and the coding in an HLL allows for increased developer productivity. However, there are still challenges associated with this flow. The vendor IPI limits flexibility by restricting the developer to vendor-specific IP cores. Furthermore, this development flow can be somewhat misleading; even though it uses an HLL and looks like software, it is not software and should not be approached as such. Lastly, because of the deep pipelining needed to achieve high performance, it is often difficult to speed up floating-point computations.
C. Hybrid Development Flow
The hybrid-based development flow, shown in Fig. 3 , is essentially the same as the HLL-based flow except it allows for an extensible IPI that accommodates vendor IP cores, user-defined HDL designs, and even third-party IP cores. These features make this approach more flexible than the HLL-based flow and will help to bridge the gap between hardware developers and scientists. The hardware developers would be responsible for the development of a library of customized IP cores and their associated HLL interface; scientists would reference the cores via a simple software function call. The specifics of the hybrid approach vary depending upon the target HPRC and compiler; for example, the Carte compiler refers to this as "integrating user macros" [30] , but the general concepts are the same. The challenges associated with this approach include all those associated with the HLL-based flow coupled with the need for both hardware design and integration of the HDL and IP cores. Section VI provides more details on this integration process for a modern HPRC development environment.
IV. MAPPING PROBLEM
Dividing work between the GPP and the FPGA is not a trivial task. Despite the advances in HLL-HDL compilers, the technology still remains far from a recompile and go approach. Taking existing software codes and compiling them on an HPRC via an HLL-based flow will not generally yield a speedup and may even result in a slowdown. For example, Park's attempt to accelerate the Blowfish algorithm resulted in a fortyfold slowdown when implemented using DIME-C [26] . Developing or porting applications to HPRCs is still an art form, which relies heavily on the skill and experience of the developer. To illustrate the problem, the runtime of a software-only version of a simple quadratic equation solver is compared with the runtime of a naive HPRC implementation of the same algorithm.
A. Software-Only Version
The pseudo code for the software-only main routine is shown in Fig. 4, and the software-only version of the quadratic equation solver is idealized in Fig. 5. These routines were coded in C in a straightforward manner and, to obtain maximum performance, compiled at -O3 using the Intel 8.1 compiler on an SRC-6 HPRC. Note that the software version only used the Xeon GPP on the SRC-6, not the FPGAs. In this pseudo code, n is the number of equations to be solved, vectors a_n, b_n, and c_n are the coefficients for each equation, and vector x_2n, which is twice as long as the other vectors, holds the solution pair for each equation.

Fig. 4 (software-only main routine):
1: ...
2:   start ← now
3:   allocate(a_n, b_n, c_n, x_2n)
4:   load(n, a, b, c)
5:   SWQE(n, a, b, c, x)
6:   elapsed ← now − start
7:   outputResults(elapsed, n, x)
8: end procedure

Fig. 5 (software-only quadratic equation solver):
1: procedure SWQE(n, a_n, b_n, c_n, x_2n)
2:   for i in [0, n) do
3:     ...
4:     ...
5:   end for
6: end procedure

B. Naive HPRC Version
The HPRC main routine is shown in Fig. 6, and the naive FPGA module is shown in Fig. 7. The FPGA module (MAP function in SRC parlance) was developed from the software code by adhering to the minimal requirements of the Carte compiler. The MAP number parameter, m, was added to the call interface. As suggested by lines 3-5, direct memory access (DMA) was used to load the coefficient vectors (a, b, c) into on-board memory (OBM) arrays (A, B, C). The loop was modified to read the coefficients from OBM arrays and to put the results into OBM arrays, which are then transferred back to the GPP via DMA.

Fig. 6 (HPRC main routine):
1: ...
2:   start ← now
3:   cache aligned allocate(a_n, b_n, c_n, x_2n)
4:   load(n, a, b, c)
5:   allocate FPGA(m)
6:   NaiveQE(n, a, b, c, x, m)
7:   elapsed ← now − start
8:   outputResults(elapsed, n, x)
9: end procedure

Fig. 7 (naive FPGA module):
1: procedure NAIVEQE(n, a_n, b_n, c_n, x_2n, m)
2:   ...
3:   dma(A, a, n) // DMA values into OBM banks
4:   dma(B, b, n)
5:   dma(C, c, n)
6:   for i in [0, n) do
7:     ...
8:     ...
9:   end for
10:  dma(x, DE, 2n) // DMA results back to GPP
11: end procedure
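For reference, the computation that both the software-only and the naive HPRC versions perform can be sketched in Python as follows. This is a minimal sketch rather than the authors' C code: the function name `sw_qe`, the interleaved layout of `x`, and the restriction to real roots (nonzero a and a non-negative discriminant) are illustrative assumptions that do not come from the original.

```python
import math

def sw_qe(a, b, c):
    """Solve a[i]*x^2 + b[i]*x + c[i] = 0 for every i (real roots assumed)."""
    n = len(a)
    x = [0.0] * (2 * n)  # twice as long as the coefficient vectors
    for i in range(n):
        root = math.sqrt(b[i] * b[i] - 4.0 * a[i] * c[i])
        # Assumed layout: the solution pair for equation i sits at 2i and 2i+1.
        x[2 * i] = (-b[i] + root) / (2.0 * a[i])
        x[2 * i + 1] = (-b[i] - root) / (2.0 * a[i])
    return x
```

The loop body contains two divisions, a square root, and several multiplications per equation, which is the floating-point work that the FPGA module has to pipeline.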
C. Results
The main routines were instrumented with a µs-resolution timer to capture the wall clock runtime, and both versions were used to solve multiple sets of quadratic equations. A shell script executed each version 100 times for each of the test sizes in order to obtain average runtimes. Table I shows the results of this simple experiment.
V. DESIGN CONSIDERATIONS
This section takes a detailed look at "the three p's," which highlights the crucial relationship among performance, pipelining, and parallelism. It also examines "the FPGA design boundary," which addresses some of the heuristics that allow developers to determine which application modules can be mapped onto the FPGAs.
A. The Three P's
FPGA clock rates are in the 100s of MHz range, whereas GPP clock rates are on a GHz scale. Given this order-of-magnitude advantage on the part of GPPs, it is clear something must be done at the design level to compensate, i.e., that certain guidelines must be followed in order for an FPGA to compete with a GPP. As suggested by Fig. 8, the performance of an algorithm on an FPGA is proportional to the extent to which it is pipelined and parallelized. This multiplicative effect, which is known as the three p's, expresses the important relationship among performance, pipelining, and parallelism [23]. Specifically, both operations and datapaths must be pipelined and parallelized. In mapping an algorithm onto an HPRC, failure to either pipeline or parallelize the kernel generally results in poor performance. Section IV included a naive HPRC implementation of the quadratic formula. That experiment showed that a naive HLL-based flow approach might not provide a speedup (in that particular case, there is actually a significant slowdown). The same algorithm will now be considered in light of the three p's in order to pipeline the implementation and extract available parallelism. Even though an HLL-based flow will eventually be used, an HDL-based flow will initially be assumed. Three steps are involved: (a) start with the algorithm and make appropriate component substitutions, (b) create a data flow graph (DFG) to uncover operation-level parallelism (OLP), and (c) equalize DFG path lengths to create a fully pipelined design. Some of these capabilities have been incorporated into HLL-HDL compilers, but the thought process for an HDL-based flow is still useful in an HLL-based environment.
First, it is imperative to start with the algorithm, not the code, as shown in Fig. 9(a). Component substitutions based on size, latency, clock rate, etc., are made. In the example, two dividers and one multiplier are replaced with one divider and two multipliers, which completely hides the long latency of the output dividers. To avoid the use of a power function macro, $b^2$ is calculated as $b \times b$. Second, as shown in Fig. 9(b), a DFG is created to account for true data dependencies that may occur and to uncover available parallelism. There is a four-way OLP at the input and a two-way OLP at the output of the quadratic equation DFG.
Third, in the pipelining portion of this approach shown in Fig. 9(c), each path length must be equalized in order to synchronize the data. Equalizing the path lengths begins by annotating each of the components with its latency, e.g., the divider has a latency of $\alpha_d$. Then, the longest subpaths (critical paths) are identified. Finally, delay registers are added. The result is a fully pipelined and parallelized quadratic equation datapath, which has a latency of $\alpha_q = 3\alpha_m + 2\alpha_a + \alpha_s$.
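The path-equalization step can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the DFG below is transcribed from the atomic operations of the solver, the node names are chosen here, and the latency values in `LAT` are hypothetical placeholders (only the 58-cycle divider figure appears in this article). The function computes when each operation finishes along its longest input path and how many delay-register stages each edge needs so that both operands arrive together.

```python
# Hypothetical component latencies in clock cycles (assumed for illustration).
LAT = {"mul": 10, "add": 8, "sqrt": 30, "div": 58, "neg": 1}

# Each node maps to (operation, source nodes); names not listed are primary inputs.
DFG = {
    "a2":  ("div",  ["a"]),          # 0.5 / a
    "bb":  ("mul",  ["b", "b"]),     # b * b
    "ac":  ("mul",  ["a", "c"]),     # a * c
    "mb":  ("neg",  ["b"]),          # -b
    "ac4": ("mul",  ["ac"]),         # 4 * ac
    "D":   ("add",  ["bb", "ac4"]),  # bb - ac4
    "sqr": ("sqrt", ["D"]),
    "bP":  ("add",  ["mb", "sqr"]),
    "bM":  ("add",  ["mb", "sqr"]),
    "x1":  ("mul",  ["bP", "a2"]),
    "x2":  ("mul",  ["bM", "a2"]),
}

def schedule(dfg, lat):
    """Return (finish time per node, delay-register stages per edge)."""
    finish = {}

    def done(node):
        if node not in dfg:          # primary input: available at cycle 0
            return 0
        if node not in finish:
            op, srcs = dfg[node]
            finish[node] = max(done(s) for s in srcs) + lat[op]
        return finish[node]

    for node in dfg:
        done(node)

    delays = {}
    for node, (op, srcs) in dfg.items():
        start = finish[node] - lat[op]       # cycle at which the operation begins
        for s in srcs:
            delays[(s, node)] = start - done(s)
    return finish, delays
```

With these assumed latencies, both outputs finish at cycle 76, which equals 3·10 + 2·8 + 30, and the divider result needs 8 delay stages before the output multipliers, i.e., the divider sits off the critical path.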
Ideally, several of these fully pipelined datapaths should be used in parallel to implement the FPGA module. In practice, the limited number of concurrent OBM bank accesses in the target HPRC precluded implementing more than one datapath. This violates the three p's, so (not too surprisingly) an actual speedup was not obtained. Nonetheless, to illustrate the concepts, these ideas will still be used to implement an improved solver. When using an HLL-based approach, one should break up monolithic algebraic expressions into the discrete assignments shown in the DFG, e.g., Fig. 9(b). The subsequent parallelization and pipelining can often be handled by the HLL-HDL compiler. In the following examples, based on the quadratic equation solver presented earlier, two different input/output (I/O) approaches are used: 1) streaming DMA for output, and 2) streaming DMA for input. The idea is to overlap communication with computation. Note that the target HPRC does not support simultaneous streaming DMA to and from the GPP.

Fig. 10 (Stream Out FPGA module):
1: ...
2:   ...
3:   declare stream(S1, S2)
4:   dma(A, a, n) // DMA values into OBM
5:   dma(B, b, n)
6:   dma(C, c, n)
7:   parbegin
8:     for i in [0, n) do // parallel w DMA block
9:       a2 ← 0.5/A_i // break into atomic ops
10:      bb ← B_i · B_i // to be parallelized
11:      ac ← A_i · C_i // and pipelined
12:      mb ← −B_i // by HLL-HDL compiler
13:      ac4 ← 4 · ac
14:      D ← bb − ac4
15:      sqr ← √D
16:      bPsqr ← mb + sqr
17:      bMsqr ← mb − sqr
18:      put stream(S1, bPsqr · a2) // 2 streams to
19:      put stream(S2, bMsqr · a2) // dma block
20:    end for
21:    ...
22:    stream dma(x, S1, S2, 2n) // DMA to x
23:    ...
24:  parend
25: end procedure
Stream Out HPRC Version: In lines 4-6 of the pseudo code in Fig. 10, the FPGA module employs the same DMA input approach used in the naive implementation shown in Fig. 7. However, the output uses a dual-streaming DMA approach. The loop at line 8 and the output block at line 21 operate in parallel, as suggested by the parbegin construct at line 7. Additionally, loop computations are broken up into atomic operations to allow the HLL-HDL compiler to pipeline and parallelize the loop body. The bottom of the loop produces a pair of streams, S1 and S2, which are consumed by the output DMA on line 22. In the naive approach, the loop latency of the actual code was 169 clock cycles. In this approach, which exhibits a 4 × 2 OLP, the loop latency has dropped to 120 clock cycles because the HLL-HDL compiler finds more parallelism.
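The decomposition into atomic operations can be checked in software. The sketch below mirrors the loop body of the Stream Out pseudo code in Python; the function names are hypothetical, and Python itself performs no pipelining, so the only point is that the decomposed form computes the same roots as the textbook formula while exposing the independent operations.

```python
import math

def roots_monolithic(a, b, c):
    """Textbook quadratic formula with two output divisions."""
    d = math.sqrt(b * b - 4.0 * a * c)
    return (-b + d) / (2.0 * a), (-b - d) / (2.0 * a)

def roots_atomic(a, b, c):
    """Same computation split into atomic operations."""
    a2 = 0.5 / a        # a single divider replaces the two output dividers
    bb = b * b          # a2, bb, ac, and mb are mutually independent
    ac = a * c
    mb = -b
    ac4 = 4.0 * ac
    d = bb - ac4
    sqr = math.sqrt(d)
    bp_sqr = mb + sqr   # bp_sqr and bm_sqr are mutually independent
    bm_sqr = mb - sqr
    return bp_sqr * a2, bm_sqr * a2
```

The first four assignments correspond to the four-way OLP at the input, and the last two additions correspond to the two-way OLP at the output.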
Fig. 11 (Stream In FPGA module):
1: procedure STREAMINQE(n, a_n, b_n, c_n, x_2n, m)
2:   declare obm(DE_n) // only need result array
3-9: ...
10:    for i in [0, n) do // parallel w DMA block
11:      get stream(A, SA) // each loop has
12:      get stream(B, SB) // new A, B, C
13:      get stream(C, SC)
14:      a2 ← 0.5/A // break into atomic ops
15:      bb ← B · B // as before
16:      ac ← A · C // to assist
17:      mb ← −B // HLL-HDL compiler
18:      ac4 ← 4 · ac
19:      D ← bb − ac4
20:      sqr ← √D
21:      bPsqr ← mb + sqr
22:      bMsqr ← mb − sqr
23:      D_i ← bPsqr · a2 // store result
24:      E_i ← bMsqr · a2 // in OBM banks
25:    end for
26:  parend
27:  dma(x, DE, 2n) // DMA results back to GPP
28: end procedure

Stream In HPRC Version: In line 27 of the pseudo code shown in Fig. 11, the FPGA module uses the same DMA output approach used in the naive implementation. However, the input uses three streaming DMAs, which are done in parallel, as suggested by lines 6-8. The loop computations are done in parallel with the streaming input, so the loop appears to be operating with scalar values of A, B, and C during each iteration. In this approach, which also exhibits a 4 × 2 OLP, the loop latency of the actual code has dropped to 113 clock cycles.
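The overlap of communication with computation in the Stream In version can be emulated in software. The following Python sketch is an analogy only and is not Carte code: `stream_dma`, `stream_in_qe`, the bounded queue depth, and the layout of the returned vector (all first roots followed by all second roots, suggested by the single `dma(x, DE, 2n)` transfer) are assumptions made for illustration. Three feeder threads play the role of the streaming DMAs, and the loop consumes one scalar A, B, and C per iteration.

```python
import math
import queue
import threading

def stream_dma(values, stream):
    """Feeder thread: push one coefficient vector into a stream, in order."""
    for v in values:
        stream.put(v)  # blocks when the (bounded) stream is full

def stream_in_qe(a, b, c, depth=8):
    n = len(a)
    sa, sb, sc = (queue.Queue(maxsize=depth) for _ in range(3))
    feeders = [threading.Thread(target=stream_dma, args=(vals, s))
               for vals, s in ((a, sa), (b, sb), (c, sc))]
    for t in feeders:
        t.start()                      # the three streams run in parallel
    d = [0.0] * n
    e = [0.0] * n
    for i in range(n):                 # compute while data is still arriving
        A, B, C = sa.get(), sb.get(), sc.get()
        a2 = 0.5 / A
        sqr = math.sqrt(B * B - 4.0 * (A * C))
        d[i] = (-B + sqr) * a2
        e[i] = (-B - sqr) * a2
    for t in feeders:
        t.join()
    return d + e                       # assumed layout: D block, then E block
```

Because each stream preserves order, the loop sees the coefficients of equation i together even though the three transfers proceed independently.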
Results: As with the naive case presented earlier, these two implementations were run multiple times on multiple data sets. Table II shows the results of the experiments. As expected, the single datapath violated the three p's, so there was not an overall speedup. Nonetheless, these experiments are useful in that they illustrate a basic approach for designing a pipelined, parallelized datapath. In addition, they show that both datapath parallelism and OLP are needed. Of particular note is that the pipeline length dropped from 169 cycles in the naive case to 120 cycles and 113 cycles for the Stream Out and Stream In cases, respectively. The performance issue will be taken care of via the next generation of HPRCs, which have (among other improvements) a larger number of OBM accesses per clock cycle, i.e., the ability to have multiple datapaths executing in parallel.
B. FPGA Design Boundary
Determining the "FPGA design boundary," i.e., determining which application modules should be mapped onto FPGAs, is not straightforward [23] . As with many engineering disciplines, one must also rely on heuristics derived from empirical observation. Some areas to be considered when determining the FPGA design boundary are itemized below.
• The three p's
• Overall speedup (Amdahl's Law)
• Expected resource utilization
• Control/memory vs. compute intensive
• Monolithic module
• Available bandwidth
• Data reuse
• Algorithm design stability
• Algorithm efficiency

The three p's: Perhaps the most important heuristic is the three p's previously described. As shown in the experiments, if a module cannot be pipelined and parallelized, then it is unlikely to achieve high performance when mapped onto an FPGA. Even if a module is three p's compliant, it still needs to have enough data to keep the pipelines filled, i.e., to amortize pipeline latency across multiple problems. Thus, a corollary FPGA design consideration is to ensure the length of the data stream is sufficiently large.
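The need for a sufficiently long data stream can be made concrete with a simple model. The Python sketch below assumes an idealized pipeline that accepts one new item per clock cycle once started and ignores DMA time and stalls; that model and the function names are assumptions for illustration, while the loop latencies of 169, 120, and 113 cycles are the ones reported in this article.

```python
def pipelined_cycles(latency, n):
    """Cycles to push n items through a fully pipelined loop (ideal model)."""
    return latency + n - 1

def pipeline_efficiency(latency, n):
    """Fraction of cycles that deliver a result; approaches 1 as n grows."""
    return n / pipelined_cycles(latency, n)

if __name__ == "__main__":
    for latency in (169, 120, 113):
        for n in (10, 1000, 100000):
            print(latency, n, round(pipeline_efficiency(latency, n), 3))
```

Under this model, ten equations keep a 113-stage pipeline busy less than 10 percent of the time, whereas a hundred thousand equations amortize the fill latency almost completely.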
Overall speedup (Amdahl's Law): The objective of mapping algorithms to HPRCs is to obtain a speedup relative to the performance of a GPP. Overall speedup can be quantified via Amdahl's Law [13],

$$s_o = \frac{1}{(1 - f_e) + \frac{f_e}{s_e}},$$

where $s_o$ is the overall speedup, $f_e$ is the fraction of the system to be enhanced, and $s_e$ is the speedup of the portion to be enhanced. This particular law serves as a fundamental basis for design decisions. For example, suppose the estimated speedup for the edge detection portion of a target tracking system was a thousandfold, i.e., an FPGA implementation of the edge detection kernel was estimated to run an incredible 1000 times faster than an equivalent software module. Further suppose edge detection accounted for five percent of the runtime. With a thousandfold speedup, intuition says the FPGA-based edge detection kernel would yield a significant overall speedup. However, after applying Amdahl's Law,

$$s_o = \frac{1}{(1 - 0.05) + \frac{0.05}{1000}} \approx 1.05,$$

it is seen that the overall speedup associated with the FPGA-based kernel is hardly worth the effort, and that in this case, fast is not always fast. This relationship between overall speedup and the fraction of a system that can take advantage of the speedup is depicted in Fig. 12. The x-axis represents the fraction of the system to be enhanced ($f_e$), and the y-axis is the overall speedup normalized to the speedup of the portion to be enhanced ($s_o / s_e$). Notice that for $f_e$ = 50 percent, the overall speedup values are only 2.0 for $s_e$ = 100 and $s_e$ = 1000. The lowly $s_e$ = 2 does nearly as well at 1.3. If 75 percent of the system could take advantage of a thousandfold speedup, the overall speedup value will only be 4.0. Even if a whopping 95 percent of the system could use the thousandfold speedup, the overall speedup would only be 20. Thus, the overall speedup will be minimal, especially for large values of $s_e$, unless the module to be placed on the FPGA consumes a significant part of the overall runtime. Clearly, the use of Amdahl's Law in determining the FPGA design boundary is important.

[Fig. 12: normalized overall speedup versus fraction of the system enhanced (10% to 95%), with curves for $s_e$ = 2, 5, 10, 100, and 1000 annotated with actual speedup values.]
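The figures quoted above can be reproduced with a few lines of Python. The sketch uses the standard form of Amdahl's Law, overall speedup = 1 / ((1 − f_e) + f_e / s_e); the function name `overall_speedup` is chosen here for illustration.

```python
def overall_speedup(f_e, s_e):
    """Amdahl's Law: f_e is the fraction enhanced, s_e the speedup of that fraction."""
    return 1.0 / ((1.0 - f_e) + f_e / s_e)

if __name__ == "__main__":
    # Edge detection example: 5 percent of the runtime, thousandfold kernel speedup.
    print(round(overall_speedup(0.05, 1000), 2))
    # Values discussed for 50, 75, and 95 percent of the system.
    for f_e, s_e in ((0.50, 2), (0.50, 100), (0.50, 1000), (0.75, 1000), (0.95, 1000)):
        print(f_e, s_e, round(overall_speedup(f_e, s_e), 1))
```

Evaluating the function for a candidate module before any hardware work begins is a quick way to decide whether the module belongs inside the FPGA design boundary.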
Expected resource utilization: Another important FPGA design boundary consideration is the expected FPGA resource utilization of the candidate module. Since floating-point IP cores can be quite large, the developer needs to determine if the candidate will even fit on the FPGA. The developer also needs to consider the needed local memory capacity, number of simultaneous memory accesses, anticipated clock rate in the light of complex routing, etc.
Control/memory vs. compute intensive: It is also imperative to consider whether an algorithm is control/memory intensive or compute intensive. The control aspect is similar to the branching problem in a GPP, and the memory aspect is similar to a GPP where accessing memory data takes a considerable amount of time compared with arithmetic operations. Harkins et al. illustrate the importance of this concept when they show that sorting algorithms do not perform very well on an HPRC [12] .
Monolithic module: Another design consideration is that hardware cannot call hardware. If the candidate FPGA module contains procedure calls, they have to be inlined or the module cannot be considered as a viable candidate. Obviously, this will be impacted by the available FPGA resources.
Available bandwidth: The GPP to FPGA bandwidth also deserves attention. Obviously, the FPGA memory access and processing time should be less than the GPP memory access and processing time. According to Herbordt et al., when they discuss latency hiding, a design should try to overlap computation with communication [14] . This might minimize the effects of bandwidth limitations. A closely related issue is data reuse, to be discussed next.
Data reuse: Algorithms that have a significant potential for data reuse may be suitable FPGA module candidates. Morris and Prasanna use this principle to speed up two well-known iterative solvers [25] . This is similar to methods used by the GPP where frequently used data are stored in nearby memory such as general-purpose registers or cache.
Algorithm design stability: Since mapping an algorithm to an FPGA is not the easiest of tasks, it is imperative to make sure that the algorithm is as stable as possible. If the algorithm is altered while in the midst of a hardware implementation process, one could easily discover that the new algorithm no longer fits onto the FPGA, or that it can no longer deliver on the promised speedup.
Algorithm efficiency: Another application design consideration is to make sure an efficient algorithm is employed. For example, Cramer's rule, which has factorial complexity, $O(e \cdot (n + 1)!)$, might run faster if implemented on an FPGA. However, Gaussian elimination, with complexity $O(n^3)$, is a much more efficient algorithm. In this case, one would use a software solution rather than map the inefficient algorithm onto an FPGA.
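The size of the gap between the two algorithms can be estimated directly from the complexity expressions. The Python sketch below treats those expressions as rough operation counts and ignores constant factors, which is an assumption made for illustration; the function names are also chosen here, and the thousandfold hardware speedup is a hypothetical figure borrowed from the Amdahl's Law example.

```python
import math

def cramer_ops(n):
    """Rough operation count for Cramer's rule: e * (n + 1)!"""
    return math.e * math.factorial(n + 1)

def gauss_ops(n):
    """Rough operation count for Gaussian elimination: n^3"""
    return n ** 3

if __name__ == "__main__":
    hypothetical_fpga_speedup = 1000  # assumed, not a measured value
    for n in (5, 10, 15):
        accelerated = cramer_ops(n) / hypothetical_fpga_speedup
        print(n, gauss_ops(n), f"{accelerated:.3g}")
```

Already for a 10 × 10 system, the accelerated Cramer's rule count remains roughly two orders of magnitude above the plain Gaussian elimination count under these assumptions, so the algorithm choice matters more than the platform.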
VI. HYBRID-BASED IMPLEMENTATION
The strict HLL-based approach depicted in Fig. 2 cannot be used in all circumstances. Perhaps the developer must use a proprietary IP core, or maybe the HLL-HDL compiler cannot generate HDL that meets timing or resource constraints. In such cases, the hybrid approach depicted in Fig. 3 might offer a suitable solution. This section describes how to map IP cores into an HPRC development environment.
A. Conceptual Overview
In the simplest case, as depicted in Fig. 13, the entire kernel has been implemented as an IP core, and the FPGA module only moves data to and from that core. The key points are that all the computation is done in the IP core, and that access to the IP core looks like a software call because of the interface mechanism.

Pseudo code for this case:
1-6: ...
7:   stream dma(SA, a, n) // stream DMA
8:   stream dma(SB, b, n) // coefficients
9:   stream dma(SC, c, n) // via SA, SB, SC
10:  ...
11:    for i in [0, n) do // parallel w DMA block
12:      get stream(A, SA) // each loop has
13-17: ...
18:  dma(x, DE, 2n) // DMA results back to GPP
19: end procedure
In the general case, as depicted by Fig. 15, only a part of the FPGA module is implemented via IP cores. The rest of the module is implemented using the enhanced HLL. In this latter example, the developer uses a divider IP core but implements the rest of the FPGA module in HLL. If one were to incorporate a divider IP core into the "Stream In" code described earlier, then the FPGA module would resemble the pseudo code in Fig. 16. In either case, the extent of the IP functionality is irrelevant. One must still create an interface to the IP core and then call it from within the HLL-based FPGA module.

Fig. 16 (hybrid FPGA module using the divider IP core):
1-10: ...
11:    for i in [0, n) do // parallel w DMA block
12:      get stream(A, SA) // each loop has
13:      get stream(B, SB) // new A, B, C
14:      get stream(C, SC)
15:      fpdiv(0.5, A, a2) // call fpdiv
16:      bb ← B · B // everything
17:      ac ← A · C // else same as
18:      mb ← −B // before to
19:      ac4 ← 4 · ac // allow
20:      D ← bb − ac4 // HLL-HDL compiler
21:      sqr ← √D // to parallelize
22:      bPsqr ← mb + sqr // and pipeline
23:      bMsqr ← mb − sqr // the module
24:      D_i ← bPsqr · a2 // store result
25:      E_i ← bMsqr · a2 // in OBM banks
26:    end for
27:  parend
28:  dma(x, DE, 2n) // DMA results back to GPP
29: end procedure
B. IP Core Development
For this research effort, the authors created the IP core from VHDL source code. In the general case, this may not be true, i.e., a properly tested and documented off-the-shelf core could be used. The derivation of the divider IP is briefly summarized in this section. Using an offline engineering workstation (not the SRC-6 HPRC), a Xilinx ISE project [35] was built based upon the Rice-Govindu floating-point divider VHDL source code [27], [11]. The floating-point divider top-level interface, which is called an entity in VHDL, is idealized in Fig. 17. A detailed description of the entity is not needed; one must simply note that the two inputs, x and y, are 64-bit quantities, there is a clock, and there is a 64-bit output, z. The details of the implementation, which is called an architecture in VHDL, are also not needed. The important point is that the entity and its architecture are ultimately synthesized into a 58-clock cycle pipelined IEEE 64-bit floating-point divider IP core entitled divIP.edn. A VHDL test bench and set of input vectors were created so the floating-point divider code could be rigorously tested using the ModelSim VHDL simulation environment [21]. The divider VHDL code was subsequently synthesized using Synplify Pro [31], and the resulting electronic design interchange format (EDIF) netlist file (IP core) was then uploaded to the SRC-6 for integration into the HPRC development environment.
C. IP Core Interface
The discussion in this section centers around the Carte compiler and SRC-6 HPRC that were used in this research. If one were to use another compiler or target HPRC, the details would be different, but the concepts would be the same. The basic mapping process is hierarchical, as suggested by Fig. 18. It is, in some sense, a reverse engineering of the EDIF file back to HDL, and then to HLL. The starting point is the EDIF (.edn) netlist file. While not quite accurate, one can think of the EDIF as being a nongraphical representation of the schematic diagram. A key point is that the hardware design encapsulated by the EDIF has a name and a set of input and output ports that 1) must be made visible to the FPGA HDL tool chain, and 2) must be mapped onto an HLL procedure.
The blackbox, as depicted in Fig. 19, is a Verilog module that describes the input and output ports and serves as an HDL interface description. The info file contains a "debug" implementation of the IP core. This is a software-only functional equivalent of the IP core. The Carte environment uses the debug code during early development to verify FPGA module functionality without having to go through the (usually time-consuming) FPGA tool chain, i.e., it executes a software-only functional equivalent of the FPGA module. A representation of the info file is shown in Fig. 20. The MACRO line maps the IP core name "divIP" onto a software procedure name "fpdiv." The information about divIP being a 58-stage pipeline helps the HLL-HDL compiler optimize loops. The input and output ports are mapped onto 64-bit floating-point data type parameters, and the clock input signal is hidden from the caller. Remember, the idea is to make the IP core hardware design look like a software call, so the hardware notions of ports, bits, and clocks are abstracted away. Finally, the info file includes an implementation of the debug functionality.
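To illustrate why the pipeline depth matters for loops, the following toy model treats the divider as a fixed-latency, fully pipelined unit that accepts one operand pair per clock. It is a sketch under stated assumptions, not the Carte info file or the compiler's actual scheduling: `PipelinedDivider` and `run_stream` are hypothetical names, and the cycle-counting convention (a result emerges `latency` clocks after its operands enter) belongs to this model alone.

```python
from collections import deque


class PipelinedDivider:
    """Toy cycle-level model of a fixed-latency, fully pipelined divider."""

    def __init__(self, latency: int = 58):
        self.latency = latency  # must be >= 1 in this model
        # One slot per pipeline stage; None marks an empty stage (a bubble).
        self.stages = deque([None] * latency, maxlen=latency)

    def clock(self, operands=None):
        """Advance one clock: accept an optional (x, y) pair and
        return whatever leaves the last stage (None for a bubble)."""
        out = self.stages[-1]
        value = None if operands is None else operands[0] / operands[1]
        self.stages.appendleft(value)  # maxlen drops the old last stage
        return out


def run_stream(pairs, latency: int = 58):
    """Feed one pair per clock; return (results, clocks needed to drain)."""
    div = PipelinedDivider(latency)
    feed = iter(pairs)
    results, cycles = [], 0
    while len(results) < len(pairs):
        out = div.clock(next(feed, None))
        cycles += 1
        if out is not None:
            results.append(out)
    return results, cycles
```

In this model, streaming 100 operand pairs through a 58-stage divider takes 158 clocks rather than 5,800, because the latency is paid once and each subsequent result follows one clock later. The hardware clock never appears in the caller's view of `fpdiv`, which is consistent with the clock port being hidden from the HLL procedure.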
D. Comparison
For comparison purposes, two versions of the quadratic equation solver were implemented: the software-only version shown in Fig. 5, and the hybrid version shown in Fig. 16. The pseudocode for the respective main routines is shown in Fig. 4 and Fig. 6. Note that the latter code calls DivQE rather than NaiveQE. As with the codes presented in Section V, these two implementations were run multiple times using multiple data sets. Table III shows the results of these experiments. As before, the single datapath violated the three p's, so there was no overall speedup. However, as will be shown in the next section, this does not invalidate the research results.
E. Discussion of Performance on Next-Generation HPRC
The Kiviat diagram in Fig. 21 compares the capability of the SRC-6 HPRC used in this research with the next-generation SRC-7 HPRC. The footprint of the SRC-7 system swamps the footprint of the SRC-6. With nearly an order-of-magnitude increase in available user logic, a threefold increase in OBM bandwidth, a 50 percent increase in clock speed, a two orders-of-magnitude increase in local memory, a more than threefold increase in peak 64-bit GFLOPS, etc., it is obvious that this next-generation HPRC will facilitate significant improvements in floating-point application performance. In the research described in this article, for example, it would be possible to put at least two quadratic equation datapaths onto the FPGAs. If one factors in the 1.5× clock speed improvement, this translates into a threefold performance boost. Add to this the ability to store much larger data sets, and it is clear that the performance deficiencies of the HPRC described in this article are a temporary problem that will be overcome via the next-generation technology. The objective of this research was to show how one would port a computational kernel into an HPRC environment. That objective has been met.
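The threefold estimate above can be written out as a short worked equation. The symbols are introduced here for illustration only, and the estimate assumes that throughput scales linearly with both the number of datapaths and the clock rate, ignoring memory bandwidth and DMA overhead.

```latex
% Assumptions: N_dp (number of datapaths) and f (clock rate) are
% illustrative symbols; throughput is taken to scale linearly with both.
\[
S \;\approx\;
\frac{N_{\mathrm{dp}}^{\mathrm{SRC\text{-}7}}}{N_{\mathrm{dp}}^{\mathrm{SRC\text{-}6}}}
\times
\frac{f_{\mathrm{SRC\text{-}7}}}{f_{\mathrm{SRC\text{-}6}}}
\;=\; \frac{2}{1} \times 1.5 \;=\; 3
\]
```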
VII. FUTURE WORK AND CONCLUSION

A. Future Work
Floating-point computational kernels can sometimes be accelerated when mapped onto the FPGAs of modern HPRCs. Such mapping is currently a manual process that relies upon the skill and experience of the designer. This article summarizes some of the known design heuristics under one umbrella; as new rules of thumb are discovered, the authors plan to "add them to the list," as it were. The long-term goal is for tomorrow's compilers to automate these design heuristics and eliminate some of the trial and error associated with mapping scientific applications onto HPRCs and other heterogeneous architectures.
Jackson State University has recently acquired a multinode SRC-7 cluster with an InfiniBand interconnection network. The vendor has completed the initial installation of this state-of-the-art HPRC cluster and is nearly finished with the remaining field-level upgrades. As previously noted, the increased number of memory banks, the higher clock rates, and the larger FPGA area of these new machines will allow for significant design improvements and a corresponding improvement in speedup. The authors plan to revisit some of their earlier research efforts and map computational kernels such as conjugate gradient onto these next-generation HPRCs. The authors also plan to look at a number of real-world parallel scientific codes and attempt to speed them up by mapping them onto the SRC-7 cluster. The goal, aside from speeding up the applications, is to further refine the design heuristics (what works and what does not work) relative to mapping floating-point applications onto HPRCs.
B. Conclusion
The overall goal of using an HPRC is to obtain better performance than can be achieved via traditional software. At the current time, obtaining this performance edge is somewhat elusive for floating-point applications. This article summarizes many of the heuristics that the authors and other researchers have uncovered. In particular, the article describes the three p's, the FPGA design boundary, and other factors developers must consider when mapping an application onto an HPRC system. A simple example was used to illustrate the concepts. The article included a section dealing with the mapping of customized IP cores into an HPRC development environment. The jury is still out on whether HPRCs will become part of mainstream high performance computing. However, given the positive results from various research efforts, the expected advancements in FPGA technology, the likelihood of additional compiler optimizations, and the documented improvements in next-generation HPRC performance, one can anticipate a larger role for HPRCs.
