In the aerospace industry, computational fluid dynamics (CFD) is a common design tool. Fast Aerodynamics Routines (FaSTAR) is one of the most recent CFD software packages; it is convenient for users because it offers various solvers and automatic generation of grid data. The problem with FaSTAR is that it is hard to execute on parallel machines because of its irregular and unpredictable data structure. Exploiting reconfigurable hardware to make up for the inadequacy of existing high-performance computers has gradually become a trend. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using a large number of chips, the partial reconfiguration capability available in recent FPGAs is explored for this application. The advection term computation module in FaSTAR is chosen as the target subroutine. We propose a reconfigurable flux calculation scheme that uses partial reconfiguration to save hardware resources so that the design fits in a single FPGA. We developed a flux computation module and implemented three flux calculation schemes as reconfigurable modules. This implementation saves up to 42% of resources and speeds up configuration by 6.28 times. Performance evaluation also shows a 2.65 times acceleration compared with an Intel Core 2 Duo at 2.4 GHz.
Introduction
CFD (Computational Fluid Dynamics) has been widely utilized in the design and optimization of fluid flow applications. In the aerospace industry, CFD is a cost-effective design tool for aircraft components. It provides methods to solve and analyze the physical phenomena of fluid flow on discrete space and time. Ground test facilities do not exist for all flight regimes covered by hypersonic flight. No wind tunnel can simultaneously simulate the high Mach numbers and high flow-field temperatures encountered by trans-atmospheric vehicles. Hence, CFD has become the major player in the design of such vehicles. In addition, compressible flow simulations are of vital importance to the aerospace engineering community, which will always seek more accuracy and higher resolution, creating a demand for faster codes and making high-performance computing strategies invaluable.
A typical simulation platform in the aeronautics industry consists of a CFD-specific software application, normally written in a high-level language. FaSTAR (Fast Aerodynamics Routines), a compressible flow analysis code developed by JAXA (Japan Aerospace Exploration Agency), is one such CFD software package [6]. FaSTAR adopts an unstructured mesh as the grid data for the simulation. Although it is a convenient tool for aerodynamic analysis, a simulation sometimes takes several days or weeks when the analyzed area grows large. This is mainly caused by low parallel processing efficiency resulting from pointer links and a complicated memory access pattern. In FaSTAR, the increasing demands for accuracy and simulation capability produce an exponential growth of the required computational resources.
Recently, reconfigurable systems using FPGAs have been utilized to accelerate specific applications, including bio-informatics and financial problems [8, 9]. Even though the early reconfigurable systems did not focus on large-scale numerical scientific applications, the use of FPGAs in such areas has been growing remarkably because of the rapid performance improvement of modern FPGAs, which provide a large number of configurable logic blocks, memory blocks and embedded multipliers. However, although some research works using FPGAs achieved a significant speed-up over software, their targets were simple programs rather than practical software packages.
The goal of our research is to improve the performance of FaSTAR by using FPGAs as a reconfigurable platform. Previously, we proposed out-of-order mechanisms to cope with the unstructured mesh for efficient execution of FaSTAR [1]. Another problem of FaSTAR is its versatile functionality, which relies on many solvers. Implementing the whole package requires many expensive FPGAs, each of a considerable size. Partial reconfiguration is a promising strategy for reducing the total cost. This work is the first trial of applying the partial reconfiguration technique to a practical scientific computation package. The total required hardware becomes large if all functions of the package are implemented in a single design. We can avoid this by preparing in advance a set of subset designs, each containing only the required functions. However, for a complicated application package, the number of designs to prepare becomes large, since there are several parts, each with selectable functions, and the required number of subset designs is the product of the numbers of choices. By introducing partial reconfiguration, we can quickly prepare the appropriate design for the user. This trial is a first step toward using partial reconfiguration for a large practical scientific package, and it may extend the application field of partial reconfiguration. Here, as a target function, we select the advection term computation, which is a large, time-consuming function in FaSTAR.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 explains FaSTAR and the target subroutines. Section 4 introduces the theory of the implemented flux calculation schemes. Section 5 describes the implementation. Section 6 presents the performance evaluation, Section 7 discusses future work, and Section 8 concludes the paper.
Related Work
In fluid dynamics simulation, execution time is one of the largest concerns, and various studies using FPGAs have examined the issue. Andres and others reported results for FPGA acceleration [2, 15]. However, their implementations do not target practical software packages. Another result was reported for an FPGA flow solver based on a systolic architecture for CFD [14]. That work proposed a systolic algorithm for the fractional-step method employing central difference schemes. Although satisfactory performance is achieved, the implementation focuses on systolic algorithms. Partial reconfiguration technology has attracted many researchers as a way of filling the gap between hardware and software for algorithm implementation. Recently, in computer security applications, partial reconfiguration was applied to an implementation of the AES (Advanced Encryption Standard) algorithm [7]. It has also been used to accelerate video processing in a driver assistance system [3]. A few studies have also applied partial reconfiguration to aerospace applications. LaMeres et al. [10] designed and prototyped a computing architecture that dynamically reconfigures the system depending on the environment. Another work presented the verification, testing and results of the SoCWire network-on-chip architecture for safe dynamic partial reconfiguration in space applications [12].
Our previous trial [16] was the first example of using partial reconfiguration in a CFD implementation. However, it was applied to small limiter functions in a subroutine called MUSCL used in UPACS [18]. Although the total hardware is reduced, the effect is limited, since the reconfigurable part occupies only a small portion of the target FPGA.
FaSTAR
FaSTAR is a CFD software package developed by JAXA to simulate compressible flow using unstructured grids [6]. FaSTAR consists of many solvers with multiple solutions. Its source code is written in Fortran 90 with the message passing interface (MPI). By choosing certain solvers, users can select among the solutions supported by the application and run simulations in parallel on their systems without specific software tuning. Users only need to prepare a parameter file and a grid data file before the simulation. Table 1 shows example solvers available in FaSTAR. By combining these solvers, a user can simulate with the desired solutions. To apply partial reconfiguration to selectable solutions, profiling is required first. Then, among the subroutines that take a long computation time, a part in which multiple schemes are available is selected.
Target Subroutine
As a first step, we profiled the execution time of FaSTAR to find out which routines consume the highest percentage. The compiler used is Intel Fortran Compiler 10.1 on an Intel Core 2 Duo processor at 2.66 GHz with Linux kernel 2.6. The profiling results are shown in Figure 1.
The code was compiled for a single core without MPI. The result indicates that more than 60% of the total execution time is occupied by two calculation parts: the limiter and the advection term. The FaSTAR limiter part is difficult to implement in reconfigurable hardware because of its complicated iteration. We therefore selected the advection term, since it occupies a large part of the total computation time and has a selectable function whose hardware requirement is relatively large.
The advection term consists of three subroutines:
• pre-processing,
• flux calculation, and
• surface integral of flux.
In pre-processing, the data for a specific cell number are prepared. In the flux calculation part, the flux is obtained by applying these data to the scheme equations. Making the scheme selection reconfigurable by means of partial reconfiguration is the focus of this paper. The surface integral of flux performs the finite volume discretization of space; we reported our study of this part in [1].
Flux Calculation Scheme
In the flux calculation subroutines, three schemes are available for selection: Roe's scheme, the HLLE scheme and the HLLEW scheme [13, 4, 11]. We describe how to compute the inviscid flux using an approximate Riemann solver. As shown in Figure 2, the conserved quantities $Q_{na}$ and $Q_{nb}$ in cell A and cell B are used to determine the flux $F_n$. In this case, with respect to cell A, the normal vector $d\mathbf{s}$ points in the same (rightward) direction as the flux. With respect to cell B, on the other hand, the opposite direction is positive.
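The sign convention can be illustrated with a minimal Python sketch. The names `accumulate_flux`, `faces` and `face_flux` are hypothetical and are not taken from FaSTAR, and each face flux is assumed to be an already computed scalar for brevity.

```python
def accumulate_flux(num_cells, faces, face_flux):
    """Sum face fluxes into per-cell totals.

    faces     : list of (cell_a, cell_b) index pairs, one per face
    face_flux : list of F_n values, each evaluated as viewed from cell A
    """
    total = [0.0] * num_cells
    for (cell_a, cell_b), f_n in zip(faces, face_flux):
        total[cell_a] += f_n  # viewed from cell A: +F_n
        total[cell_b] -= f_n  # viewed from cell B: -F_n
    return total


# Three cells in a row sharing two interior faces.
totals = accumulate_flux(3, [(0, 1), (1, 2)], [2.0, 0.5])
```

Because every interior face contributes $+F_n$ to one cell and $-F_n$ to its neighbour, the per-cell totals sum to zero, which is the conservation property the finite volume method relies on.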
Roe's Scheme
The method proposed by Roe [13] is based on an upwind linearization of the Euler equations. The numerical flux function can be written as follows: In this example, $F_n$ is evaluated as viewed from cell A. Viewed from cell B, a negative sign is added and it becomes $-F_n$. In detail we may write:
$f(Q_{nb})$ and $Q_{nb}$ are obtained in the same way. In addition, $A$ is the Jacobian matrix of the flux,
$R$ is the matrix of right eigenvectors, $\Lambda$ is the matrix of eigenvalues, and $R^{-1}$ is the matrix of left eigenvectors. A matrix with the subscript ave is composed of variables obtained by the Roe average. The Roe average can be expressed as follows:
Here, $c$ is the speed of sound, $H$ is the total enthalpy per unit mass, and $\gamma$ is the ratio of specific heats. Then, we can write the matrices in detail as follows:
Here, the subscript ave is omitted, and $u = u_n$, $v = u_{t1}$ and $w = u_{t2}$.
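For reference, the standard textbook form of Roe's numerical flux and of the Roe average is sketched here. The notation follows the surrounding text, with $\gamma$ denoting the ratio of specific heats as is usual in this formula; the exact form and equation numbering used in the original derivation are an assumption.

```latex
\begin{align}
F_n &= \tfrac{1}{2}\left[ f(Q_{na}) + f(Q_{nb}) - |A|_{ave}\,(Q_{nb} - Q_{na}) \right], \\
|A|_{ave} &= R_{ave}\,|\Lambda|_{ave}\,R^{-1}_{ave}, \\
\rho_{ave} &= \sqrt{\rho_a \rho_b}, \qquad
u_{ave} = \frac{\sqrt{\rho_a}\,u_a + \sqrt{\rho_b}\,u_b}{\sqrt{\rho_a} + \sqrt{\rho_b}}, \qquad
H_{ave} = \frac{\sqrt{\rho_a}\,H_a + \sqrt{\rho_b}\,H_b}{\sqrt{\rho_a} + \sqrt{\rho_b}}, \\
c_{ave} &= \sqrt{(\gamma - 1)\left(H_{ave} - \tfrac{1}{2}\left(u_{ave}^2 + v_{ave}^2 + w_{ave}^2\right)\right)}.
\end{align}
```

The velocity components $v_{ave}$ and $w_{ave}$ are averaged with the same density weighting as $u_{ave}$.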
HLLE Scheme
Einfeldt [4] discussed an adapted version of the HLL scheme [5], called the HLLE (Harten-Lax-van Leer-Einfeldt) scheme, which can be considered a modification of Roe's scheme. This scheme is a stable procedure that approximates the solution with two characteristic waves and remains valid for flows with a strong expansion. However, it has the disadvantage that the numerical viscosity is large. The total flux can be calculated using the following equations:
where,
In this example, $F_n$ is evaluated as viewed from cell A. Again, viewed from cell B, it becomes $-F_n$. The average value $b$ is calculated using the Roe average in Eq. 5.
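A minimal Python sketch of the HLLE flux follows, assuming the commonly cited form with wave speeds $b^+ = \max(u_{ave} + c_{ave},\, u_b + c_b,\, 0)$ and $b^- = \min(u_{ave} - c_{ave},\, u_a - c_a,\, 0)$. It is restricted to the one-dimensional Euler equations with three conserved variables for brevity. The function names and the value of `GAMMA` are illustrative assumptions, not taken from FaSTAR.

```python
import math

GAMMA = 1.4  # ratio of specific heats for air (assumed value)


def primitive(q):
    """Return (rho, u, H, c) from the conserved state q = (rho, rho*u, rho*E)."""
    rho, mom, ene = q
    u = mom / rho
    p = (GAMMA - 1.0) * (ene - 0.5 * rho * u * u)
    H = (ene + p) / rho
    c = math.sqrt(GAMMA * p / rho)
    return rho, u, H, c


def physical_flux(q):
    """Exact Euler flux f(Q) in one dimension."""
    rho, u, H, _ = primitive(q)
    p = (GAMMA - 1.0) * (q[2] - 0.5 * rho * u * u)
    return (rho * u, rho * u * u + p, rho * u * H)


def hlle_flux(qa, qb):
    """HLLE numerical flux between cell A (qa) and cell B (qb), viewed from A."""
    rho_a, u_a, H_a, c_a = primitive(qa)
    rho_b, u_b, H_b, c_b = primitive(qb)
    # Roe averages of velocity, enthalpy and speed of sound.
    w = math.sqrt(rho_b / rho_a)
    u_ave = (u_a + w * u_b) / (1.0 + w)
    H_ave = (H_a + w * H_b) / (1.0 + w)
    c_ave = math.sqrt((GAMMA - 1.0) * (H_ave - 0.5 * u_ave * u_ave))
    # Fastest right-going and left-going signal speeds.
    b_plus = max(u_ave + c_ave, u_b + c_b, 0.0)
    b_minus = min(u_ave - c_ave, u_a - c_a, 0.0)
    fa, fb = physical_flux(qa), physical_flux(qb)
    denom = b_plus - b_minus
    return tuple(
        (b_plus * fa_k - b_minus * fb_k + b_plus * b_minus * (qb_k - qa_k)) / denom
        for fa_k, fb_k, qa_k, qb_k in zip(fa, fb, qa, qb)
    )


# Identical states on both sides: the numerical flux reduces to f(Q).
q_rest = (1.0, 0.0, 2.5)
flux_rest = hlle_flux(q_rest, q_rest)
```

Two properties are easy to check with this sketch: for identical left and right states the result equals the physical flux, and for a supersonic flow from A to B ($b^- = 0$) the result is the pure upwind flux $f(Q_{na})$.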
HLLEW Scheme
Obayashi and Wada proposed a modified HLLE scheme, called the HLLEW (Harten-Lax-van Leer-Einfeldt-Wada) scheme, that satisfies the positively conservative condition [11]. The numerical flux can be calculated using the following equations:
The coefficients $\delta_2$ and $\delta_3$ are obtained using the Roe average in Eq. 5. As with the Roe and HLLE schemes, $F_n$ is evaluated as viewed from cell A.
Design and Implementation on an FPGA
A Xilinx Virtex-6 FPGA (XC6VLX240T-1FF1156), which supports partial reconfiguration, is chosen as the target device. In a large software package, a subroutine itself is not always appropriate as a target of partial reconfiguration. For example, all three schemes treated here include the Roe average calculation, so it must be implemented as a static module. Before the design, the target is restructured so that the static module and the partially reconfigurable modules are appropriately separated. An overview of the system is shown in Figure 3. At the beginning, the system initializes and updates the mesh size for each grid data and face index. Then, it calculates the Roe average values in the Roe average module, since these values are needed by all schemes. The inputs of the Roe average module are the density $\rho$, the velocity $u$ and the total enthalpy per unit mass $H$. The result of the Roe average module is fed directly to the flux calculation module. Meanwhile, the cell numbers and indices from the grid data file are also fed to the flux calculation module.
After that, in the reconfigurable module, users can choose which scheme they want to use. Partial reconfigurability of the FPGA and intractability of the bitstream meet the requirements of this system. Each scheme module has the same inputs and outputs, as shown in Figure 4; thus, it can be specified in the HDL description of the top module as a functional module with the reconfigurable partition attribute. Multiple instances corresponding to the schemes are defined for this single functional module. Software tools such as NGDBuild, MAP and PAR detect the reconfigurable partition attribute on the instance and process it correctly. RAM is allocated in each scheme to store variable values during calculation. It is built with Block RAMs, in which data are stored temporarily.
A programmable input/output (I/O) module is designed to control access to memory. The result of the flux calculation module is stored in memory. We implement a simple dual-port RAM for each adjacent cell, cell A and cell B, shown as memory A and memory B. The Block RAM used is the 36 Kb block, RAMB36E, configured in simple dual-port RAM mode. The read/write data width is set to 64 bits. Read and write operations are performed in parallel with the flux calculation module. The summation of all cell flux values gives the total flux. The floating-point operator is an IP core for handling floating-point operations, and it is configurable according to user specifications. Xilinx CORE Generator is used to generate the cores for the floating-point arithmetic units. To generate high-performance computation units, the level of DSP48E usage is set to the maximum. To demonstrate that our system works on a real FPGA, a Xilinx ML605 board is used with a 200 MHz operating frequency. Finally, Xilinx iMPACT software is used to program the FPGA. A summary of the implementation environment is shown in Table 2.
Only one scheme is deployed in the FPGA at any one time. The top, static and reconfigurable modules consist of many arithmetic functions. The parameters used for each computing unit are shown in Table 3. For high-performance computation, the adder and subtractor are set to 14 clock cycles per operation using high-speed mode. In addition, the multiplier takes 16 clock cycles and uses 11 DSP48E slices. The latency of the square root and divider units is set to 57 clock cycles. Although it is possible to decrease the pipeline latency of the square root and divider, doing so would severely degrade the clock frequency. The comparator unit has the shortest latency; it requires only one clock cycle.
We used a bottom-up synthesis technique to synthesize the design module by module. This technique requires that a separate netlist be written for the reconfigurable partition, ensuring that each portion of the design is synthesized independently. The top and static modules are synthesized with a black box for the reconfigurable partition. In this case, the flux calculation scheme module is defined as a black box in the top module synthesis. The Roe, HLLE and HLLEW scheme modules are synthesized beforehand to provide the required netlists. The next crucial step is manual floorplanning of the reconfigurable partition, which requires knowledge of the physical architecture of the FPGA and an understanding of how to floorplan for optimal performance and area. The challenge is how to pack the large flux calculation schemes into a single partition. The partition boundary is defined so that the inserted proxy logic and the extra wiring cost do not degrade the total performance. Although irregularly shaped partitions such as T or L shapes are allowed, placement and routing in such regions sometimes degrade performance because of the shortage of routing resources and long wires. Therefore, we chose a rectangular region for the flux calculation scheme, as shown in Figure 5, whose size is adequate for all three schemes. Figure 5 shows the floorplan of the system when the HLLE scheme is deployed. It is important that every reconfigurable module has enough resources to fit in the partition when its bitstream is loaded. After that, timing constraint entry and design rule checks (DRC) are performed.
Roe Average Module
As shown in Section 4, all schemes need the Roe average values before the flux computation is processed. Therefore, the Roe average module is made a static module, since it is used in all cases. In this module, managing parallelism is an important issue. The FaSTAR source code for this calculation is written in Fortran 90 and is executed sequentially from top to bottom. The advantage of sequential operation is that it uses resources efficiently, whereas parallelism can reduce the time needed to obtain the Roe average values at the expense of additional hardware resources. The idea behind control parallelism is that the statements used to compute $u_{ave}$ (UAV), $v_{ave}$ (VAV), $w_{ave}$ (WAV) and $H_{ave}$ (HAV) can be performed simultaneously while still producing the correct answer. The scheduled data flow graph shown in Figure 6 represents the data dependencies between operations. After the SQRT unit outputs the value of RAT, all multiplication operations are executed in parallel. At this stage, $\rho_{ave}$ and RAV are obtained. Then, after the RATI value is obtained, UAV, VAV, WAV and HAV can all be computed in parallel. However, CAV cannot begin until QA2 and CA2 finish. Computing RAT at the beginning takes a large number of clock cycles, since the square root and divider units each require a large number of clock cycles. Therefore, we implement a pipelined datapath to address this issue. The pipelined datapath implemented for the Roe average module is shown in Figure 7. Registers are inserted at the dotted lines to create a single-stage pipeline.
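The dependency structure described above can be sketched in Python. This is not the FaSTAR Fortran source: the variable names RAT, RATI, RAV, UAV, VAV, WAV, HAV, QA2, CA2 and CAV are taken from the text, but the formulas assigned to them are an assumption based on the standard Roe average, and `gamma` is assumed to be the ratio of specific heats.

```python
import math


def roe_average(rho_a, u_a, v_a, w_a, h_a,
                rho_b, u_b, v_b, w_b, h_b, gamma=1.4):
    """Roe-averaged state between cell A and cell B (assumed standard form)."""
    # Stage 1: divider followed by square root, the long-latency head.
    rat = math.sqrt(rho_b / rho_a)
    # Stage 2: operations that depend only on RAT and can run side by side.
    rav = rat * rho_a              # equals sqrt(rho_a * rho_b)
    rati = 1.0 / (1.0 + rat)
    # Stage 3: four mutually independent weighted averages.
    uav = (u_a + rat * u_b) * rati
    vav = (v_a + rat * v_b) * rati
    wav = (w_a + rat * w_b) * rati
    hav = (h_a + rat * h_b) * rati
    # Stage 4: CAV must wait for QA2 and then CA2.
    qa2 = 0.5 * (uav * uav + vav * vav + wav * wav)
    ca2 = (gamma - 1.0) * (hav - qa2)
    cav = math.sqrt(ca2)
    return rav, uav, vav, wav, hav, cav


rav, uav, vav, wav, hav, cav = roe_average(
    1.0, 0.0, 0.0, 0.0, 3.5,
    4.0, 3.0, 0.0, 0.0, 6.5,
)
```

The statements within each stage share no data dependencies, which is what allows a hardware datapath to evaluate them concurrently, while the stage boundaries mark where registers could be placed.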
Roe Scheme Module
The Roe scheme module is implemented in the reconfigurable partition as a reconfigurable module. This module is designed and synthesized separately from the top module. The Roe calculation scheme involves five steps to obtain the result:
1. Compute matrix $R$.
2. Compute matrix $R^{-1}$.
3. Compute matrix $|\Lambda|$.
4. Compute Jacobian matrix $|A|$.
5. Compute Roe's numerical flux.
The main arithmetic operation of this module is a 5 × 5 matrix multiplication, as shown in Eq. 4. In general, the standard matrix multiplication $C = A \times B$ is defined as $C_{ij} = \sum_{k=1}^{N} A_{ik} B_{kj}$, where $A$, $B$ and $C$ are $M \times N$, $N \times R$ and $M \times R$ matrices, respectively. Since all matrices are 5 × 5, the pseudo code for the matrix multiplication is shown in Algorithm 1. However, obtaining the Jacobian matrix requires two matrix multiplications, which consumes a lot of resources.
However, the matrices $R$, $R^{-1}$ and $|\Lambda|$ are computed in parallel. Then, matrix $R$ is multiplied by matrix $|\Lambda|$, and the result is multiplied by matrix $R^{-1}$ to obtain the Jacobian matrix $|A|$. Finally, the Jacobian matrix is used to compute the numerical flux.
We implemented a MAC (Multiplication and Accumulation) unit structure that closely couples the multiplication and the accumulation, as shown in Figure 8. The multiplier receives the elements of $A$ and $B$ in a data-driven manner; that is, whenever both operands are available, they enter the pipeline. After the multiplication, the result is stored in a FIFO and the load address is generated. The next multiplication result is added to the data prefetched from the FIFO, and the accumulated results are stored in temporary memory. This operation is repeated until the calculation finishes.
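The two 5 × 5 matrix products and the multiply-accumulate pattern can be sketched in software as follows. This is a plain Python illustration, assuming the usual decomposition $|A| = R\,|\Lambda|\,R^{-1}$; the function names are hypothetical, and the sketch does not model the FIFO or the pipeline timing of the hardware unit.

```python
N = 5  # all matrices in the Roe scheme module are 5 x 5


def matmul5(a, b):
    """C = A x B for 5 x 5 matrices stored as lists of rows."""
    c = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for k in range(N):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate step
            c[i][j] = acc
    return c


def abs_jacobian(r, lam_abs, r_inv):
    """|A| = R |Lambda| R^-1, which needs two matrix multiplications."""
    return matmul5(matmul5(r, lam_abs), r_inv)


def identity():
    return [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]


# Small check case: R differs from the identity by one off-diagonal entry.
r = identity()
r[0][1] = 1.0
r_inv = identity()
r_inv[0][1] = -1.0
lam_abs = [[float(i + 1) if i == j else 0.0 for j in range(N)] for i in range(N)]
a_abs = abs_jacobian(r, lam_abs, r_inv)
```

Each output element needs five multiply-accumulate steps, so one 5 × 5 product takes 125 of them, and the Jacobian takes 250. This count is the reason a dedicated MAC structure pays off.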
HLLE Scheme Module
The HLLE scheme module is rather straightforward compared with the Roe scheme module. Like the Roe scheme module, it is defined as a reconfigurable module. Therefore, it is designed and synthesized separately from the top module to produce the required netlist. The HLLE calculation scheme requires two steps to obtain the final result:
1. Compute HLLE coefficients and eigenvalues. 
HLLEW Scheme Module
The HLLEW scheme is a modification of the HLLE scheme and, like the HLLE scheme, it is also based on the Roe scheme. However, computing the HLLEW flux does not require the matrix computation used in the Roe scheme. The HLLEW scheme is also implemented as a reconfigurable module in the reconfigurable partition. Therefore, it is synthesized separately beforehand to provide the required netlist to the top module. Computation of the HLLEW flux requires the following steps:
1. Compute coefficients and eigenvalues. 
Implementation Issues
Our challenge is to implement a large-scale scientific computation using partial reconfiguration. This requires careful attention to design requirements: the design specification must be analyzed thoroughly, and the limitations associated with partially reconfigurable designs must be taken into account. In addition, several difficulties must be resolved to improve the quality of the design. The implementation challenges and their solutions are listed as follows:
I/O in each scheme module
The flux calculation schemes must include the I/O circuitry, that is, the input buffers (IBUF) and output buffers (OBUF) required to connect internal logic to package pins. In other words, the I/O features must be completely contained within the scheme module, while the port list for the complete design remains in the top-level design description. In addition, the limited number of I/O pins of the FPGA must be considered, since the flux calculation module requires many I/O signals.
DSP blocks in each scheme module
It is also important that the selected physical region has adequate resources, especially DSP48E slices, for all schemes. The flux calculation schemes require many DSP blocks to perform the computation. Therefore, we set the partition so that the outermost columns at both ends of the reconfigurable partition are DSP block columns rather than slice or block RAM columns. With this strategy, we can maximize the number of DSP blocks in the partition.
Interaction with CORE Generator
Since we used CORE Generator to generate all computing units, netlist-based cores were created to be instantiated in the design. To make sure these cores can be instantiated easily, the boundaries of the flux calculation scheme partition are not modified. We also took care in defining the flux calculation scheme region to ensure that the proper elements are contained within it.
Optimization
To optimize the design time for bit file generation, the most complicated and most resource-consuming design should be implemented first. Here, the Roe scheme is done first, HLLEW follows, and HLLE is done last.
Evaluation
To demonstrate the effectiveness of our design, we used sample data for the NACA 0012 airfoil. The National Advisory Committee for Aeronautics (NACA) developed the NACA airfoil shapes for aircraft wings. Grid data consisting of 11,564 grids with 22,883 faces are used in our study. We evaluated the resources used, the configuration speed for full and partial reconfiguration, and the system performance.
Resource Utilization
The required numbers of slice registers, slice LUTs, DSP48E slices and Block RAMs are evaluated when the design is synthesized. Three design options are considered for implementation. The resource utilization results are shown in Figure 11. The first option is to fit all modules in a single FPGA, shown in the first column, labeled "Full". The main advantage of this method is that it requires configuration only once and no reconfiguration is needed. However, it requires more hardware than a single FPGA provides. The total utilization of registers and LUTs exceeds 100%: the registers and LUTs required are 119% and 142%, respectively. The second strategy is to implement only one scheme in each design, so that three schemes yield three different designs. This is shown in the second, third and fourth columns, labeled "Roe Full", "HLLEW Full" and "HLLE Full". Although there are enough resources to do this, two disadvantages arise. First, it requires full reconfiguration two times if a user wants to change from one scheme to another. Second, resources are wasted, since the same design is implemented again in every variant except for the flux scheme module.
The third option, which is our proposed method, is to utilize partial reconfiguration. This is shown in the fifth column, labeled "Partial". The top, static and reconfigurable modules are fixed in a single FPGA. The bitstreams of the flux calculation schemes are stored in the host PC. When users want to use a particular scheme, it is loaded into the FPGA. The resources consumed when no scheme is loaded are small. The resources consumed when the system is in use with one scheme loaded are the same as in the second option. However, another advantage of this technique is the configuration time: changing from one scheme to another is faster than with full reconfiguration.
We examined the resource utilization of each design option. We found that all modules cannot be implemented together in a single Virtex-6 XC6VLX240T-1FF1156 FPGA, since the available resources are not enough to accommodate them. For the second design option, the Roe, HLLEW and HLLE schemes are implemented separately. Although there are enough resources for this, every design requires full reconfiguration to be loaded into the FPGA. In the partially reconfigurable design, the bitstreams of the Roe, HLLEW and HLLE schemes are stored in the host PC. The resource reduction is measured with the Roe scheme deployed, since it requires the most resources. On average, the resources consumed by the "Full" design are 98.25%, while those consumed by the partially reconfigurable design with the Roe scheme deployed are 56.25%. As a result, resource utilization is reduced by 42% on average.
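As a worked check of the reported saving, using only the two averages quoted above, the difference can be read either in percentage points or relative to the "Full" design. Both readings give roughly the same figure.

```latex
\begin{align}
\Delta_{\text{abs}} &= 98.25\% - 56.25\% = 42.00 \text{ percentage points}, \\
\Delta_{\text{rel}} &= \frac{98.25 - 56.25}{98.25} \approx 0.4275 \quad (\approx 42.7\%).
\end{align}
```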
Although the partially reconfigurable design fits in a single FPGA, it carries a resource overhead. This is because all scheme modules are implemented in the same reconfigurable partition. The partition is manually floorplanned, and the allocated resources are fixed so that every module fits. Since the Roe scheme occupies more resources than the HLLEW and HLLE schemes, part of the reconfigurable partition is unused when the HLLE or HLLEW scheme is loaded. However, these unused resources are used again when the Roe scheme is selected.
Configuration Time
The configuration times for full reconfiguration and partial reconfiguration are compared. The speed of configuration is directly related to the size of the partial bit file and the bandwidth of the configuration port. Since we use the JTAG configuration port, the configuration time for a Virtex-6 device [17] is given by:
where bits in bitstream is the size of the configuration bitstream in bits and TCK frequency is the maximum configuration TCK (Test Clock) frequency, which is also used for boundary-scan operations. The constant 2044 is the total number of clocks needed for pre-processing and post-processing in a single-device configuration sequence while programming the bitstream into the FPGA. Although the maximum bandwidth available is 66 Mbps, we found that when configuring the FPGA using iMPACT, the bandwidth used is 16.7 Mbps and the data width is 1 bit. For full reconfiguration, the total bitstream size is 9,017 KB. Based on the given formula, the configuration time is 4.423 sec. On the other hand, the bitstream size for each of the Roe, HLLE and HLLEW schemes is 1,422 KB. This means that the configuration time for partial reconfiguration is 0.704 sec. In short, partial reconfiguration speeds up configuration by 6.28 times.
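The calculation can be sketched in Python. The formula form used here, (bits in bitstream + 2044) / TCK frequency, is an assumption reconstructed from the description of the terms, and 1 KB is assumed to be 1024 bytes. The bitstream sizes, the 16.7 Mbps rate and the two reported times are taken from the text; the function and constant names are hypothetical.

```python
OVERHEAD_CLOCKS = 2044   # pre/post-processing clocks quoted in the text
TCK_HZ = 16.7e6          # effective rate observed with iMPACT, 1 bit per clock
BITS_PER_KB = 1024 * 8   # assumption: 1 KB = 1024 bytes


def config_time(size_kb, tck_hz=TCK_HZ):
    """Estimated JTAG configuration time in seconds (assumed formula)."""
    bits = size_kb * BITS_PER_KB
    return (bits + OVERHEAD_CLOCKS) / tck_hz


full_s = config_time(9017)      # full bitstream
partial_s = config_time(1422)   # one scheme's partial bitstream

# Speed-up computed from the two times reported in the text.
speedup_reported = 4.423 / 0.704
```

Under these assumptions the full bitstream evaluates to about 4.423 sec, matching the reported value, and the partial bitstream to roughly 0.70 sec, close to the reported 0.704 sec. The ratio of the two reported times rounds to 6.28.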
Partial reconfiguration adds no overhead during the computation, since users must decide which scheme they want to use before the calculation starts. Therefore, the configuration time does not affect the computation time on the FPGA. In addition, the configuration time does not become a bottleneck when the grid size grows large and many iterations are needed, because the bitstream size of every flux scheme is fixed and does not depend on the input size.
Performance
The flux calculation schemes are implemented and the total clock cycles are measured. The total clock cycles for the Roe, HLLEW and HLLE schemes are $205600 \times 10^{3}$, $197400 \times 10^{3}$ and $191200 \times 10^{3}$, respectively. The execution time of the flux calculation schemes in software is compared with the execution time in hardware. A summary of the results is shown in Figure 12.
In software, all schemes are executed on a Core 2 Duo at 2.4 GHz running Linux kernel 2.6.18. All schemes are compiled with Intel Fortran Compiler 10.1. The execution time is measured with the system_clock subroutine (call system_clock) provided by Fortran 90. We found that the execution time is 5.400 sec for the Roe scheme, 4.533 sec for the HLLEW scheme and 4.399 sec for the HLLE scheme. This is shown in the first column of Figure 12, labeled "Software".
In hardware, we know the total clock cycles required from beginning to end, and the operating frequency of the FPGA is 200 MHz. Therefore, the computation time on the FPGA is 1.028 sec for the Roe scheme, 0.987 sec for the HLLEW scheme and 0.956 sec for the HLLE scheme. The execution time in hardware is the sum of the configuration time and the computation time. The second column of Figure 12, labeled "FPGA FR", shows the execution time on the FPGA if the second design option with full reconfiguration (FR) is deployed. Adding the 4.423 sec full configuration time to each computation time gives execution times for the Roe, HLLEW and HLLE schemes of 5.451 sec, 5.410 sec and 5.379 sec, respectively. The third column of Figure 12, labeled "FPGA PR", shows the execution time on the FPGA if partial reconfiguration (PR) is used. Adding the 0.704 sec configuration time to each computation time gives execution times for the Roe, HLLEW and HLLE schemes of 1.732 sec, 1.691 sec and 1.660 sec, respectively. The speed-up of hardware over software is evaluated for the HLLE scheme in the partially reconfigurable design, since HLLE is the fastest scheme in software; in this case, the FPGA is 2.65 times faster than the software execution. The full reconfiguration design strategy gives almost the same performance as software for each scheme. In fact, the software execution times for the HLLE and HLLEW schemes are shorter than those of the FPGA. The partially reconfigurable design, however, gives a 2.65-fold speed-up over software execution. Therefore, taking the configuration time into account, the performance improvement obtained with partial reconfiguration is justified.
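The arithmetic behind these figures can be reproduced with a short Python sketch. All numeric inputs are taken from the text (cycle counts, 200 MHz clock, configuration times and software times); the dictionary and function names are hypothetical.

```python
F_CLK_HZ = 200e6  # FPGA operating frequency

CYCLES = {"Roe": 205_600_000, "HLLEW": 197_400_000, "HLLE": 191_200_000}
SOFTWARE_S = {"Roe": 5.400, "HLLEW": 4.533, "HLLE": 4.399}
CONFIG_FULL_S = 4.423      # full reconfiguration time
CONFIG_PARTIAL_S = 0.704   # partial reconfiguration time


def summarize(scheme):
    """Return compute time, FR and PR execution times, and PR speed-up."""
    compute_s = CYCLES[scheme] / F_CLK_HZ
    fr_s = compute_s + CONFIG_FULL_S
    pr_s = compute_s + CONFIG_PARTIAL_S
    return {
        "compute_s": compute_s,
        "fpga_fr_s": fr_s,
        "fpga_pr_s": pr_s,
        "speedup_pr": SOFTWARE_S[scheme] / pr_s,
    }


results = {name: summarize(name) for name in CYCLES}
for name, row in results.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```

The HLLE row gives the smallest speed-up with partial reconfiguration, which is why 2.65 is quoted as the conservative figure; the other two schemes come out slightly higher under the same arithmetic.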
In this implementation, three flux calculation schemes are available for selection. As discussed in Section 6.1, it is possible to design the three schemes separately. However, the proposed design has the following three advantages. First, a practical software package has various parts, each of which selectively uses one of several functions. If there are $M$ different parts, the $i$-th of which has $N_i$ functions, the total number of subset designs becomes $\prod_{i=1}^{M} N_i$. By using partial reconfiguration, users can easily select their preferred combination on demand. Second, when we want to add another scheme to the implementation, it is easy to add another reconfigurable module instead of modifying the whole design. Finally, in the advection term computation, no single scheme is the obvious choice for every flow problem. In other words, a flux calculation scheme must be stable and provide solutions under various flow conditions, and each scheme has its advantages and disadvantages. Therefore, being able to try one scheme after another quickly helps users reach the best conclusion, and reducing the switching time by using partial reconfiguration is helpful.
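The growth in the number of designs can be illustrated with a small Python sketch. The option counts in the example are hypothetical, and the partial reconfiguration count assumes one reconfigurable partition per part with one partial bitstream per selectable function, which goes beyond the single partition implemented in this work.

```python
import math


def full_design_count(options_per_part):
    """Number of complete designs needed to cover every combination."""
    return math.prod(options_per_part)


def partial_bitstream_count(options_per_part):
    """Number of partial bitstreams if each part is its own partition (assumed)."""
    return sum(options_per_part)


# Hypothetical package with three parts offering 3, 2 and 4 choices.
options = [3, 2, 4]
n_full = full_design_count(options)
n_partial = partial_bitstream_count(options)
```

For this hypothetical case, covering every combination takes 24 complete designs but only 9 partial bitstreams, and the gap widens as parts are added because the first count is a product and the second a sum.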
Future Work
In this work, our target is only the flux calculation scheme in the advection term computation. Many other routines in FaSTAR therefore remain to be explored. We should extend the partial reconfiguration technique more aggressively to other modules available in FaSTAR. Using multiple reconfigurable partitions in a single FPGA, instead of a single partition, is a promising design strategy for future research. However, this work shows that a single FPGA can fit only one subroutine of the advection term computation.
Given the limitations of a single FPGA, a multi-FPGA platform with multiple reconfigurable regions can be the next target for future work. Such a platform has the potential to greatly increase the available resource capacity, so that many more modules in FaSTAR, such as flux evaluation, the flux limiter function or convergence acceleration, can be made reconfigurable. This approach is also promising because the computationally intensive part of FaSTAR must be performed iteratively and can be parallelized for high performance using multiple FPGAs. If this target is realized, it will be possible to implement the whole FaSTAR package on FPGAs. However, much work remains: the interconnection between FPGAs, data transfer and synchronization are issues that must be solved before a successful implementation can be achieved.
Conclusion
In this study, the efficient use of the partial reconfigurability of recent FPGAs is explored. The advection term computation is chosen as the target, and the flux calculation scheme is deployed as a reconfigurable module. Three flux calculation schemes are implemented: the Roe, HLLE and HLLEW schemes.
The implementation using partial reconfiguration successfully reduces the required hardware resources and improves both configuration time and performance. Resource utilization is reduced by 42% on average. The proposed design also speeds up configuration by 6.28 times and accelerates the system by at least 2.65 times.
