Accurate simulations of various physical processes on digital computers requires huge computing performance, therefore accelerating these scientific and engineering applications has a great importance. Density of programmable logic devices doubles in every 18 months according to Moore's Law. On the recent devices around one hundred double precision floating-point adders and multipliers can be implemented. In the paper an FPGA based framework is described to efficiently utilize this huge computing power to accelerate simulation of complex physical spatiotemporal phenomena. Simulating complicated geometries requires unstructured spatial discretization which results in irregular memory access patterns severely limiting computing performance.
working in parallel providing 90 times speedup compared to a high performance microprocessor core.
Introduction
Numerical simulation of complex problems evolving in time plays an important role in scientific and engineering applications. Accurate behavior of dynamical systems can be understood using large scale simulations which traditionally requires expensive supercomputing facilities. Several previous studies proved the efficiency of programmable logic devices in numerical simulation of various physical phenomena such as electromagnetic [1] , transient wave [2] [3] and computational fluid dynamics [4] [5] simulations.
A complex spatio-temporal problem can be approximated in a regular mesh structure. To get a more accurate solution the the resolution of the mesh should be increased, which increases the number of grid points and the running time as well. To overcome the outlined problem an unstructured mesh will be used. The resolution of the mesh will be increased where it is required by rapid change in the dynamics or shape of the problem. Conventional microprocessors has around 10% utilization during unstructured mesh computations due to the irregular memory access pattern of grid data.
To accelerate the computation irregular memory access patterns should be hidden by temporally storing the relevant grid points on the FPGA. Due to the finite on-chip memory size mesh points have to be ordered and stored according to the requirements of computation. A very efficient new mesh point ordering method was developed.
To obtain a high performance accelerator the operating frequency of the arithmetic unit is critical.
By using local controls to clusters of arithmetic operators the clock speed can be significantly increased paying with some area increase for it. The data transfer between the clusters are synchronized by FIFOs. It was proofed that a good tradeoff can be achieved if a cluster has about 10 I/O FIFOs.
To define the clusters several partitioning algorithms [6] [7] were carefully tested and compared. The placement and the routing of the clusters however were not as good as it should be. Especially due to the long interconnections the clock frequency could not increased further. To eliminate this kind of problem a new combined the clustering and placement method will be shown and tested in a complex fluid dynamical problem.
In Section 2. an overview of recent publications on accelerators working on unstructured meshes are given. A new architecture for efficient unstructured mesh computations is proposed in Section 3. A new node reordering algorithm is presented in Section 4. and a new partitioning algorithm is descried in Section 5. The proposed methods are tested on a complex case study described in Section 6. Finally performance of the proposed architecture is compared to a high performance microprocessor in Section 7.
Related work
Several papers are published in the early 2000s dealing with the acceleration of PDEs on unstructured meshes. Most of them are focused on accelerating Finite Element Methods (FEM) where the global stiffness matrix is given and the resulting large linear system of equations are solved usually by the iterative Conjugate Gradient (CG) method. The most time consuming operation of this method is a sparse matrix vector multiplication therefore most of the papers try to accelerate this part. Though our architecture is designed for explicit unstructured finite volume calculations examination of these architectures is helpful because similar problems arising in our case such as the adjacency matrix of the mesh is sparse and elements of the state vector are accessed in a random order.
In 2000 Jones and Ramachandran [8] examined several aspects of accelerating unstructured mesh computations on FPGAs. They proposed a hybrid architecture to accelerate the CG algorithm where the local stiffness matrix and the bulk of the CG algorithm is computed by the CPU and only the matrix vector multiplication is performed on the FPGA and achieved 22.4 − 35.7MFLOPs computing performance.
Another approach is the architecture proposed by deLorimier and DeHon [9] where all elements of the matrix are stored on the FPGA to avoid bandwidth limitations but this solution severely limited the size of the matrix. Performance of the architecture is depended on the structure of the matrix and usually 66% of the peak performance can be achieved which results in 2-10 times acceleration compared to the common microprocessors at that time.
Elkurdi et al. [10] first reorganize the finite element sparse matrix into a banded form then calculate the matrix vector multiplication along special pipelineable diagonal stripes, where two successive elements can be processed by a pipelined architecture. Performance of the architecture is determined by the available memory bandwidth and the sparsity of the matrix however utilization of the processing elements is varying in a very wide range between 17.74 − 86.24%. duBois et al. [11] presented an architecture where nonzero elements from each row of the sparse matrix are processed in 7 element wide vectors. They also proposed to use a reordered banded matrix to improve data locality, but the architecture still suffer from memory bandwidth limitation.
Recently Nagar et al. [12] proposed an architecture using an optimized Streaming Multiply-Accu-mulator with separate cache memories for matrix and vector data. The implementation platform they used has special memory architecture providing high 20GB/s peak memory bandwidth. Performance of the system with four FPGAs is in the 1.17 − 3.94GFLOPs range outperforming a Tesla 1070 GPU.
However utilization of the PEs is around 50% similarly to the previous architectures and increasing the number of PEs to occupy the entire FPGA still runs into a memory bandwidth limit.
The surveyed architectures provide general solutions to accelerate FEM calculations on FPGAs but suffer from the inherent high memory bandwidth requirement and small communication to computation ratio of sparse matrix vector multiplication. On the other hand utilization of the execution units depends on the structure of the sparse matrix.
In the case of finite volume discretization irregular memory access pattern and high memory bandwidth requirement can be eliminated by storing a small part of the grid on-chip and reordering its nodes.
Right hand side of the discretized equations should be computed for each node in every timestep, which requires several floating-point operations, resulting in better communication to computation ratio and higher utilization of FPGA resources.
Architecture
Time evolution of dynamical systems can be easily solved numerically by explicit finite volume discretization and the solution is computed on a discrete set of points in space. The solution is advanced in time by approximating the derivative of each node using linear combination of state values from its small local neighborhood. Computation of subsequent grid points is independent however some parts of its input data sets are overlapped. For practical applications the number of grid points is far exceeding the available on-chip memory of recent FPGAs therefore state and constant values must be stored and loaded from an off-chip memory. Overlapping input data sets can be utilized to reduce the number of memory accesses and save memory bandwidth by saving all nodes which required during the following computations into a local memory on the FPGA. Simple example for the values to be stored is shown in Figure 1 .
Each node is numbered and the nodes are updated in an increasing order, bold numbers indicate the already processed nodes, encircled node 8 is the currently processed vertex, while squared nodes are stored on the FPGA. In this case all elements required for the computation of node 8 can be read from the fast on-chip memory of the FPGA. To process the next node a new node 15 should be loaded and node 2 can be discarded. When updating node 9 all the required data will be available in the on-chip memory. It is possible that multiple new nodes are required for the update of a node Node data is treated as a 1D array where each element contains two separate parts one for storing time dependent state values and one to store constant data such as physical constants and coordinates of the grid points. In case of structured grids neighbors of each grid point can be determined by its array index. For unstructured grids connected grid points are described by a sparse adjacency matrix which is usually stored in a Compressed Row Storage (CRS) format [13] . In our case the matrix contains only 0 and 1 elements therefore only column indices are stored. Additionally the elements are read in a serial sequence hence row pointers can be replaced by a single bit to indicate the start of a new row. It is possible that the discretization stencil is defined on triangle or tetrahedron instead of a line between two grid points. In this case an additional element descriptor is required where node indices of the vertexes of the element are stored. Data structures used by the system are shown in Figure 2 . Additionally an example data structure is filled with connectivity and element data for node 8 from Figure 1 . Width of the index field can be set according to the requirements of the application.
More than 8 million nodes can be handled by using 24 bit wide indices which can be sufficient in many applications. Neighbors of the currently processed nodes are stored in the Neighborhood memory unit. Its structure is depending on the particular discretization stencil. In the simplest case when the stencil is defined on a line it can be a simple register. When the stencil is defined on a triangle or tetrahedron a one write two or three read multi ported memory is required respectively. Size of this memory is depending on the largest node rank in the mesh, usually a 64 element memory is sufficient, which can be efficiently implemented using the Distributed RAMs on FPGAs. Computation of the updated node value is carried out in an element by element order by the arithmetic unit.
Computation is started by loading node values into the Memory unit until it is half filled. In this phase value of the first node is loaded to the Current node register and the Neighborhood memory is filled by its neighbors using the incoming connectivity descriptors. Global indices of the neighboring nodes are translated into addresses in the Memory unit by the Local address generator unit. When all neighbors are loaded valid stencil data can be send to the arithmetic unit in each clock cycle.
Neighborhood of the second node can be loaded into the free space of the Neighborhood memory while processing the first node. When all parts of the first stencil are sent to the arithmetic unit Node addressA register is incremented and the second node is loaded into the Current node register. During the next clock cycle new node data can be written into the Memory unit and computation of the The main advantage of the proposed architecture is the serial off-chip memory access pattern on the node data and descriptor arrays. Each array is read into a FIFO buffer by using an optimal burst length and penalties of random memory access patterns can be eliminated. Maximum size of the mesh is limited by the bandwidth of the adjacency matrix. Using today large, high performance FPGAs even 10,000-40,000 nodes can be stored on-chip and depending on the structure of the mesh its size can be in the 100,000-400,000 node range.
Memory Bandwidth Optimization
Efficient use of on-chip memory resources of the architecture described in the previous section is depending on the bandwidth of the adjacency matrix of the mesh. By reordering the nodes of the mesh the bandwidth of the adjacency matrix can be significantly reduced. In this section we show a fast constructive method for reordering, which is comparable to the classical algorithms, and we present a solution for generating memory access patterns which have lower bandwidth than a given bound. 
Description of the problems and related works
, and the bandwidth corresponds to G with labeling f() is
The problem of bandwidth reduction of a sparse matrix was shown to be NP-complete by Papadimitriou [14] , so exact methods can not be applied for large problems. Many heuristic algorithm was developed from the well-known Cuthill-McKee (CM) algorithm [15] to recent metaheuristic approaches. One of the most promising metaheuristic results in solution quality is the GRASP (Greedy Randomized Adaptive Search Procedure) with Path Relinking, which was presented in [16] . The most practical solution is the GPS(Gibbs, Poole and Stockmeyer) method [17] , which is one of the fastest heuristics, and also provides good solution quality. Metaheuristic methods gives better solutions, but their time consumption many times higher than GPS. GPS algorithm was born in 1976, afterwards many attempts shown to improve the original method, the most interesting from them is the method created by Luo [18] . All of the improved GPS heuristics have higher time complexity, which is an important parameter in case of large meshes.
Serial-Bandwidth:
Using the classical definition of bandwidth, the size of the on-chip memory can be given as (B f (G) * 2 + 1) * sizeof (node), where (B f (G) * 2 + 1) called C BW. If we assume the architecture described in previous section, a more proper definition than C BW can be given. S BW means the number of nodes which have to be stored in on-chip memory, if we want 0% cache-miss.
Algorithm for bandwidth reduction
Several methods have been shown in the literature for minimizing the classical bandwidth of a system.
In this section we define Amoeba1(AM1) algorithm for serial-bandwidth minimization. Our goal is to create a fast, effective constructive method which has proper, easy to calculate S BW bounds in each construction step(details in next subsection).
Notations and definitions
Amoeba1 is a constructive method, in which a solution element is chosen and labeled in each step.
Solution elements are the vertices of the input mesh, and the method grows a part till all of the vertices are covered. s(i): is the distance between node(i) and its lowest indexed neighbor in the part:
u(i): is the set of nodes which uncovered by P, but must be added in later steps because of node(i):
I: is the index of the first elemet which has not empty u() set, so for every node(i) where i < I all neighbors covered by P.
With these parameters we can give bounds on the serial bandwidth. In AM1 method we use a simple lower bound for describing the importance of node(i):
This is obviously a lower bound, because if we add node v / ∈ u(i) to part P we still have to add all elements of u(i) to the part. For every node(i) i < I imp(i)=0, because these nodes have all of their neighbors involved, so their effect on bandwidth do not depend on the later decisions.
Description of Amoeba1
Amoeba1 algorithm has two base steps: finding a starting vertex, and the labeling loop. The result is an ordering of the vertices.
Finding a starting vertex The quality of the result of constructive bandwidth-reduction heuristics depends on which is the starting vertex. In GPS method, the authors presented a simple and effective solution for this problem. They gave an algorithm which returns the two endpoints of a pseudodiameter. AM1 algorithm uses this subroutine for finding the starting vertex.
Choosing a solution element AM1 selects a node from u(I), which has a neighbor in P with maximal importance. Because all nodes in u(I) has node(I) as its neighbor, only l = node(I) neighbors take part in the search. 
Results and conclusions
AM1 is a simple constructive algorithm for large problems, we compare its results to the classical fast and effective GPS method. As mentioned earlier, better quality algorithms exist for bandwidth reduction, but these methods can not be applied to large meshes(¿100.000 vertex) because of their complexity. The cases showed on Table 1 comes from 2-dimmensional meshes, with assigning a vertex to each triangle and an edge between vertices which are represent adjacent triangles, so we get a mesh with maximal degree = 3. This meshes appears when we use finite volume solver during the solution of a partial differential equation. In these low-degree cases AM1 provides similar solution quality to GPS, in 4% less time. The running time of both method depends on the number of vertices, and the structure of the mesh(finding a starting vertex).
The results for high-degree(20-30) cases can be found on Table 2 . These cases generated from the same complex 3D geometry, by increasing the density of the mesh. We found that GPS is 29% superior on these general instances, but 13% slower than AM1. These results shows, that the difference is not increasing with the complexity of the problems. In our further work, we want to increase solution quality by guiding the labeling depending on the knowledge of the geometry. 
S BW: serial-bandwidth of solutions, N: number of vertices.
Algorithms tested on one core of an Intel P8400 processor 
Memory Access Optimization for Bounded Bandwidth
In case of large problems it is possible, that the renumbered mesh has larger bandwidth than the available on-chip memory, these cases should be handled too. In this section we show an AM1 based method, which generate an input order which has at most a pre-specified serial-bandwidth. In the input order every vertex executed once, but can be loaded many times, so we need a flag Ex to store it is only a ghost node(false) or have to be calculated(true). Serial bandwidth in this job means the following: for every Ex=true vertex all of their neighbors surely be in the on-chip memory when the execution reaches them. If the defined bound is less than the serial-bandwidth for whole mesh provided by the AM1 method, the input will be k-times longer, where 1 ≤ k. The bound on bandwidth obviously have to be more than the maximal degree of the graph.
AM1 based bounded S BW method
The main concept of handling the bounded bandwidth is the usage of a proper serial bandwidth estimation, which is available in AM1 method. When a Part reaches the S BW bound, the process calculates which vertices can be executed, and call the AM1 method for the rest of vertices where Ex is not true. The main process starts new AM1 parts until all vertices are Ex=true. The output of the method is an access pattern(list of {index,Ex} pairs), where all vertex has true execute flag once, and can be appear many times as a ghost node.
Estimation of serial-bandwith: Given an AM1 Part, the task is to estimate its serial bandwidth.
AM1 estimates the part's bandwidth in each construction step. If i < I for node(i) in part P, it has all of its neighbors inside the part, so if the bandwidth is less than the bound when I become larger than k, node(k) can not increases S BW anymore. As shown earlier imp(i) is a lower bound on serial-bandwidth, but in the estimation a proper upper bound is required for the stopping condition.
Because in all step AM1 adds a node from u(I) to the part, we can calculate more than a proper upper bound for node(i), we can give the exact value.
S BW (i) = (n
Eq. 1 is equivalent to the definition of serial bandwidth 1 Equation 1 could be a stopping condition, but the proposed method has a less complex and useful upper bound for stopping decision defined in Eq. 2.
If Eq. 2 holds, AM1 continue to add nodes to part P, stop otherwise. This condition is not an upper bound for the whole part, but it provides in every step that node(I) has lower serial-bandwidth than the given bound. If I jumps to I' when AM1 adds node(n) to P, we can be sure that for all node(i) I ≤ i ≤ I 
Results and conclusions
It is obvious that the proposed algorithm generates access patterns which has lower S BW than a given bound. The input length multiplier k is a good parameter for measuring the solution quality.
(k-1)*100% of the vertices have to be reloaded from the main memory, but the processing still has 0% cache-miss 2 .
Measurements on three meshes with different S BW bounds can be found on Table 3 . The results shows that the solution quality mainly depends on the distance of the S BW bound and the maximal degree of the mesh. This is a really good news, because maximal degree is around 20-30 for a typical 3D mesh, while the S BW bound is around 10-40k 3 nowadays and increasing with each new generation of FPGA-s. The number of generated parts determine the time consumption of the proposed method, because in each restart of AM1 the algorithm calculates the pseudo diameter for the rest nodes. The results on 3d 015 shows that we can go below 25% of the original S BW with 15-30%
reload. This method gives the opportunity of deciding the size of the on-chip memory synthesized to the FPGA, so the designers can have more free area with sacrificing computational time.
2 assuming the handling of multiplicity problems 3 the bound is depending on sizeof(node) AM1 BW: the bandwidth provided by AM1 for the whole mesh overall length: length of the generated access pattern N: number of vertices, Algorithm tested on one core of an Intel P8400 processor
Arithmetic unit generation
The VHDL representation of the arithmetic unit optimized for speed and area was automatically generated via the framework presented at [7] . To reach high operating frequency the floating-point units shall be partitioned and a local control unit shall be assigned to every cluster. The objective is to minimize the number of extra FIFOs required for data synchronization between the clusters while guaranteeing that the signals of the control unit have tolerable fanout and do not decrease the operating frequency of the rest of the circuit. In [7] it was demonstrated that the placement of a partition generated by traditional partitioniers which minimize the previous objective is very challenging and the maximum operating frequency of the slowest floating-point unit cannot be reached with the standard Xilinx place and route tools. A new partitioning strategy was demonstrated which produces more cut nets than common strategies however the resulting partition can be easily mapped to the FPGA and the resulting circuit can operate near to the maximum operating frequency of the slowest floating-point units.
The main idea of the algorithm is to draw the graph into the plane before the partitioning starts. Position of specific parts of the circuit can be constrained by the Xilinx PlanAhead software. It enables the user to create rectangular placement constraints also called pblocks. The Xilinx P&R tools are likely to disperse the registers of the FIFOs which can limit the operating frequency of the circuit therefore separate pblocks were created for each FIFO buffer. As the Xilinx P&R tools were able to place and route the floating-point units inside the clusters no pblock were created for the floating-point units just for the clusters. The pblocks was placed manually using the positions of the placed vertices however in a future work this can be fully automatized if the area requirements of the clusters are also calculated.
Case study: Finite volume solver for the Euler equations
In this section we describe the computational modelling of compressible fluid and gas flows by some of the basic tools available in the field of Computational Fluid Dynamics (CFD). The art of CFD is heavily exploiting the enormous processing power available by recent computer technology. Indeed, its central concept is the approximation of the continuous model problem by a discrete one, requiring the processing and the manipulation of a huge amount of data.
Fluid Flows
A wide range of industrial processes and scientific phenomena involve gas or fluids flows over complex obstacles, e.g. air flow around vehicles and buildings, the flow of water in the oceans or liquid in
BioMEMS. In such applications the temporal evolution of non-ideal, compressible fluids is quite often modelled by the system of Navier-Stokes equations. The model is based on the fundamental laws of mass-, momentum-and energy conservation, including the dissipative effects of viscosity, diffusion and heat conduction. By neglecting all non-ideal processes and assuming adiabatic variations, we obtain the Euler equations [21, 22] , describing the dynamics of dissipation-free, inviscid, compressible fluids. They are a coupled set of nonlinear hyperbolic partial differential equations, in conservative form expressed as ∂ρ ∂t + ∇ · (ρv) = 0 (3a)
where t denotes time, ∇ is the Nabla operator, ρ is the density, u, v are the x-and y-component of the velocity vector v, respectively, p is the thermal pressure of the fluid,Î is the identity matrix, and E is the total energy density defined by
In equation (3d) the value of the ratio of specific heats is taken to be γ = 1.4. For later use we introduce the conservative state vector U = [ρ, ρu, ρv, E] T , the set of primitive variables P = [ρ, u, v, p] T and the speed of sound c = γp/ρ. It is also convenient to merge (3a), (3b) and (3c) into hyperbolic conservation law form in terms of U and the flux tensor
as:
Discretization of the governing equations
Logically structured arrangement of data is a convenient choice for the efficient operation of the FPGA based implementations [5] . However, structured data representation is not flexible for the spatial discretization of complex geometries. As one of the main innovative contributions of this paper, here we consider an unstructured, cell-centered representation of physical quantities. In the following paragraphs we describe the mesh geometry, the governing equations, and the main features of the numerical algorithm.
The geometry of the mesh
The computational domain Ω is composed of non-overlapping triangles. The i-th face of triangle T is labelled by f i . The normal vector of f i pointing outward T that is scaled by the length of the face is n i . The volume of triangle T is V T . Following the finite volume methodology, all components of the volume averaged quantities are stored at the mass center of the triangles. 
The Discretization Scheme
Application of the cell centered finite volume discretization method leads to the following semi-discrete form of governing equations (5) dU
where the summation is meant for all three faces of cell T , and F f is the flux tensor evaluated at face f . Let us consider face f in a coordinate frame attached to the face, such, that its x-axes is normal to f (see Fig. 5 ). Face f separates triangle L (left) and triangle R (right). In this case the F f · n f scalar product equals to the x-component of F (i.e. F x ) multiplied by the area of the face. In order to stabilize the solution procedure, artificial dissipation has to be introduced into the scheme. According to the standard procedure, this is achieved by replacing the physical flux tensor by the numerical flux function F N containing the dissipative stabilization term. A finite volume scheme is characterized by the evaluation of F N , which is the function of both U L and U R . In this paper we employ the simple and robust Lax-Friedrichs numerical flux function defined as
In the last equation F L =F x (U L ) and F R =F x (U R ) and notations |ū| and |c| represent the average value of the u velocity component and the speed of sound at an interface, respectively. The temporal derivative is dicretized by the first-order forward Euler method:
where U n T is the known value of the state vector at time level n, U n+1 T is the unknown value of the state vector at time level n + 1, and ∆t is the time step.
By working out the algebra described so far, leads to the discrete form of the governing equations to compute the numerical flux term F and the dissipation term D.
whereR n f is the rotation tensor describing the transformation from the normal-parallel coordinate frame of face f to the x − y frame. Quantity F f is defined in a coordinate frame attached to face f , with such an orientation that state left is identical to the state of the update triangle T while state right corresponds to the state of the triangle situated at the opposite side of the face. With these conventions, the normal component of the numerical flux function is given by
7 Results, performance
The generated arithmetic unit will be implemented on our AlphaData ADM-XRC-6T1 reconfigurable development system [23] equipped with a Xilinx Virtex-6 XC6VSX475T FPGA [24] and 2Gbyte onboard DRAM in four 32bit wide banks running on 8000MHz.
In our test case the cell centered approach is used therefore each triangle is represented by a node and each node has 3 neighbors except at the boundaries where ghost nodes are used to implement inflow, outflow or wall boundary conditions. For simplicity size and normal vector of all faces are precomputed which results in higher memory bandwidth requirement, but the arithmetic unit will be simpler. The architecture is implemented using both single and double precision floating point numbers. The node data structure contains four time dependent variables [ρ, ρu, ρv, E] and the area of in each clock cycle. In the double precision case this bandwidth cannot be provided on our prototyping board and the system will be memory bandwidth limited. It should be noted that computing face normals on the FPGA using the coordinates of the vertices results in about 30% smaller memory bandwidth requirement at a price of more complicated on-chip memory structure and arithmetic unit.
Results from the output of one processor can be feed directly to another processor linearly increasing the performance of the architecture without additional bandwidth requirement.
Test setup
To show the efficiency of our solution a complex test case was used, in which a Mach 3 flow over a forward facing step was computed. The simulated region is a two dimensional cut of a pipe which has 
Performance comparison
During a comparison various mesh sizes are used with 7,063 to 394,277 triangles. Performance of our architecture is estimated using the result of static timing analysis which is indicated 390MHz operating frequency in the double precision case. Three clock cycles are required to update the state of one triangle therefore performance of one processor is 130million triangle update/s. Computation of one triangle requires 213 floating point operations therefore performance of our architecture is 27.69GFLOPs. On the Virtex-6 XC6VSX475T FPGA three arithmetic units can be implemented resulting in 83.07GFLOPs cumulative computing performance.
Performance of our architecture is compared to a high performance Intel Xeon E5620 microprocessor running on 2.4GHz clock frequency. On the microprocessor a single core is used and the simulation is carried out with and without renumbering the nodes. Performance of the simulations is shown in Figure 8 . As we expected without renumbering the performance of the simulation is steadily decreasing as the mesh size is increased while renumbering the nodes improved the performance of the Reference solution for the previous problem computed by the more accurate residual distribution upwind scheme can be found in [26] .
Conclusions
A framework for accelerating the solution of PDEs using explicit unstructured finite volume discretization is presented. Irregular memory access patterns can be eliminated by using the proposed memory structure which results in higher available memory bandwidth and full utilization of the arithmetic unit. Efficient use of the on-chip memory is provided by a new node reordering algorithm which can be extended to generate fixed bandwidth partitions. The new algorithm is comparable to the well known GPS algorithm in both runtime and quality of the results.
Generation of the application specific arithmetic unit is described by using a complex numerical problem solving the Euler equations. The discretized state equations are automatically translated to a synthesizable VHDL description using Xilinx floating-point IP cores. Performance of the arithmetic unit is improved by using partitioning and local control units. Nodes of the arithmetic unit is rearranged to minimize edge crossing which helped placement of the partitions and improved clock frequency of the design.
Performance comparison of the architecture using a single processor running on 390MHz showed that 30 times speedup can be achieved compared to a high performance microprocessor core. Computing performance can be further improved by implementing three processors on one FPGA reaching 90 times speedup.
Currently size of the mesh is limited by the bandwidth of its adjacency matrix which must be smaller than 40,000. The architecture should be improved to efficiently handle multiple partitions and extended to use multiple FPGAs during computation.
