Many compute-intensive applications generate single result values by accessing clusters of nearby points in grids of one, two, or more dimensions. Often, the performance of FGPA implementations of such algorithms would improve if there were concurrent, non-interfering access to all points in each cluster. When clusters contain dozens of points and access patterns are irregular, multiported memories are infeasible and vector-oriented approaches are inapplicable. Instead, the grid points can be distributed across multiple interleaved memory banks so that, when accessing any cluster, each point comes from a different memory bank. We present a general technique based on the knowledge of the application's multidimensional indexing. This technique maps access clusters into a custom-interleaved memory using the FPGA's multiple on-chip RAMs and configurable data paths. Case studies examine rectangular and nonrectangular grids of different dimensionality, including performance vs. resource tradeoffs when cluster sizes are not powers of two. We also present a prototype tool for generating interleaved memories automatically from concise, application-specific definitions.
INTRODUCTION
FGPA accelerators are entering the main stream of high performance computing. Researchers have demonstrated impressive performance gains using FPGA acceleration, 100-1000× in some applications. These applications have generally been hand coded by skilled logic designers, and tend to be built as point solutions to particular problems or families of problems. Few techniques are available for the general task of applying FPGAs to arbitrary computing problems in ways that use to their full potential.
This paper presents a highly adaptable family of memory structures that can be applied to a wide range applications in FPGA-based computing. It is applicable to problems in solid modeling, computational chemistry, cellular automata, discrete solutions to differential equations, and many other staples of high-performance computing. The applications under consideration may have idiosyncratic computation pipelines, but share one common feature. Each output value depends on input from a cluster of neighboring grid points, where dimensionality of the grid and the neighborhood of points in a cluster vary by application. In the applications being considered, computation throughput would improve if the whole cluster could be fetched in one memory cycle.
The contributions of this work create a step by step process for mapping application-specific reference patterns to memory structures with application-specific interleaving. The resulting memory stores grid points non-redundantly in multiple RAMs, allowing single-cycle, parallel access to all points in any one cluster. This kind of interleaving takes advantage of the FPGA's unique resources. It uses dozens or hundreds of the independently addressable RAMs in the FPGA, it is configured for the unique characteristics of each different application, and it relies on the FPGA's native ability to create custom interconnection networks with data path widths into the thousands of bits. We also describe a prototype application that generates customized VHDL implementation and testbench code for custom interleaved memories, based on a small number of user inputs.
This technique for application-aware memory interleaving is described as follows. First, in section 2, we present several example applications with memory reference patterns of the kind being addressed. Section 3 briefly surveys related work in traditional memory structures and in FPGA-based computing. Next, in section 4, we present case studies of interleaved memories purposebuilt for applications with different patterns of memory access. These examples lead up to section 5, where we present a systematic design process that creates interleaved memories tailored to the specific needs the application. We conclude with a brief description of the prototype tools we have developed for creating interleaved memory structures.
MOTIVATING APPLICATIONS
Many compute-intensive applications share a common characteristic: each step in the computation depends on an * This work was supported in part by NIH award #RR020209-01 and was facilitated by donations from Xilinx Corporation.
access cluster of points close to each other, in fixed positions relative to each other, and always used together in the computation. This description covers many kinds of problem in one or more dimensions, including: ⋅ Molecular dynamics. Particle-mesh models of molecule behavior use grids to approximate the vector fields representing forces that act on the atoms. Atom positions rarely align to grid points at which forces are computed, so 3D interpolation is necessary. Tricubic interpolation has been found effective [13] , and requires 64 grid points in a 4×4×4 neighborhood around each atom. ⋅ Volume rendering. Ray-casting algorithms [15] traverse a 3D grid of values representing the optical characteristics at each point in a 3D field, such as opacity in a plume of smoke. Figure 1 shows how a ray samples the field, generally at off-grid points requiring 3D interpolation. This has also been found useful in three-axis rotation of volumetric models for in silico drug screening [14] . ⋅ Image processing. Morphological operators in two [7] and three [11] dimensions have long been staples in image processing, as have convolution kernels [6] . In either case, the kernel defines an access cluster for fetching points from the pixel grid being processed. Sparse kernels, containing don't-care patterns or zero coefficients, are common. They ignore pixel values in parts of the image region they cover, creating application-specific opportunities for optimization. These examples show just a few of the ways in which gridbased computations differ from each other. First, they differ in the number of dimensions of the grid. Second, they differ in the number and position of points in the access cluster. Third, they do not always allow regular, predictable order of access to successive clusters of points. Ray-casting, for example, may traverse the grid along any diagonal. Molecular dynamics commonly uses atoms having irregular distributions of position. Some systems for FPGA computation have been able to avoid memory bandwidth problems by heavy reuse of data fetched in tightly controlled sequences [12] . Unfortunately, that approach fails for applications that access data in unpredictable order.
Parallelizing any of these computations can benefit from accessing multiple points in one cluster together. A von Neumann architecture inherently degrades performance in these cases by requiring that the points be accessed sequentially even when used concurrently. Concurrent access to all points in the cluster would improve performance significantly in these applications.
RELATED WORK
Processor designers have long faced problems caused by CPUs that issue memory requests faster than they can be fulfilled by memory. Interleaving improves memory bandwidth by partitioning memory into separate banks able to process concurrent access requests. By 1968, SIMD processors including ILLIAC IV [1] already used interleaving for reading multiple words in one access cycle. Other implementations, including the CDC 6400 [5] , accessed one word per cycle. Each access required multiple cycles, however, so pipelining allowed concurrency by overlapping several multi-cycle latencies. Whether the implementation used broad parallelism as in the ILLIAC, pipelined parallelism as in the CDC 6400, or a combination, high performance depended on concurrent activity in multiple memory banks. Pipelined parallelism has been more widespread than broad parallelism because it is compatible with systems that have one memory bus. Broad parallelism demands multiple paths from memory to the processing elements, implying high hardware cost when implemented in traditional technologies.
Successful parallelism implied that no concurrent memory accesses would collide by requiring service from the same memory bank [4] . Clever techniques have been used to reduce the probability of collisions between concurrent memory accesses, such as prime numbers of memory banks [12] or pseudorandom mapping of memory addresses to memory banks [16] . These approaches implement fixed structures, without respect to particular applications. Their intent is to break up regularities in the data access pattern, and so reduce the probability of delaycausing collisions at any one memory bank.
Modern FPGAs support different premises for the design of high-performance memory systems: 1. Essentially no cost for high-order memory interleaving, since FPGAs already have hundreds of independent block RAMs on chip [1] [17]. 2. Essentially no cost for broad parallel access to on-chip RAMs, as long as fetched data is used by on-chip processing elements. The many memory ports and connectivity resources required for broad parallelism already exist in the FPGA fabric, waiting to be exploited. 3. Inherent configurability, so there is no hardware cost in creating memory structures unique to each application. 4. Application specificity -interleaving memory in terms of the application's indexing strategy, dimensionality, and idiosyncracies of memory access, not just the flattened memory addresses. Other authors have described tools for using multiple onchip memories for holding array data in FPGAs [10] . They differ from the current work in that they assign all elements of any one array to one RAM bank. That work complements the current approach, which partitions the words of a single array across multiple RAM banks. 
CASE STUDIES
These examples show few kinds of interleaved memories, demonstrating access clusters of different dimension, size, and geometric character. In each case, the goal is to create a memory subsystem that accepts one grid index as input, and outputs a cluster of grid points (A, B, C, …) drawn from different memory banks.
For purposed of this discussion, RAM banks are logical, not physical structures. Banks are implemented using at least one of the FPGA's block RAMs, or as many as necessary for the memory's required capacity. Figure 2 shows a point P misaligned to a 2D grid. The value of point P is to be computed using bilinear interpolation between points in the cluster (A, B, C, D). The interpolation itself is not of interest at present, only the access to the cluster of points needed at that step.
Bilinear interpolation
The access cluster consists of points (x, y), (x+1, y), (x, y+1), and (x+1, y+1), where point A is considered the origin of the cluster. The x and x+1 index pair includes one odd and one even index, and likewise for the y coordinates. Suppose that four RAM banks are available, numbered 0 to 3. Put each grid point into a bank according to the LSBs of the coordinates, as shown in Table 1 . In each cycle, the banks each produce one value needed for an access cluster, so the entire cluster can be accessed at one time. Two problems remain: generating addresses for each of the banks, and aligning the bank outputs into fixed order for the next stage of the computation.
In order to understand these problems, it helps to treat the RAM array as a tile that repeats across the entire grid of (x, y) index values, as shown in Figure 3 . The position of each RAM bank within the tile is fixed, but there is a different correspondence between the set of banks and the (A, B, C, D) tuple according to the position of the cluster relative to the tile. Figure 3 shows the positions of each RAM bank within the basic tile, and the image of each bank in tiles covering the grid. In this example, access cluster ABCD has its lower-left corner at grid coordinate (3,2), assuming 0-based indexing. The cluster aligns to the tile in the y direction, but straddles the tile border in the x direction, so that A corresponds to RAM bank 2, B to bank 3, C to 0, and D to 1. For each value within the cluster, the bank's address depends on which instance of the basic tile it touches, and differs if (as here) the cluster crosses tiles.
For convenience, the grid coordinate of the access cluster as a whole is taken to be the coordinate of its lowerleft corner. After extracting the least significant bits (LSBs) from the cluster's x and y coordinates, two-dimensional indexing is converted into a linear address in the usual way:
where Table 3 shows Cluster of points accessed to compute P using bilinear interpolation which bank provides each of the four outputs according to the even/odd states of the indices. Each row of Table 3 represents a multiplexer (or mux). The mux's output is one of the points in the cluster, A, B, C, or D. Data inputs to the mux come from one of the RAM banks, 0, 1, 2, or 3. Select control for the multiplexer comes from the combination of x and y LSBs.
This completes the design of the interleaving. Figure 4 shows its internal structure. The subsystem is indexed using the (x, y) coordinate that represents the position of the access cluster. Blocks labeled +1? in the Address generation section perform the conditional increment operations of Table 2 . The resulting index values are combined in various ways to generate the addresses to the RAM banks in the RAM array section. Each bank generates its output, which is passed to the Output alignment section that implements the logic of Table 3 .
This overall structure is used in every interleaved memory system described in this paper. When a new interleaving is created for a different application, only the following features differ: ⋅ Index variables. 
Tricubic interpolation
The application in this case study comes from molecular dynamics in [13] . That study represents potential fields acting on atoms as values stored in a grid, with 3D interpolation used to determine the field values at off-grid atoms positions Oberlin and Scheraga observe that cubic interpolation gives better results than linear interpolation.
Instead of a 2×2×2 access cluster of grid points around each atom, however, tricubic interpolation requires a 4×4×4 cluster of grid points. We examine the interleaving needed for maximum parallelism in this application, without considering the actual interpolation arithmetic 1 . This extends the first case study, bilinear interpolation, in two directions. First, it uses three index values (x, y, z) rather than two. Second, it requires four points in each axis for finding the cubic equation to interpolate, rather than the two needed for linear interpolation. Again, the interpolation arithmetic is not of interest in current discussion. We describe only the interleaved memory needed for fetching the full set of 4×4×4=64 values in each access cluster.
Since there are 64 values in each access cluster, interleaving requires that number of RAM banks to ensure non-interfering access to all values in the cluster. The cluster is a 4×4×4 cube, with the same indexing behavior in the x, y, and z axes. We examine the x axis first, then treat the other axes in analogous ways.
Each access cluster fetches values from indices x, x+1, x+2, and x+3. Examine the two LSBs: each possible value (0, 1, 2, and 3) occurs once in this group, so those LSBs are used for output multiplexing. The x' values are x index values stripped of their LSBs, as in the bilinear case, but this application strips two LSBs rather than one.
There are four possible mod 4 starting points for the access cluster, shown in Figure 5 . If the cluster starts at a 0 mod 4 boundary, then all four x' values are the same. If the cluster starts at a 1 mod 4 boundary, then (x+3)'=x'+1 for RAM bank 0, and so on as shown in Table 4. This table  corresponds to Table 2 , but shows one index only. The full 1 Alternative solutions exist, including one that divides the problem into four 4×4 bicubic interpolations to be performed sequentially. That requires access to only 16 data points at a time, with 16-way interleaving. Table 4 has 64 rows, one for each of the RAMs, and handles all combinations of x, y and z offsets mod 4.
The Output alignment section of Figure 4 is also implemented in this application. This time, however, Output alignment has 64 data inputs, 64 outputs, and 6 bits of selection control (two LSBs from each index). For implementation efficiency, this is implemented as three layers of 4:1 muxes, rather than a single 64:1 mux.
This application demonstrates the value of FPGAs in creating interleaved memory structures. First, FPGAs such as the Xilinx Virtex II family easily satisfy the need for 64 independently accessible memory busses. The VP100 chips in that family contain over 400 block RAMs, but even 64 memory busses would be costly in competing technologies. Second, if data words are 16 bits each, every cluster access fetches 64*16=1024 bits of data, a 1Kbit transfer. Again, the FPGA's connectivity resources handle this word size readily, but it would be expensive in many other implementation technologies.
Hexagonal grids
Non-rectangular, hexagonal grids arise in some problems. This presents two difficulties. First, the grid must be mapped to a rectangular structure for 2D array indexing. Figure 6B , shows one approach, for the cluster centered on grid cell 2b. The second problem is that the cluster dimensions are not powers of two. Instead, the bounding box is 3×3. This can be handled directly, at the cost of additional addressing logic and access time (see Sec. 5.3)
Another way to think of this cluster is as a 7-element subset of 4×4 RAM array. This makes address computation easy, but require 16 banks for accessing a cluster of 7 values. The additional RAM banks may, at first glance, look like 130% overhead in the RAM allocation. On the other hand, adding those RAMs reduces the amount of addressing logic, possibly including block multipliers, and so reduces the delay in the addressing logic path. Also, depending on the total capacity of the RAM array, those block RAMs could have been required anyway, in order to provide the total number of words required for the memory structure. When hundreds of block RAMs are available, it is often worth-while to allocate additional block RAMs in order to reduce address logic delays and to reduce consumption of other FPGA resources.
Assuming the 4×4 RAM array, implementation is now straightforward. Address generation logic resembles that used in the tricubic interpolation example, but simplified to two dimensions. The Output alignment section has 16 data inputs, 7 data outputs, and 4 selection bits (two LSBs each from the x and y indices).
DESIGNING THE INTERLEAVED STRUCTURE
Design of the interleaved memory proceeds in a regular sequence of steps: 1. Define the access cluster used by the application. In cases like the hexagonal grid example, this may require some effort in converting the grid to a rectangular form. 2. Round the cluster size up to the next power of two in each dimension. Allocate one RAM bank per rounded-up cluster element to the RAM array of Figure 4 . 3. Create the Address generation network following the examples leading to Table 2 and Table 4.  4. Create an output mapping table like Table 3 , with one output per element in the access cluster. Use that table to create the Output alignment section of the array.
Step 1 requires insight into the application's reference pattern, and is sometimes affected by partitioning decisions. After that, generation of the interleaved memory system follows readily. This procedure can be modified in many ways, to make better use of available resources or to take advantage of features unique to some model of FPGA.
Multiported RAM
Block RAMs in many FPGA families have two independently addressable read ports. These can be used, in some designs, to reduce the number of block RAMs needed for the RAM array of Figure 6 .
Consider the array of four RAM banks used in the bilinear interpolation case study of section 4. Each bank number in that example is constructed from the LSBs of the x and y grid coordinates, as shown in Table 1 .
This can also be implemented with two dual ported RAM banks, named 01 and 23. RAM addresses [x', y'] are built from the x and y values stripped of their LSBs, as before. The difference is that bank 01 holds grid points for all points with even y coordinates and bank 23 holds all 
Broad parallel writing
The discussion so far has assumed that grid computations are dominated by read access to stored grid values. Particlemesh (PM) models in molecular dynamics present an opportunity for broadly parallel write access, as well. One phase of the calculation updates the grid of Coulombic potential values by summing the spatial distributions of each particle's charge [9] . Each particle's charge affects some spatial region, so must update several points in the grid. This creates an opportunity for broad parallelism in the write operation. Parallel writing requires a second multiplexing network similar to the Output alignment of Figure 4 , but arranged to map data input clusters to RAM banks rather than banks to output.
Non-2 N RAM arrays
The case studies of section 4 refer to the LSBs of grid coordinates and to coordinate values stripped of their LSBs. These, of course, represent x mod k or x/k, where k is a power of two. That is a convenience only, not a restriction on the values that can be used for the RAM array. With some additional complexity in the address generation logic, non-2 N sizes of RAM array can be accommodated. For example, the case study on hexagonal grids defines an access cluster that fits a 3×3 bounding box, requiring 9 RAM banks. It rounds the RAM array up to 4×4, requiring 16 banks, potentially an increase of 78% in RAM allocation. As noted earlier, the additional RAM banks represent an addition of RAM hardware that reduces access times and address generation logic, and may have been needed anyway to reach the total capacity required of the memory structure. In other cases, time and resource tradeoffs might favor the following solution.
The addressing logic of Equation 1 is interpreted slightly differently for a 3×3 array. The N Y value becomes 3, x' becomes x/3, and y' becomes y/3. Direct division of x/3 in digital logic is inconvenient. Instead, the same net effect comes from multiplying x by 1 / 3 in fixed-point format, i.e. 0.0101… 2 . One Virtex-II 18×18 block multiplier can handle x× 1 / 3 with enough accuracy to cover indices to 17 bits, and multipliers can be ganged for wider index ranges.
The LSBs of x and y in fact represent x mod N X and y mod N Y , for a RAM array of size N X ×N Y , in the special case where both are whole powers of two. In the case of the 3×3 RAM array for the hexagonal cluster, the mod-3 residue is required for both indices. The mod-3 value of each nonzero position in the binary representation of a number is known to be 1 or 2. This allows the mod-3 remainder to be computed by summing the mod-3 values of each nonzero bit in the index and using a small lookup table to get the final result.
Given these changes of interpretation, the directions of section 5 can be used with one modification: step 2 does not round up. The steps in this example are: 1. Define the access cluster. In this example, the 7-point access cluster for the hexagonal grid is used. 2. Take the bounding box (3×3) of the access cluster to be the dimension of the RAM array. 3. Create the Address generation network that implements the logic of 
DESIGN TOOL IMPLEMENTATION
Section 5 of this paper gives an orderly procedure for creating a wide variety of memory structures, which differ in dimensionality, in shape of access cluster, and in other ways. This lends itself well to automation, so we have created a Java application for constructing VHDL implementations of interleaved memories. This is available in source form at our lab's web site, www.bu.edu/caadlab . Our prototype RamInterleave program uses command line input to describe the memory structure required. To show its operation, consider the access cluster of Figure 7 . The following inputs describe it fully: dataBits=12 output=north,0,1 output=mid,0,0 output=east,1,0 output=ee,2,0 axis=longitude axis=latitude name=testRam testSize=64,128 The current implementation of RamInterleave must fix the data word size when the VHDL code is generated, using the "dataBits" keyword. This example uses 2D indexing, so two "axis" keywords specify names for the indices. The "name" keyword specifies the name of the VHDL component. The access cluster's data points are assigned symbolic names and relative offsets by "output" keywords. The VHDL component generated uses generic port values to specify the X and Y extent of the 2D array, so no memory size needs to be specified for the component itself. The testbench, however, instantiates a memory of the specific size given by the "testSize" keyword.
The program generates three output files. The VHDL entity definition appears in output file testRam.vhdl, with a filename generated in the obvious way from the "name" input value. A VHDL package named testRam_def exports the component interface and auxiliary definitions in file testRam_def.vhdl. A third file contains testRam_test_driver, a top-level component named in a similarly named .vhdl file. That generates and verifies a test pattern in an instance of the memory structure. Test data is generated algorithmically, so there is no need for auxiliary files of test vectors.
The current RamInterleave implementation shows the feasibility of automated memory generation, but lacks many desirable features. These include broad-write capability, non-2 N arrays, pipelining, alternative implementations for multiplexing networks, output other than VHDL, and exploitation of features specific to particular FPGA platforms or synthesis tools. We are currently extending the tool in these and other areas. We are also evaluating additional memory-access needs of several applications, including bioinformatics, drug docking and molecular dynamics.
ACKNOWLEDGEMENTS
We thank anonymous reviewers for suggestions that improved the clarity of this presentation.
