Computing applications in FPGAs are commonly built from repetitive structures of computing and/or memory elements. In many cases, application performance depends on the degree of parallelism, ideally the most that will fit into the fabric of the FPGA being used. Several factors complicate determination of the largest structure that will fit the FPGA: arrays that grow nonlinearly and in uneven step sizes, coupled structures that grow in different polynomial order, multiple design parameters controlling different aspects of the computing structure, and interlocked usage of different hardware resources. Combined with resource usage that depends on application-specific data elements and arithmetic details, these factors defeat any simple approach for scaling the computing structures up to the FPGA's capacity. We present a formal analysis of maximizing FPGA utilization, with adaptations that simplify the optimization problem. We also report on design tools containing extensions that support automated sizing of FPGA-based computation arrays.
INTRODUCTION
Recent product announcements by Cray [2] and Silicon Graphics, Inc. [12] show that FPGA acceleration is entering the mainstream of performance computing. Tools for developing FPGA-based application accelerators are still oriented toward traditional logic rather than computation, however, leading to the observation that "10×-100× of performance ... has been at the cost of 10×-100× increase in difficulty in application development." [6] Much of this difficulty comes from the need to manage an FPGA's biggest advantage, its inherently massive, fine-grained parallelism. In many applications, however, well understood repetitive structures such as systolic arrays can be used to manage the parallelism [15] [16]. These regular arrays offer the possibility of effectively unbounded growth of computation arrays as FPGA sizes increase, with the promise of increased application throughput or support for problems of increased size.
One challenge to tools for FPGA-based computation lies in maximizing the parallelism available to the application, i.e. determining the size of the largest possible computation array that fits into the FPGA. Three factors determine the array size: 1) resource limits set by the FPGA hardware, 2) resources needed for each data or computation element in the array, and 3) application specific constraints on the numbers of elements allowed in the processing array.
Contributions in this paper address automated maximization of FPGA-based computation arrays, both as a theoretical problem in optimization and as practical implementation in prototype design tools. This requires handling of nonlinear scaling laws in computation arrays of interest, multiple different growth-limiting resources in the FPGA (including logic, RAM, and hardware multipliers), and customization of the arrays for different applications.
The remainder of this introduction presents the idea of application families, or reusable communication structures that can be instantiated with different data elements and with different inner components. Section 2 briefly reviews related work in optimization and in design space exploration (DSE). It will be seen that the current problem has different goals than previous DSE research, but that many valuable ideas from DSE are applicable to automated sizing of computation arrays. Section 3 describes the problem of optimizing the computation array in detail, with examples of FPGA and application complications that must be addressed. This section ends with a formal statement of array sizing as a multidimensional optimization problem. Section 4 presents one solution to the array sizing problem, as implemented in the prototype LAMP tool set [13] [14]. This includes language primitives for heuristic resource estimation, and guidelines for implementing the optimization formalisms in terms of LAMP language features. We show how this addresses the reusable portion of the application accelerator, the portion unique to each instance of the accelerator, and the idiosyncrasies of each FPGA, with respect to allocation of its computing resources. Finally, Section 5 describes factors not addressed in the current implementation and suggests ways in which the current LAMP strategy may be strengthened.
Application Families
The problem of scaling an array of processing elements¹ (PEs) arises when creating computation accelerators for families of applications. A family of applications is defined by a reusable computing pipeline and communication structure, within which the application-specific data type and leaf computations are free to vary.
One example of an application family addresses approximate string matching [15]. In that case study, the character strings may be DNA with a four-letter (two bit) alphabet, protein sequences built from a twenty-letter alphabet (five bits each), codons (six bits), or any other type the user chooses. Character comparison functions also vary according to application choice: exact equality tests, wildcard matches, or a function returning graded goodness-of-match scores. Another application family arises in generalized 3D correlations used for screening molecules as potential drug leads [16]. In this case, the number of bits in voxel data values and the amount of logic needed for scoring functions vary widely according to the kinds of chemical interactions being modeled. In both string matching and generalized correlation, the structure of the systolic array is held constant across all members of the application family, but the number of array elements depends on the amount of logic required for each element.
In cases like these, larger arrays support higher parallelism, and so support higher throughput or larger problem sizes. One design goal, therefore, is to implement the biggest array possible for a given problem. Figure 1 illustrates how the FPGA's fine-grained resource allocation trades off amounts of logic per PE against numbers of PEs in the computation array. Each member of each application family allocates only as many gates as needed for its data elements and PEs. As a result, family members with lower resource allocation per PE can generate more PEs, allowing larger computations to be performed. Likewise, an increase in FPGA capacity creates an opportunity for more PEs, more parallelism, and higher application performance.
¹ The conventional term "processing element" is extended to cover any of the repeatable elements in a processing array, including memory units that do not actually perform any arithmetic or logical operations.
RELATED WORK
Standard FPGA design tools attempt to increase FPGA resource utilization through better placement and routing choices [3] or through new algorithms for choosing between available resources within an FPGA [8]. In these cases, the design tools perform low-level tradeoffs between resource instances or resource types in order to implement a fixed logic design. These optimizations sometimes replicate logic structures in order to meet timing requirements or to improve performance. This kind of replication does not change the apparent degree of parallelism in the system. Array sizing, by contrast, changes the number of PEs in visible ways, in order to increase the size or speed of the computation.
Design space exploration (DSE) is another family of approaches to increasing FPGA utilization. This generally considers an application of strictly defined function, and tries multiple implementations of the application subsystems. In some forms, DSE proposes alternative implementations of a fixed system definition with different space-time tradeoffs [3], or with maximum speed [5].
Another DSE approach seeks to reduce hardware costs through temporal partitioning of one algorithm into multiple logic patterns for one FPGA, subject to the constraint that the FPGA computation and reload times meet stated timing constraints [11] .
DSE differs from the current study in one major respect: it holds the function performed by the application constant. Various DSE approaches then try to reduce hardware cost or improve other performance criteria, possibly within some performance constraint. This is in contrast to the current work, where a single parameterized implementation is used, but the number of PEs can vary up to the capacity of the chosen FPGA.
INPUTS TO THE OPTIMIZATION PROBLEM
Whatever the computation array and FPGA capacity, the accelerator designer generally wants one thing that no current design tools are able to state explicitly: as many PEs as possible. This indefinite number depends on the resource utilization per PE, permissible sizes for computation and memory arrays, and FPGA capacity.
Three major sources of information affect an application accelerator's implementation. First, the choice of FPGA specifies the available amounts of each computing resource. Second, resource utilization specific to a particular member of the application family specifies the amount of FPGA fabric needed for each PE. Third, the scaling law for an application states the numbers of PEs in the computing array, which grow by irregular step sizes. The remainder of this section discusses how those factors combine to define an application-specific accelerator.
FPGA resources
The FPGA resources include the programmable logic, hardware multipliers, block RAMs, and other features accessible to the logic designer. A larger FPGA in a given product family contains more of some or all resources, potentially allowing a larger computation array for a given application's accelerator. The resources of interest are expected to differ between applications or application families; one family member may require hardware multipliers where another does not, for example. Care must be taken in creating the resource abstraction since some resources are available only in specific quanta, such as block sizes for RAM bits. Extra care must be taken when the abstraction must cross FPGA product lines, since resources from different vendors are not always directly comparable. Block RAMs typify resource differences between vendors: the Xilinx Virtex-II Pro products contain 18Kb block RAMs, but comparable Altera Stratix-II chips offer a combination of 512b, 4Kb, and 512Kb RAMs.
Application-specific resource usage
The case study of [16] shows how different members of an application family differ in their consumption of FPGA resources. It uses a generalized 3D correlation of the form:
The bit-width of score S, data types of voxels a and b, and the logic of function F differ in each member of the application family, in order to represent the chemical phenomena modeled by that family member. As a result, the numbers of data bits and the size of each PE in the computing array differ between family members, which in turn affects the amount of logic needed for instantiating each PE. The computation array also includes FIFOs for holding intermediate results. Both the word size of S and the dimension of the correlation array affect the number of bits in each FIFO, and therefore affect the number of FPGA block RAMs needed for implementing the FIFOs.
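As an illustrative sketch only, a correlation using these symbols can be written as shown here. It assumes that the generalization replaces the multiplication in an ordinary sum-of-products correlation with the function F. The index names x, y, z (voxel position) and i, j, k (relative offset between the two voxel grids) are assumptions introduced for this sketch, as is the use of summation to combine the per-voxel values.

```latex
S_{i,j,k} \;=\; \sum_{x,\,y,\,z} F\!\left(a_{x,y,z},\; b_{x+i,\;y+j,\;z+k}\right)
```

With F(a, b) = a · b this reduces to the standard 3D correlation. Other choices of F and of the voxel data types give other members of the family, each with a different PE size.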
Growth laws in computing arrays
A growth law is the set of arithmetic rules defining the allowable sizes of PE arrays, in terms of one or more structural parameters. Structural parameters are numerical design parameters that control the sizes of computation arrays, but are not necessarily exact numbers of PEs. Figure 2 illustrates growth laws for several kinds of computation array. Figure 2A, a linear array, is the simplest. It allows an array of size $N$ for any positive integer value of the structural parameter $N$. If, as in Figure 2B, an array has arbitrary rectangular shape, then there are two different structural parameters, $N_1$ and $N_2$, giving the dimensions of the array. In this case, the numbers of PEs would increase in quanta of whole rows or columns. Figure 2D demonstrates an exponential rather than polynomial growth law, as just one example of growth laws of arbitrary complexity. In such a structure, the number of PEs would be constrained to values $2^N - 1$, for integer values of structural parameter $N$. Figure 2C illustrates multiple coupled structures, possibly representing a cubical computation array, a square array for aggregating values from rows in the cube, and a linear array that aggregates values from columns of the square array. Clearly, the sizes of all three sub-structures are coupled through the one structural parameter $N$. Each of the three sub-structures is assumed to contain different kinds of PEs with different resource requirements. The growth law is represented by a polynomial with separate terms for each of the inter-related structures:
$$k_3 N^3 + k_2 N^2 + k_1 N,$$
where the constants $k_i$ capture the different resource demands of each sub-structure. Figure 2E shows a linear structure like that in 2A, but with the added complication of consuming two different FPGA resources: logic and RAM. Depending on the specific FPGA's resource availability and the application's resource consumption, either of the two resources could be the one that limits the size of the array.
Of course, all of these features can occur in the growth law for any one system. One computing array can involve multiple structural parameters describing nonlinear relationships between coupled sub-structures, with interlocking terms for different types of FPGA resource.
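The growth laws described above can be sketched as small Python functions. This is a minimal illustration, not part of LAMP: the function names are hypothetical, the resource coefficients are made-up values, and reading the $2^N - 1$ law as a complete binary tree is an assumption.

```python
def linear_pes(n: int) -> int:
    """Linear array: any positive integer size is allowed."""
    return n


def rectangular_pes(n1: int, n2: int) -> int:
    """Rectangular array: PEs are added in whole rows or columns."""
    return n1 * n2


def tree_pes(n: int) -> int:
    """Exponential law, e.g. a complete binary tree with n levels (assumed)."""
    return 2 ** n - 1


def coupled_usage(n: int, k3: int, k2: int, k1: int) -> int:
    """Coupled cube, square, and linear sub-structures sharing one parameter n.

    k3, k2, k1 are per-PE resource demands of each sub-structure
    (hypothetical values supplied by the caller).
    """
    return k3 * n ** 3 + k2 * n ** 2 + k1 * n


# Allowed sizes grow in uneven steps as the structural parameter increases.
coupled_steps = [coupled_usage(n, 1, 1, 1) for n in range(1, 5)]
print(coupled_steps)  # [3, 14, 39, 84]
```

The uneven gaps between successive values show why the largest array that fits cannot be found by simply dividing FPGA capacity by a per-PE cost.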
Array growth: not just loop unrolling
Loop unrolling is a standard compilation technique, both in software compilation and in synthesis from high level languages [1]. It could, in principle, be one automated approach to expanding a linear structure up to the capacity of the FPGA, even though there do not appear to be published reports of this having been done. The factors illustrated in Figure 2 show why loop unrolling is inadequate in the general case of array sizing. The first problem is that linear techniques for loop unrolling work badly on non-linear computing structures such as cubes or trees. The second problem, suggested by Figure 2B, occurs when multiple loops are candidates for unrolling, with no clear way to decide how much unrolling should be performed along each axis.
A third problem is illustrated in Figure 2C . Typical design decomposition will break the compound structure into multiple communicating components, in any of several meaningful ways. One designer might separate the three substructures into separate components; another designer might group a plane of the cube with a column of the 2D array and a cell of the linear array. In either case, the structural parameter creates a coupling between multiple design components. Typical loop analysis operates locally, within a single component. Total resource allocation, however, depends on global analysis of multiple, coupled design components.
Optimizing the computation array
Equation (2) summarizes the problem of optimizing the computing array for a given member of some application family, subject to the constraints of a specified FPGA.
Symbols in Equation (2) have the following meanings, defined by the FPGA, the application family, or the specific member of the application family being instantiated:
$N_{opt}$ defines the most desirable configuration possible, or one of the configurations tied for most desirable.
$N = (n_1, n_2, \ldots, n_I)$ is the set of structural parameters that define an accelerator configuration, a tuple of positive integer values. These parameters and their meanings are defined by the application family. It is worth noting that this approach inverts the usual sense of the structural parameters. In standard design methodologies, they are design inputs; in this usage, they are consequences of other design decisions.
$S_j(N, B)$ are synthesis estimation functions that state the consumption of FPGA resource $j$, given structural parameter values $N$ and application-specific usage coefficients $B$. They determine the amount of each FPGA resource needed for a given member of an application family at a given size of computation arrays.
Equation (2) states that the most desirable accelerator is the largest one that fits the FPGA-specific resources, once the application family growth laws and application-specific usage demands have been specified. Maximizing $U(N)$ for configuration parameters $N$ is, in general, a difficult problem, the exact solution of which is beyond the scope of this discussion. Because realistic parameter sets $N$ have modest numbers of structural parameters and modest integer ranges, exhaustive search of the configuration space is assumed to be acceptable.
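Using the symbols named in this paper (utility function U, validity predicate V, resource estimation functions S_j, usage coefficients B, and the resource constants R for FPGA F), the sizing problem can be sketched as a constrained maximization. The notation below is an assumption assembled from those descriptions. In particular, the per-resource subscript j on R is introduced here for clarity.

```latex
N_{opt} \;=\; \arg\max_{N} \; U(N)
\quad \text{subject to} \quad
V(N) \;\wedge\; \forall j:\; S_j(N, B) \,\le\, R_{F,j}
```

Read this way, V encodes the application's growth law (which tuples N are permissible at all), each inequality encodes one FPGA resource limit, and U ranks the configurations that satisfy both.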
Simplifying assumptions
As a simplification, predicate $V$ and functions $U$ and $S_j$ are assumed to be monotonic in the following senses. Let $N$ and $N'$ be tuples of structural parameters, such that $N'$ is identical to $N$ in all positions except one where $n_i < n_i'$.
Predicate $V(N)$ is said to be monotonic if $\neg V(N) \Rightarrow \neg V(N')$.
Holding all other $n_j$ ($j \neq i$) constant, that means that there is some limit for $n_i$ below which all configurations are valid and above which they are not.
It is also assumed that $U(N') \ge U(N)$, i.e. that the value of an accelerator increases monotonically in all components (and subsets of components) of $N$. Functions $S_j$ are assumed monotonic in the same sense as $U$, for any given application family member characterized by some fixed $B$.
These constraints are not necessary for Equation (2) to be meaningful. There is, however, intuitive appeal in the idea that larger $n_i$ represent larger computation structures, and so have utility $U$ at least as high. There is also appeal in the intuition that larger structures consume at least as much of each FPGA resource, according to functions $S_j$. The real reason for monotonicity is pragmatic, however. Monotonic objective functions $U(N)$ are far easier to maximize than non-monotonic functions. Monotonicity of $S_j(N, B)$ helps in limiting the number of configurations examined. Once some resource limit is exceeded, it is no longer necessary to continue examining configurations with larger values of $n_i$, since larger accelerators would consume at least as much of any resource $j$, and would continue to violate resource constraints. Also, $n_i$ being positive and predicate $V(N)$ being monotonic take the place of some alternative mechanism for setting lower and upper bounds on values for structural parameters, i.e. for limits to the parameter space to be searched.
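The pruning argument above can be illustrated with a short exhaustive search. This is a sketch in Python, not LAMP code: `best_config`, its parameters, and the example cost figures are all hypothetical. It relies on the monotonicity assumptions to stop increasing a parameter as soon as the smallest configuration containing it no longer fits.

```python
from typing import Callable, Optional, Sequence, Tuple

Config = Tuple[int, ...]


def best_config(
    dims: int,
    utility: Callable[[Config], float],
    usage: Callable[[Config], Sequence[float]],
    limits: Sequence[float],
    valid: Callable[[Config], bool] = lambda cfg: True,
    max_value: int = 4096,
) -> Optional[Config]:
    """Search positive-integer tuples of length dims, pruning by monotonicity."""

    def feasible(cfg: Config) -> bool:
        return valid(cfg) and all(u <= r for u, r in zip(usage(cfg), limits))

    best: Optional[Config] = None

    def search(prefix: Config) -> None:
        nonlocal best
        remaining = dims - len(prefix)
        for n in range(1, max_value + 1):
            # Smallest configuration that extends prefix with this value of n.
            smallest = prefix + (n,) + (1,) * (remaining - 1)
            if not feasible(smallest):
                break  # monotonicity: no larger configuration can fit either
            if remaining == 1:
                if best is None or utility(smallest) > utility(best):
                    best = smallest
            else:
                search(prefix + (n,))

    search(())
    return best


def example_usage(cfg: Config) -> Tuple[int, int]:
    """Made-up costs: 100 logic cells per PE, one block RAM per row."""
    rows, cols = cfg
    return (100 * rows * cols, rows)


# Hypothetical FPGA with 1000 logic cells and 4 block RAMs.
best = best_config(2, lambda c: c[0] * c[1], example_usage, limits=(1000, 4))
print(best, example_usage(best))
```

Several configurations may tie for the highest utility. The sketch keeps the first one found, which matches the definition of the optimum as one of the configurations tied for most desirable.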
LAMP IMPLEMENTATION
Equation (2) provides an analytic model for exploring the space of array sizes, but that optimization cannot be expressed in current hardware description languages. Experience suggests that function $U$ and predicate $V$ often have convenient representation in closed form, but that resource estimation functions $S_j$ are complicated because they incorporate knowledge of many design components. Values $B$, representing resource utilization of any given application family member, depend on the numbers of bits in the data values and on the complexity of the member's unique logic elements. No existing hardware design languages make these values available explicitly.
The Logic Architecture by Model Parameterization (LAMP) tool set [13] supports parameter search using two basic mechanisms: heuristic leaf estimation, and extensions to the underlying HDL on which LAMP operates.
LAMP tools acknowledge that an FPGA-based accelerator involves at least two developers with different skills, design responsibilities, and preferred programming tools: a logic designer who creates the communication and pipeline structure, and an application specialist who uses the accelerated computation. The logic designer uses a standard HDL such as VHDL, with an XML-based markup language (LAMPML) to parameterize the leaf functions and data types in the design. Standard hierarchical logic design already creates a tree structure. The hierarchical root is the outermost design component; branches and subbranches represent instances of components and of their inner components. LAMP allows the logic designer to overlay the call graphs of resource estimation functions onto this design hierarchy. The logic designer uses LAMP markup to define functions in each component that estimate the usage of each resource for a given set of structural parameters. That estimation function is defined in terms of estimation functions exported by the inner components, numbers of component instances, knowledge of the logic structures outside of LAMP control, and leaf resource estimates. Each component uses only local knowledge of its own structure and of symbols exported by its immediate inner components, as recommended by the Law of Demeter [9], but the recursive structure makes a global resource estimate available at the root level.
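The recursive estimation scheme can be illustrated outside of LAMP with a small Python model. This is a sketch under stated assumptions, not LAMPML: the `Component` class, its fields, and all cost figures are hypothetical. Each component knows only its own local cost and how many instances of each immediate inner component it creates, yet the root obtains a global estimate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Params = Dict[str, int]     # structural parameters, e.g. {"N": 8}
Resources = Dict[str, int]  # resource name -> estimated amount


@dataclass
class Component:
    name: str
    # Cost of logic owned directly by this component (outside its children).
    local_cost: Callable[[Params], Resources]
    # Immediate inner components, each with an instance-count function.
    children: List[Tuple["Component", Callable[[Params], int]]] = field(
        default_factory=list
    )

    def estimate(self, params: Params) -> Resources:
        """Combine local cost with estimates exported by immediate children."""
        total = dict(self.local_cost(params))
        for child, count in self.children:
            instances = count(params)
            for resource, amount in child.estimate(params).items():
                total[resource] = total.get(resource, 0) + instances * amount
        return total


# Hypothetical leaf estimates for one family member.
pe = Component("pe", lambda p: {"logic": 120})
fifo = Component("fifo", lambda p: {"logic": 20, "ram": 1})

# Root: fixed control logic plus N PEs and N FIFOs.
array = Component(
    "array",
    lambda p: {"logic": 50},
    children=[(pe, lambda p: p["N"]), (fifo, lambda p: p["N"])],
)

estimate = array.estimate({"N": 8})
print(estimate)  # {'logic': 1170, 'ram': 8}
```

Swapping in a different leaf cost for `pe` models a different member of the application family without touching the root component.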
The application specialist defines application-specific data types (e.g. the kinds of characters used in string comparison) and leaf functions (e.g. the function that rates quality of match between characters). These type and function definitions are written in a C-like syntax unique to LAMP, and are coupled to LAMP markup in the HDL portion of the accelerator design. Since these definitions are written in LAMPML, they are accessible to the LAMPML language processor, and in particular to LAMP's resource estimation heuristics.
LAMPML provides two related primitives for resource estimation:
(1) n = synthSize(typeName) and (2) n = synthSize(fctn, typeName, …). The first form of synthSize returns the number of bits allocated to the data type named, somewhat the way the C language's sizeof operator works on types or values. The second form of synthSize estimates the number of logic elements needed to implement function fctn. Since LAMPML supports polymorphic function definitions, the caller must provide parameter type declarations to disambiguate the function implementation.
LAMPML provides a second level of polymorphism at the level of a HDL design component. A FIFO, for example, may be instantiated repeatedly with different data types of different widths. The synthSize function works with that type parameterization, so that form (1) of synthSize returns a value appropriate to the actual type bound to symbol typeName in each different instance of the component in which it occurs. Also, because LAMPML can treat function implementations as parameters, form (2) of synthSize returns potentially different values according to the bindings of its typeName parameters and depending on the local binding of actual function definition to symbol fctn.
Exact resource utilization figures can only come from actual synthesis, so LAMP uses conservative heuristics to estimate the numbers of logic elements needed for a function. The synthSize function is part of the LAMP language tool, but is loosely coupled to other parts of LAMP logic. That makes it easy to modify the synthSize logic as improved estimation techniques become available.
Additional input to LAMP, using the same LAMPML notation, describes the FPGA itself. This provides another place in which resource estimation functions can be defined. FPGA-dependent estimation functions are especially helpful for estimating block RAM utilization, which depends not only on the number of bits to be stored but also on the word width. As an example, consider a RAM logically organized as 128 words of 128 bits each. Block RAMs in the Xilinx Virtex-II family contain 18Kb, in word widths up to 36 bits. Even though one of those block RAMs holds enough bits to contain the 128×128 bits, it takes four Virtex-II block RAMs to implement the full word width. Block RAM sizes and supported word widths are different in different models of FPGA, so device-specific estimation functions help in retargeting of the application-specific part of an accelerator without changes to the code representing the application logic.
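The block RAM example can be checked with a short calculation. The sketch below is Python rather than LAMPML, and the function names are hypothetical. The table of depth-by-width aspect ratios is an assumption modeled on an 18Kb block with word widths up to 36 bits, and should be checked against the data sheet of the actual target device.

```python
import math
from typing import Sequence, Tuple

# Assumed (depth, width) aspect ratios for one 18Kb block RAM.
ASPECT_RATIOS: Sequence[Tuple[int, int]] = (
    (16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36),
)


def blocks_by_bits(words: int, word_bits: int, block_bits: int = 18 * 1024) -> int:
    """Naive estimate that counts only total storage bits."""
    return math.ceil(words * word_bits / block_bits)


def blocks_by_shape(
    words: int,
    word_bits: int,
    ratios: Sequence[Tuple[int, int]] = ASPECT_RATIOS,
) -> int:
    """Estimate that tiles the logical RAM with blocks of one aspect ratio.

    Tries each aspect ratio and keeps the one needing the fewest blocks.
    """
    return min(
        math.ceil(words / depth) * math.ceil(word_bits / width)
        for depth, width in ratios
    )


print(blocks_by_bits(128, 128))   # 1: the bits alone would fit in one block
print(blocks_by_shape(128, 128))  # 4: the 128-bit word width needs four blocks
```

A device with different block sizes or word widths would supply a different ratio table, leaving the application-side description of the RAM unchanged.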
Referring to Equation (2), it can be seen that the synthSize function fills the role of application-specific usage values $B$. LAMPML functions embedded in each logic component support recursive construction of estimation functions $S_j$. LAMP input representing FPGA specifics defines the constants $R_F$ for the resources in FPGA $F$. Validity test $V$ and utility scoring function $U$ can be written in LAMPML with the same notation used for the other functions described above.
Case studies have shown that a modest number of structural parameters $N$ describe many useful accelerator structures, and that useful values of structural parameters tend not to exceed a few hundred or a few thousand. Given the simplifying assumptions of section 3.6, exhaustive search of the parameter space is a practical approach to solving for $N_{opt}$, the most desirable accelerator configuration, in Equation (2).
Thus, LAMP primitives and programming features support automated array sizing in terms of the application family, details of the application family member, and FPGA-specific resources.
CONCLUSION AND FUTURE DIRECTIONS
Automated sizing of computation arrays is not addressed in current FPGA design tools, since they assume logic designs of fixed function rather than open-ended tasks in computation. As a result, existing tools do not support the optimization problem of selecting the largest computing array that will fit a given FPGA's resources. This paper presents a conceptual framework and a prototype design tool for performing that optimization automatically.
Synthesis estimation is not a solved problem, however, and the quality of LAMP's synthesis estimation affects its efficiency in allocating FPGA resources. Future versions of LAMP will improve the synthesis estimates. Improvements will come from better heuristics, but may also involve feedback between the synthesis tools and the sizing logic. Techniques now used in design space exploration may also be helpful, especially when different implementations trade off different FPGA resources, for example random logic vs. lookup RAMs or dedicated arithmetic units.
