Abstract
Introduction
On-chip storage has become an essential component of high-density FPGAs. The large systems that will be implemented on these FPGAs often require storage; implementing this storage on-chip results in faster clock frequencies and lower system costs. Two implementations of on-chip memory in FPGAs have emerged: fine-grained and coarse-grained. In FPGAs employing fine-grained on-chip storage, such as the Xilinx 4000 FPGAs, each lookup table can be configured as a small RAM, and these RAMs can be combined to implement larger user memories [1]. FPGAs employing the coarse-grained approach, on the other hand, contain large embedded arrays which are used to implement the storage parts of circuits. Examples of such devices are the Altera 10K, Apex, and Stratix devices [2, 3, 4], the Xilinx Virtex and Virtex II FPGAs [5], the Actel 3200DX and SPGA parts [6, 7], and the Lattice ispLSI FPGAs [8].
The coarse-grained approach results in significantly denser memory implementations, since the per-bit overhead is much smaller [9]. Unfortunately, it also requires the FPGA vendor to partition the chip into memory and logic regions when the FPGA is designed. Since circuits have widely varying memory requirements, this "average-case" partitioning may result in poor device utilization for logic-intensive or memory-intensive circuits. In particular, if a circuit does not use all the available memory arrays to implement storage, the chip area devoted to the unused arrays is wasted.
This chip area need not be wasted, however, if the unused memory arrays are used to implement logic. Configuring the arrays as ROMs results in large multi-output lookup-tables that can very efficiently implement some logic circuits. In [10], a new tool, SMAP, was presented that packs as much circuit information as possible into the available memory arrays, and maps the rest of the circuit into four-input lookup-tables. It was shown that this technique results in extremely dense logic implementations for many circuits; not only is the chip area of the unused arrays not wasted, but it is used more efficiently than if the arrays were replaced by logic blocks. Thus, even customers that do not require storage can benefit from embedded memory arrays.
The effectiveness of this mapping technique, however, is very dependent on the architecture of the embedded memory arrays. If the arrays are too small, the amount of logic that can be packed into each will be small, while if the arrays are too large, much of each array will be unused. Previous studies have focused on the architecture of these memory resources when implementing storage [11, 12, 13]. Since the arrays are so effective at implementing logic, however, it is important that their design also take logic implementation into account. In [14], the effects of the depth, width, and flexibility of memory arrays used to implement logic were explored. That paper, however, only considered homogeneous memory architectures, i.e., architectures in which each memory array is identical. In this paper, we show that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array.
The goals of this paper are as follows:
1. The first goal is to quantify the density improvements that are possible with a heterogeneous memory architecture (compared to a homogeneous memory architecture) when used to implement logic.
2. There are many possible heterogeneous memory architectures (different array sizes, numbers, etc.). The second goal of this paper is to find the heterogeneous memory architecture that can most efficiently implement logic.
The architectural space explored in this paper is described in Section 2. Section 3 describes the experimental methodology and reviews the SMAP algorithm. Finally, Section 4 presents experimental results.
Embedded Array Architectures
Table 1 summarizes the parameters that define the FPGA embedded memory array architecture, along with values of these parameters for several commercial devices. In this paper we consider architectures with two different array sizes: we denote the number of bits in each type of array as B1 and B2. The number of arrays of each type is denoted N1 and N2. We assume that all arrays have the same set of allowable data widths, and denote that set by weff. For a fixed size, a wider memory implies fewer memory words in each array. In the Altera FLEX10K, for example, B = 2048 bits and weff = {1, 2, 4, 8}, meaning each array can be configured to be one of 2048x1, 1024x2, 512x4, or 256x8.
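The relationship between array size and allowable data widths described above can be illustrated with a minimal sketch. The function names `array_configurations` and `address_inputs` are our own and purely illustrative; the sketch assumes that every width divides the array size evenly and that every resulting depth is a power of two.

```python
def array_configurations(bits, widths):
    """Return the (depth, width) shapes available to one array of `bits` bits.

    For a fixed number of bits, a wider word implies fewer words.
    """
    shapes = []
    for width in sorted(widths):
        if bits % width != 0:
            raise ValueError(f"width {width} does not divide {bits} bits")
        shapes.append((bits // width, width))
    return shapes


def address_inputs(depth):
    """Number of address lines needed to select one of `depth` words."""
    if depth < 1 or depth & (depth - 1) != 0:
        raise ValueError("depth is assumed to be a power of two")
    return depth.bit_length() - 1


# The FLEX10K example from the text: B = 2048 bits, weff = {1, 2, 4, 8}.
for depth, width in array_configurations(2048, {1, 2, 4, 8}):
    print(f"{depth}x{width}: {address_inputs(depth)} address inputs, {width} outputs")
```

When an array is used as a ROM to implement logic, the address inputs play the role of lookup-table inputs and the data width gives the number of outputs, so each configuration trades inputs against outputs.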

Methodology
To compare memory array architectures, we employed an experimental methodology in which we varied the architectural parameters and mapped a set of 28 circuits to each architecture. Some of the sequential circuits were obtained from the Microelectronics Center of North Carolina (MCNC) benchmark suite, while the remaining sequential circuits were obtained from the University of Toronto and were the result of synthesis from VHDL and Verilog. All circuits were optimized using SIS [15] and mapped to four-input lookup-tables using Flowmap and Flowpack [16]. The SMAP algorithm was then used to pack as much circuit information as possible into the available memory arrays. The number of nodes that can be packed into the available arrays is used as a metric to compare memory array architectures. The results in this paper depend heavily on the SMAP algorithm, which was originally developed for architectures in which all arrays are the same size. The following subsection reviews SMAP, while the subsequent subsection shows how SMAP can be used to map logic to a heterogeneous memory architecture.
Review of SMAP
This section briefly reviews SMAP; for more details, see [10].
The SMAP algorithm is based on Flowpack, a post-processing step of Flowmap [16]. Given a seed node, the algorithm finds the maximum-volume k-feasible cut, where k is the number of address inputs to each memory array. A k-feasible cut is a set of no more than k nodes in the fan-in network of the seed such that the seed can be expressed entirely as a function of the k nodes; the maximum-volume k-feasible cut is the cut which contains the most nodes between the cut and the seed. The nodes that make up the cut become the memory array inputs. Figure 1(a) shows an example circuit along with the maximum 8-feasible cut for seed node A.
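The definition of a k-feasible cut and its volume can be made concrete with a short sketch. This is only a checker for a candidate cut, not the flow-based search used by Flowpack and SMAP. The network representation (a dictionary mapping each node to its fan-in nodes, with primary inputs mapped to an empty list), the function names, and the example network are all assumptions made for illustration.

```python
def cone_above_cut(fanin, seed, cut):
    """Walk back from `seed`, stopping at nodes in `cut`.

    Returns the set of nodes between the cut and the seed (seed included),
    or None if some primary input is reachable without crossing the cut,
    in which case the seed is not a function of the cut nodes alone.
    """
    seen = set()
    stack = [seed]
    while stack:
        node = stack.pop()
        if node in cut or node in seen:
            continue
        if not fanin[node]:          # primary input that is not in the cut
            return None
        seen.add(node)
        stack.extend(fanin[node])
    return seen


def is_k_feasible_cut(fanin, seed, cut, k):
    """True if `cut` has at most k nodes and fully determines `seed`."""
    return len(cut) <= k and cone_above_cut(fanin, seed, cut) is not None


def cut_volume(fanin, seed, cut):
    """Number of nodes between the cut and the seed (0 if not a cut)."""
    cone = cone_above_cut(fanin, seed, cut)
    return 0 if cone is None else len(cone)


# Hypothetical network: s = f(g1, g2), g1 = f(a, b), g2 = f(c, d).
fanin = {
    "a": [], "b": [], "c": [], "d": [],
    "g1": ["a", "b"], "g2": ["c", "d"], "s": ["g1", "g2"],
}
print(is_k_feasible_cut(fanin, "s", {"g1", "g2"}, 2))    # a small cut
print(cut_volume(fanin, "s", {"a", "b", "c", "d"}))      # a larger-volume cut
print(is_k_feasible_cut(fanin, "s", {"a", "b"}, 4))      # misses c and d
```

In this toy network, moving the cut from {g1, g2} back to the primary inputs raises the volume from one node to three, at the cost of needing four array inputs instead of two.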
Given a seed node and a cut, SMAP then selects which nodes will become the memory array outputs. Any node that can be expressed as a function of the cut nodes is a potential memory array output. Selecting the outputs is an optimization problem, since different combinations of outputs will lead to different numbers of nodes that can be packed into the arrays. In [10], a heuristic was presented; the outputs with the largest number of nodes in their maximum fanout-free cone (the maximum cone rooted at the potential output such that no node in the cone drives a node not in the cone) are selected. As shown in [10], those nodes in the maximum fanout-free cones of the outputs can be packed into the array. All other nodes in the network must be implemented using logic blocks. In Figure 1(a), nodes C, A, and F are the selected outputs; Figure 1(b) shows the resulting circuit implementation.
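The maximum fanout-free cone and the output-selection heuristic can be sketched as follows. This is a simplified illustration under our own assumptions, not the implementation of [10]: the network is a dictionary mapping each node to its fan-in nodes, the root is exempt from the fanout condition because it becomes an array output, the names `max_fanout_free_cone` and `rank_outputs` are hypothetical, and the ranking ignores any overlap between the cones of different candidates.

```python
def max_fanout_free_cone(fanin, root, stop=frozenset()):
    """Largest cone rooted at `root` in which no node other than the root
    drives a node outside the cone. Nodes in `stop` (for example the cut
    nodes) are never added."""
    # Build the fanout relation from the fan-in lists.
    fanout = {node: set() for node in fanin}
    for node, inputs in fanin.items():
        for source in inputs:
            fanout[source].add(node)

    cone = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for pred in fanin[node]:
            if pred in cone or pred in stop:
                continue
            # A predecessor joins only if everything it drives is in the cone.
            if fanout[pred] <= cone:
                cone.add(pred)
                frontier.append(pred)
    return cone


def rank_outputs(fanin, candidates, width, stop=frozenset()):
    """Pick up to `width` candidates with the largest fanout-free cones."""
    def cone_size(node):
        return len(max_fanout_free_cone(fanin, node, stop))
    ordered = sorted(candidates, key=lambda n: (-cone_size(n), n))
    return ordered[:width]


# Hypothetical network: n1 drives n2 and n3; n3 drives both o1 and o2.
fanin = {
    "a": [], "b": [], "c": [],
    "n1": ["a", "b"], "n2": ["n1", "c"], "n3": ["n1"],
    "o1": ["n2", "n3"], "o2": ["n3"],
}
cut = frozenset({"a", "b", "c"})
print(sorted(max_fanout_free_cone(fanin, "o1", cut)))
print(rank_outputs(fanin, ["o1", "o2", "n3"], 2, cut))
```

Here n3 is excluded from the cone of o1 because it also drives o2, and n1 is excluded because it drives n3; only n2 can be absorbed along with o1.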
Since the selection of the seed node is so important, we repeat the algorithm for each seed node, and choose the best result.
If there is more than one array available, we map to the first array as described above. Then, we remove the nodes implemented by that array, and repeat the entire algorithm for the second array. This is repeated for each available array.
Extension to Heterogeneous Memory Architectures
The SMAP algorithm was developed assuming a homogeneous memory architecture; that is, one in which each memory array is identical. Since the arrays are packed one at a time, the above algorithm can be applied directly to architectures with different sized memory arrays. The only issue is whether the large or small arrays should be filled first. Experimentally, we have determined that the best results are obtained if we fill all of the large arrays first. The SMAP algorithm is greedy, in that, for each array, the largest portion of logic that can be mapped to the array is selected. Thus, the largest gains are likely to be obtained from the first few arrays that are filled; therefore it makes sense that these first few arrays are the large ones.
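The array-at-a-time, large-arrays-first ordering described above can be sketched as a simple driver loop. The per-array mapping step is passed in as a callback; the `toy_mapper` below is a placeholder with an arbitrary capacity rule and is in no way the SMAP algorithm. All names are assumptions made for illustration.

```python
def fill_arrays(array_bits, nodes, map_one_array):
    """Fill arrays one at a time, largest first.

    `map_one_array(bits, remaining)` must return the subset of `remaining`
    that it packs into an array of `bits` bits. Packed nodes are removed
    before the next array is considered.
    """
    remaining = set(nodes)
    packed_per_array = []
    for bits in sorted(array_bits, reverse=True):    # large arrays first
        packed = set(map_one_array(bits, remaining))
        remaining -= packed
        packed_per_array.append((bits, packed))
    return packed_per_array, remaining


def toy_mapper(bits, remaining):
    """Placeholder: packs one node per 128 bits, lowest-numbered first.
    This capacity rule is arbitrary and only exercises the driver."""
    capacity = bits // 128
    return set(sorted(remaining)[:capacity])


# Two 128-bit arrays and one 2048-bit array, 20 hypothetical nodes.
packed, leftover = fill_arrays([128, 2048, 128], range(20), toy_mapper)
for bits, group in packed:
    print(bits, len(group))
print(sorted(leftover))
```

Whatever remains after the last array is filled corresponds to the logic that must be implemented in logic blocks.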
Results
Homogeneous Architecture Results
We first consider architectures in which all arrays are of the same size (this is the homogeneous case considered in [14]).
Heterogeneous Architecture Results
In this section, we consider architectures which contain two different sizes of memory arrays. Using the terminology of Section 2, each FPGA will have N1 arrays of B1 bits each and N2 arrays of B2 bits each. We restrict our attention to architectures with three different ratios of N1 : N2: 1:1, 1:2, and 1:3. (We did not consider array sizes smaller than 128 bits, since such small arrays would not be suitable for implementing the memory parts of circuits and thus would not likely be considered by an FPGA manufacturer.) The packing density at this point is 23% higher than the best packing density obtained for homogeneous architectures.
We repeated the experiments for several values of N1 and N2; selected graphical results are shown in Figure 4. In Figure 4(a), one of each type of array is assumed. In this case, the best architecture is a homogeneous architecture in which both arrays contain 2048 bits. This was the only configuration for which a homogeneous architecture was found to be the best. Figure 4(e) and (f). In both cases, the best architecture was found to consist of 2048 bit arrays and 128 bit arrays (this was the case for all architectures which we investigated, except the N1 = N2 = 1 case as described above).
It is interesting to note that although an FPGA with both 128 bit arrays and 2048 bit arrays was found to be best, in some cases (Figures 4(c) and (e)) the majority of the arrays should contain 2048 bits, while in other cases the majority of the arrays should contain 128 bits (Figures 4(d) and (f)). This can be observed in the graphs by noticing that in Figures 4(c) and (e), the highest point is to the "left" of the center of the graph, while in Figures 4(d) and (f), the highest point is to the "right" of the center of the graph. We have investigated other architectures with an N1 : N2 ratio of 1:2 and 1:3, and have confirmed that, as the total number of arrays increases, the preference for smaller arrays increases. Intuitively, if there are more arrays, the SMAP tool is less able to effectively fill the larger arrays with logic.
A second conclusion that can be drawn from the results in Figure 4 (and confirmed by other experiments we have performed) is that as the total number of arrays increases, the advantage due to heterogeneous architectures (compared to homogeneous architectures) tends to increase. If there are only two arrays, a homogeneous architecture is better, while if there are 12 arrays (Figures 4(d) and (f)), the heterogeneous architecture is considerably better (22% better in each case).
Conclusions
Although embedded arrays in FPGAs were developed in order to implement on-chip storage, it is clear that these arrays can also be configured as ROMs and used to implement logic. In this paper, we have shown that significant density improvements are possible if the FPGA contains a heterogeneous memory architecture, that is, an architecture with more than one size of memory array. The amount of improvement depends on how many memory arrays are present; if there are eight arrays, we have shown that the best heterogeneous architecture can implement logic 23% more efficiently than the best homogeneous architecture.
In virtually all cases, we have found that the best heterogeneous architecture consists of some 2048 bit arrays and some 128 bit arrays. The exact number of each size of array depends on the total number of arrays available; the more arrays that are present, the larger the proportion that should be 128 bits.
We have also shown that the benefits of heterogeneous architectures become more significant as the number of arrays increases. This is a compelling argument for heterogeneous memory architectures. Future architectures are likely to contain more memory than they do now; FPGAs with such large memory capacities would benefit significantly if a heterogeneous architecture were used.
Figure 4 Other Selected Heterogeneous Architecture Results
