FPGAs have the advantage that a single component can be configured post-fabrication to implement almost any computation. However, designing a one-size-fits-all memory architecture causes an inherent mismatch between the needs of the application and the memory sizes and placement on the architecture. Nonetheless, we show that an energy-balanced design for FPGA memory architecture (memory block size(s), memory banking, and spacing between memory banks) can guarantee that the energy is always within a factor of 2 of the optimally-matched architecture. On a combination of the VTR 7 benchmarks and a set of tunable benchmarks, we show that an architecture with internally-banked 8Kb and 256Kb memory blocks has a 31% worst-case energy overhead (8% geomean). In contrast, monolithic 16Kb memories (comparable to 18Kb and 20Kb memories used in commercial FPGAs) have a 147% worst-case energy overhead (24% geomean). Furthermore, on benchmarks where we can tune the parallelism in the implementation to improve energy (FFT, Matrix-Multiply, GMM, Sort, Window Filter), we show that we can reduce the energy overhead by another 13% (25% for the geomean).
INTRODUCTION
Energy consumption is a key design limiter in many of today's systems. Mobile devices must make the most of limited energy storage in batteries. Limits on voltage scaling mean that even wired systems are often limited by power density. Reducing energy per operation can increase the performance delivered within a limited power envelope.
Memory energy can be a significant energy component in computing systems, including FPGAs. This is particularly true when we assess the total cost of memories, including interconnect energy on wire segments that carry data to distant and distributed memory blocks or to off-chip memory.
This leads to an important architectural question: How do we organize memories in FPGAs to minimize the energy required for a computation? We have several choices: What are the sizes of memory blocks? Where (how frequently) are memory blocks placed in the FPGA? How are they activated? How are they decomposed into sub-block banks? Then, there are choices available to the RTL mapping flow: When mapping a logical memory to multiple blocks, should they each get a subset of the data width and be activated simultaneously? Or should they each get a subset of the address range and be activated exclusively? In this paper, we first develop simple analytic relations to reason about these choices (Section 2). After reviewing some background in Section 3, we describe our methodology in Section 4. In Section 5, we perform an empirical, benchmark-based exploration to identify the most energy-efficient organization for memories in FPGAs and quantify the trade-offs between area- and energy-optimized mappings.
Section 6 takes the study from Section 5 one step further. For high-level tasks, we are not stuck with a single memory organization: the choice of parallelism in the design impacts the memory organization needed and, consequently, the total energy for the computation. A highly serial design might build a single processing element (PE) and store data in a single large memory, whereas a more parallel version would use multiple PEs and multiple, smaller memory blocks. The parallel version would then have lower memory energy, but may spend more energy on routing. Consequently, there is additional leverage to improve energy by selecting the appropriate level of parallelism. In Section 6, we explore how this selection allows energy savings and how it further drives the selection of energy-efficient FPGA memory architectures.
Contributions:
• Analytic characterization of the energy overhead that results from mismatches between the logical memory organization needed by a task and the physical mem 
ARCHITECTURE MISMATCH ENERGY
FPGA embedded memories generally improve area- and energy-efficiency [10]. When it perfectly matches the size and organization needed by the application, an FPGA embedded memory can be as energy-efficient as the same memory in a custom ASIC. Nonetheless, the FPGA has a fixed-size memory that is often mismatched with the task, and this mismatch can be a source of energy overhead.
First, let us consider just the memory itself. Memory energy arises almost directly from the energy to charge wire capacitance, which grows as the side length of the memory; that is, in an energy-minimizing layout, a memory block will roughly be square and the length of address lines, bit lines, and output wires grow as the square root of the memory capacity. A memory that is four times as large will require twice the energy. Therefore, when the FPGA memory block ($M_{arch}$) is larger than the application memory ($M_{app}$), there is an energy overhead that arises directly from reading from a memory bank that is too large ($E(M_{arch})/E(M_{app})$).
There is also a mismatch overhead when the memory block is smaller than the application memory. To understand this, we must also consider the routing segments needed to link up the smaller memory banks into a larger memory bank. To build a larger bank, we take a number of memory banks ($\lceil M_{app}/M_{arch} \rceil$) and wire them together, with some additional logic, to behave as the desired application memory block. In modern FPGAs it is common to arrange the memory blocks into periodic columns within the FPGA logic fabric (see Fig. 1). Assuming square memory and logic blocks, the set of smaller memory blocks used to realize the large memory block would roughly be organized into a square of side length $\sqrt{M_{app}/M_{arch}}$, demanding that each address bit and data line connected to the memory cross roughly $d_m \times \sqrt{M_{app}/M_{arch}}$ horizontal interconnect segments to address the memory, where $d_m$ is the distance between memory columns in the FPGA architecture. If $E_{seg}$ is the energy to cross a length-1 segment over a logic island in the FPGA and $E_{mseg}(M_{arch})$ is the energy to cross a length-1 segment over a memory block of capacity $M_{arch}$, the horizontal and vertical routing energy to reach across the memory is:
Since the routing energy of wires comprises most of the energy in a memory read, and since each bit must travel the height of the memory block (bit lines) and the width (output select), per bit, the energy of a memory read is roughly the energy of the wires crossing it:
Therefore:
This gives us the following mismatch ratio, driven by the ratio of the energy for routing between memory banks to the energy for routing over memory banks:
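The relations referenced in this passage can be written out as follows. This is a reconstruction from the definitions above, not a quotation. It assumes that the $\lceil M_{app}/M_{arch} \rceil$ blocks form a square that is $\sqrt{M_{app}/M_{arch}}$ blocks on a side, that each horizontal step crosses $d_m$ logic segments plus one memory segment, that each vertical step stays inside a memory column, and that segment energy scales with the side length of the block it crosses. The symbol $E_{route}$ and the equation numbering are choices made for this sketch.

```latex
\begin{align}
E_{route} &\approx \sqrt{\frac{M_{app}}{M_{arch}}}
  \Bigl( d_m E_{seg} + E_{mseg}(M_{arch}) \Bigr)
  + \sqrt{\frac{M_{app}}{M_{arch}}}\, E_{mseg}(M_{arch}) \tag{1} \\
E(M) &\approx 2\, E_{mseg}(M) \tag{2} \\
E(M_{app}) &\approx 2 \sqrt{\frac{M_{app}}{M_{arch}}}\, E_{mseg}(M_{arch}) \tag{3} \\
\frac{E_{route}}{E(M_{app})} &\approx
  \frac{d_m E_{seg} + 2 E_{mseg}(M_{arch})}{2 E_{mseg}(M_{arch})}
  = 1 + \frac{d_m E_{seg}}{2 E_{mseg}(M_{arch})} \tag{4}
\end{align}
```

Under these assumptions, setting $d_m E_{seg} = E_{mseg}(M_{arch})$ makes the last ratio equal to 1.5, which agrees with the bound the text later quotes for Eq. 4.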
To illustrate the mismatch effects, Fig. 3a shows the result of an experiment where we quantify how the energy compares between various matched and mismatched designs. Each of the curves represents a single-processing-element matrix-multiply design that uses a single memory size; the size of the memory varies with the size of the matrices being multiplied. Each curve shows the energy mismatch ratio (Y-axis) between the energy required on a particular memory block size (X-axis) and the energy required at the energy-minimizing block size (typically the matched size); hence all curves go to 1.0 at one memory block size and increase away from that point. In contrast to the previous paragraph, where we used deliberately simplified approximations to provide intuition, Fig. 3a is based on energy from placed-and-routed designs using tools and models detailed in the following sections; Fig. 3 also makes no a priori assumption about large memory mapping, allowing VTR [15] to place memories to minimize wiring. The figure shows how the energy mismatch ratio grows when the memory block size is larger or smaller than the matched memory block size. In practice, designs typically demand a mix of memory sizes, making it even harder to pick a single size that is good for all the memory needs of an application. Nonetheless, this single-memory-size experiment is useful in understanding how each of the mismatched memories will contribute to the total memory energy overhead in a heterogeneous memory application.
There is also a potential energy overhead due to a mismatch in memory placement. Assuming we accept a column-oriented memory model, this can be stated as a mismatch between the appropriate spacing of memories for the application ($d_{m,app}$) and the spacing provided by the architecture ($d_{m,arch}$). If the memories are too frequent, non-memory routes may become longer due to the need to route over unused memories. If the memories are not placed frequently enough, the logic may need to be spread out, effectively forcing routes to be longer as they run over unused logic clusters. This gives rise to a mismatch ratio:
Note that if we make $d_{m,arch} E_{seg} = E_{mseg}(M_{arch})$, the mismatch ratio due to route mismatch (Eq. 5) is never greater than 2×. Similarly, the mismatch ratio due to memories being too small (Eq. 4) is never greater than 1.5×. We can observe this phenomenon in Fig. 3a by looking at the 32Kb memory size, which never has an overhead greater than 1.2×. In Fig. 3, we also identify the $d_{m,arch}$ that minimizes max-overhead (shown between square brackets for each memory size in Fig. 3). This approximately corresponds to the intuitive explanation above, where the energy for routing across memories is balanced with the energy for routing across logic. For this 32Kb case, we found $d_{m,arch} = 2$ experimentally. Since segment energy is driven by wire length, $d_{m,arch} E_{seg} = E_{mseg}(M_{arch})$ roughly means $d_{m,arch} L_{seg} = L_{mseg}(M_{arch})$; when we populate memories this way, half the FPGA area is in memory blocks. This design point gives us an energy-balanced FPGA that makes no a priori assumptions about the mix of logic and memory in the design. In contrast, today's typical commercial FPGAs could be considered logic-rich, making sure the energy (and area) impact of added memories is small on designs that do not use memories heavily.
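To make the two bounds concrete, the sketch below evaluates simplified overhead expressions as a function of x, the ratio of the energy to route across the logic between memory columns to the energy to route across one memory block. Both expressions are assumptions of this sketch rather than formulas quoted from the paper: the small-memory overhead is taken as 1 + x/2, and the route overhead is taken as the worse of two extremes, a design with no memory (which pays to cross unused memory columns) and a design that is all memory (which pays to cross unused logic).

```python
def small_memory_ratio(x):
    """Assumed overhead when blocks are too small; x = dm*Eseg / Emseg."""
    return 1.0 + x / 2.0


def route_ratio_worst(x):
    """Assumed worst-case route overhead over two extreme designs."""
    logic_only = 1.0 + 1.0 / x   # wasted crossings of unused memory columns
    memory_only = 1.0 + x        # wasted crossings of unused logic columns
    return max(logic_only, memory_only)


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        print(x, small_memory_ratio(x), route_ratio_worst(x))
```

In this model the balanced point x = 1 gives 1.5 and 2.0, and moving x away from 1 in either direction raises the worst-case route overhead.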
While the $d_{m,arch} E_{seg} = E_{mseg}(M_{arch})$ balance can limit the overhead when the memories are too small, we can still have large overhead when the memory blocks are too large ($E(M_{arch})/E(M_{app})$). One way to combat this problem is to use internal banking, or Continuous Hierarchy Memories (CHM) [8]: we can bank the memory blocks internally so that we do not pay for the full cost of a large memory block when we only need a small one. For example, if we cut the memory block into four quarter-sized memory banks, and only use the memory bank closest to the routing fabric when the application only uses one fourth (or less) of the memory capacity, we only pay the memory energy of the smaller memory bank (see Fig. 2). In the extreme, we might recursively decompose the memory by powers of two so that we are never required to use a memory more than twice the size of the memory demanded by the application. There are some overheads for this banking which may suggest stopping short of this extreme. Fig. 3b performs the same experiment as Fig. 3a, except with memory blocks that can be decomposed into one-quarter and one-sixteenth capacity sub-banks. With this optimization, the curves flatten out for larger memory sizes. The physical size with smallest max-overhead is now shifted to 128Kb, still at 1.2×.
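The bank-selection idea can be sketched as follows. The function names are hypothetical, the one-quarter and one-sixteenth fractions follow the Fig. 3b experiment, and the energy ratio assumes the square-root capacity scaling from earlier in this section. The wire energy needed to reach a bank inside the block is deliberately ignored here.

```python
import math


def active_bank_bits(app_bits, block_bits, fractions=(1 / 16, 1 / 4, 1)):
    """Capacity of the smallest sub-bank that still holds the application memory."""
    for fraction in sorted(fractions):
        if app_bits <= block_bits * fraction:
            return block_bits * fraction
    raise ValueError("application memory does not fit in one block")


def read_energy_ratio(app_bits, bank_bits):
    """E(bank) / E(app), assuming energy grows as the square root of capacity."""
    return math.sqrt(bank_bits / app_bits)


if __name__ == "__main__":
    kb = 1024
    block = 128 * kb
    for app in (8 * kb, 16 * kb, 128 * kb):
        bank = active_bank_bits(app, block)
        print(app // kb, bank / kb,
              read_energy_ratio(app, bank),    # with internal banking
              read_energy_ratio(app, block))   # monolithic block
```

For an 8Kb application memory on a 128Kb block, the monolithic read costs 4 times the matched energy in this model, while the one-sixteenth bank is an exact match.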
Another way to reduce the impact of memory block size mismatch is to include memory blocks of multiple sizes in the architecture. This way, the design can use the smallest memory block that will support the application memory. For example, if we had both 1Kb and 64Kb memories, we could map the 2Kb and smaller application memories to the 1Kb memory block and the 4Kb and larger application memories to the 64Kb block and reduce the worst-case overhead to 1.1× (Fig. 3a). However, this raises an even bigger question
[Figure: normalized total energy versus memory block size (Kb), with the memory spacing in square brackets: 16 [2], 32 [2], 64 [3], 128 [3], 256 [7], 512 [7].]
In particular, routes may now need to pass over the memory blocks of the unused size. We can generalize the previous observation about balancing routing over memories and logic to:
However, since there are now three different resource types, in the worst case a route could need 3× the energy of the optimally-matched architecture instead of 2× when there were only two resource types (logic and one memory size). Another point of mismatch between architecture and application is the width of the data read or written from the memory block. Memory energy also scales with the data width. In particular, energizing twice as many bit lines costs roughly twice the energy. While FPGA memory blocks can be configured to supply less data than the maximum width, this is typically implemented by multiplexing the wider data down to smaller data after reading the full width; the same number of bit lines are energized as in the maximum-width case, so these smaller data reads are just as expensive as the maximum-width read, and hence more expensive than they could have been with an optimally-configured memory. While width mismatch is another important source of mismatch, it is beyond the scope of this paper. We stick to a single raw data width of 32 bits throughout our experiments.
Another potential point of mismatch is the simultaneous ports provided by the memories. We assume all memories are dual-ported (2 read/write ports) throughout this paper.
BACKGROUND

FPGA Memory Architecture
We build on the standard Island-Style FPGA model [3] . The basic logic tile is a cluster of K-LUTs with a local crossbar providing connectivity within the cluster (c.f. Xilinx CLB, Altera LAB). These clusters are arranged in a regular mesh and connected by segmented routing channels.
To incorporate memories into this mesh, we follow the model used by VTR [15] , Xilinx, and Altera, where select columns are designated as memory columns rather than logic columns (Fig. 1) . Organizing the memory tiles into a homogeneous column rather than placing them more freely in the mesh allows them the freedom to have a different size than the logic tiles. For example, if the memory block requires more area than the logic cluster, we can make the memory column wider without creating irregularity within rows or columns. Altera uses this column memory model in their Cyclone and Stratix architectures, and the M9K blocks in the Stratix III [12] are roughly 3× the area of the logic clusters (LABs) [20] , while being logically organized in the mesh as a single tile. Large memories can span multiple rows, such as the M144K blocks in the Stratix III, which are 8-rows tall while remaining one logical row wide, accommodated by making the column wider as detailed above.
Within this architectural framework, we can vary the proportion of memory tiles to logic tiles by selecting the fraction of columns that are assigned to memory tiles rather than logic tiles. One way to characterize this is to set the number of logic columns between memory columns, dm. VTR identifies this as a repeat parameter (repeat=dm + 1). For two memory sizes, dm still gives the spacing between memory columns, but we first use h1/h0 memory columns with small memories of height h0, followed by one column of large memories of height h1, so that the area occupied by the small memories is equal to the area occupied by the large ones.
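As a small illustration of the single-memory-size case, the sketch below lays out column types for a given dm using the repeat = dm + 1 relation stated above. The function name and the choice to start the pattern with logic columns are assumptions of this sketch, not a description of VTR's actual placement of the first memory column.

```python
def column_types(num_columns, dm):
    """Column pattern with dm logic columns between memory columns."""
    repeat = dm + 1  # VTR-style repeat parameter
    return ["mem" if (col % repeat) == dm else "logic"
            for col in range(num_columns)]


if __name__ == "__main__":
    layout = column_types(12, 2)
    print(layout)
    # Fraction of columns (not area) assigned to memory: 1 / (dm + 1)
    print(layout.count("mem") / len(layout))
```

Note that this counts columns, not area; a memory column can be wider than a logic column, so the area fraction in memory is generally larger than 1/(dm + 1).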
Energy Modeling and Optimization
Poon [17] developed energy modeling for FPGAs and identified how to size LUTs (4-LUTs), clusters (8-10), and segments (length 1) to minimize energy. However, Poon did not identify an energy-minimizing memory organization. FPGA energy modeling has since been expanded to modern direct-drive architectures and integrated into VTR [6].
Recent work on memory architecture has focused on area optimization rather than energy. Luu examined the area-efficiency of memory packing and concluded that it was valuable to support two different memory block sizes in FPGAs [14]. Lewis showed how to size memories for area optimization in the Stratix V and concluded that a single 20Kb memory was superior to the combination of 9Kb and 144Kb memories in previous Stratix architectures [13], but did not address energy consumption, leaving open the question of whether energy-optimized memory architectures would be different from area-optimized ones.
Memory Energy Modeling
We use CACTI 6.5 [16] to model the physical parameters (area, energy, delay) of memories as a function of capacity, organization, and technology. In addition to modeling capacity and datapath width, CACTI explores internal implementation parameters to perform trade-offs among area, delay, throughput, and power. We use it to supply the mem- We set it to optimize for the energy-delay-squared product. For internal banking (Fig. 2), CACTI gives us the area and energy ($E_{mem}$) of the memory banks, and we compute wire signaling energy ($E_{wires}$) to communicate data and addresses between the referenced memory bank and the memory block I/O. For example, consider the data in Fig. 4 for a 1024×32b (32Kb) internally-banked memory. A monolithic 32Kb memory block is 67µm×113µm, which is high enough to contain the 94µm required for the height of the 768×32 memory of size 58µm×94µm (plus room for extra logic), as shown in Fig. 4. The total width in Fig. 4 is 58 + 31 = 89µm, or 89/67 = 1.33× that of the monolithic 32Kb memory block. We therefore adjust $E_{mseg}(\text{32K-banked}) = 1.33 \times E_{mseg}(\text{32K})$. CACTI directly provides $E_{mem}$. For the largest bank, $E_{wires} = (31\,\mu\text{m})(1 + 10 + 64)\,C_{wire}\,V_{dd}^{2}$, with $C_{wire} = 180$ pF/m and $V_{dd} = 0.95$ V. The factor (1 + 10 + 64) corresponds to one signal for the enable, 10 for the address bits, and 64 for the 32b input and 32b output; 31µm is the distance to reach the large bank. The medium-size bank has a similar $E_{wires}$ equation but with 38µm instead of 31µm, and the small bank has $E_{wires} = 0$. Then, the energy of an internally-banked memory is given by $E_{banked} = E_{mem} + \alpha E_{wires}$, where $\alpha$ is the average activity factor over all signaling wires.
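The wire-energy arithmetic for the largest bank can be checked with a few lines of code. The constants are the ones given in the text; the function names, the example bank energy (0.5 pJ), and the example activity factor (0.11) are placeholders chosen for illustration only and are not CACTI outputs.

```python
C_WIRE = 180e-12   # wire capacitance in F/m (180 pF/m)
V_DD = 0.95        # supply voltage in V


def wire_energy(distance_m, n_signals=1 + 10 + 64):
    """Energy (J) to switch n_signals wires of the given length once."""
    return distance_m * n_signals * C_WIRE * V_DD ** 2


def banked_energy(e_mem, e_wires, alpha):
    """E_banked = E_mem + alpha * E_wires."""
    return e_mem + alpha * e_wires


if __name__ == "__main__":
    e_wires_large = wire_energy(31e-6)
    print(e_wires_large * 1e12, "pJ")             # about 0.378 pJ
    # Placeholder E_mem and alpha, for illustration only.
    print(banked_energy(0.5e-12, e_wires_large, 0.11) * 1e12, "pJ")
```

The full-activity wire energy for the largest bank comes to roughly 0.38 pJ, so after scaling by a realistic activity factor it is a small addition to the bank energy.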
METHODOLOGY
Activity Factor Simulation
Activity factors and static probabilities assigned to the nets of a design have a major impact on the estimated energy. Common ways to estimate activity include assigning a uniform activity to all nets (e.g., 15%), or performing vectorless estimation with tools such as ACE [11] , as done by VTR. For better accuracy, our flow obtains activity factors by simulating the designs. We run a logic simulation on the BLIF output of ABC (pre-vpr.blif file) on a uniformly random input dataset. For example, for the matched 32Kb-memory matrix-multiply design in Fig. 3a , the average simulated activity factor is 11%, whereas ACE estimates it to be 3%, resulting in an energy estimation that is off by ≈ 3.7×. The tunable benchmarks are designed in a streaming way that activates all memories all the time (independent of the random data), except Matrix Multiply (MMul), for which the clock-enable signal is on ≈ 1/3 of the time. The VTR benchmarks do not come with clock-enable for the memories, so we set them to be always on. 
Power-Optimized Memory Mapping
When mapping logical memories onto physical memories, FPGA tools can often choose to optimize for either delay or energy using power-aware memory balancing [19]. For example, when implementing a 2K×32b logical memory using eight 256×32b physical memories, the tool could choose to read W = 4b from each memory (delay-optimized, Fig. 6b). Since each memory internally reads at the full, native width, the cost of the memory operation is multiplied by the number of memory blocks used. Alternatively, it could read W = 32b from only one of the memories (Fig. 6a), in which case only one memory is activated at a time (reducing memory energy), but extra logic and routing overhead is added to select the appropriate memory and data. The power-optimized case often lies between these extremes. For example, as our experiment shows in Fig. 6c, the optimum is to activate 2 memories at once and read W = 16b from each.
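The range of mappings between the two extremes can be enumerated with a short sketch. The function name is hypothetical, and restricting W to power-of-two fractions of the native width is an assumption of this sketch. It only counts how many blocks are activated per access; the selection logic and routing overhead that decide the real optimum are not modeled.

```python
def mapping_options(logical_depth, logical_width, phys_depth, phys_width):
    """Return (W, active_blocks, total_blocks) for each width-sliced mapping."""
    total_bits = logical_depth * logical_width
    block_bits = phys_depth * phys_width
    total_blocks = -(-total_bits // block_bits)  # ceiling division
    options = []
    w = phys_width
    while w >= 1:
        active = logical_width // w  # blocks read simultaneously
        if logical_width % w == 0 and active <= total_blocks:
            options.append((w, active, total_blocks))
        w //= 2
    return options


if __name__ == "__main__":
    for w, active, total in mapping_options(2048, 32, 256, 32):
        # Each active block reads at full native width, so memory read
        # energy scales with the number of active blocks.
        print(f"W={w:2d}b  active={active} of {total}")
```

For the 2K×32b example this lists W = 32, 16, 8, and 4 with 1, 2, 4, and 8 active blocks, which spans the two cases described above.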
Unfortunately, the VTR flow does not perform this kind of trade-off: it always optimizes for delay. Odin decomposes the memories into individual output bits [18] , and the packer packs together these 1-bit slices as much as possible within the memory blocks to achieve the intended width [14] . In fact, VTR memories do not have a clock-enable so they must be activated all the time. Instead, we use VTR architectures with special memory block instantiations that contain a clock-enable, modify VTR's architecture-generation script (arch_gen.py) to support these blocks and to support two memory sizes, and add a p-opt stage before Odin to perform power-optimized memory mapping based on the memories available in the architecture. This includes performing memory sweeps as illustrated in Fig. 6c to select the appropriate mapping for each application memory. The impact of this optimization is shown for the best architectures in Tab. 2. We find that mapping without p-opt adds 4-19% geomean energy overhead for the optimum architectures, comparable to the 6% benefit reported in [19] . Not using p-opt adds 40-108% worst-case energy overhead, suggesting that this optimization is more important for the designs with high memory overhead. Our p-opt code and associated VTR architecture generation script can be found online [7] .
Logic Architecture
The logic architecture uses k4n10 logic blocks (clusters of 10 4-LUTs) and 36×36 embedded multipliers (which can be decomposed into two 18×18 multiplies, or four 9×9 multiplies) with dmpy=10 and the same shape and energy as in VTR's default 22 nm architectures (a height of 4 logic tiles, plus we use Lmpyseg = 4Lseg). The routing architecture uses direct-drive segments of length 1 with Wilton switch-boxes.
Technology
We use Low Power (LP) 22 nm technology [1] for logic evaluation and Low Stand-by Power (LSTP) for memories. We use ITRS parameters for constants such as the unit capacitance of a wire at 22 nm ($C_{wire}$ = 180 pF/m). Then:
$$C_{metal} = C_{wire} \times \text{tile-length} \tag{6}$$
We evaluate interconnect energy based on this $C_{metal}$, instead of the constant one that is provided in the architecture file. This way, the actual size of the low-level components of the given architecture and technology, as well as the computed channel width, are taken into account when evaluating energy. It is important to model this accurately since routing energy dominates total FPGA energy (see Fig. 6c).
Energy and Area of Memory Blocks
VTR assigns one type of block to each column on the FPGA (logic cluster, multiplier, or memory), and can give them different heights, but assumes the same horizontal segment length crossing each column. However, some memories can occupy a much larger area than a logic tile, and laying them out vertically to fit in one logic tile width would be inefficient. For energy efficiency, the memories should be closer to a square shape, and to that end, we allow the horizontal segment length crossing memories to be longer (which costs more routing energy, hence $E_{mseg}(M) \neq E_{seg}$ in Section 2). We fix the height of the memory ($h$) ahead of time, but keep the horizontal memory segment length ($L_{mseg}$) floating:
Here $W_0$ is a typical channel width for the architecture and benchmark set. We use $W_0 = 80$. Then, when VPR finds the exact channel width, $W_{act}$, and hence the tile-length and area ($A_{logic}$), we can adjust $L_{mseg}$ accordingly:
$A_{mem}$ is the area for the memory obtained from CACTI, and $A_{sw}$ is the switch area required to connect the memory to the FPGA interconnect. We obtain $A_{sw}$ from VPR's low-level models, similar to the way it computes $A_{logic} = A_{luts} + A_{sw}$.
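One plausible way to turn these definitions into a segment length is sketched below. It is an assumption of this sketch, not the paper's stated formula: the memory tile is treated as a rectangle whose height is h logic tiles, the logic tile is treated as a square of area A_logic, and the width needed to hold A_mem + A_sw at that height is taken as L_mseg. The function name and the example numbers are hypothetical.

```python
import math


def memory_segment_length(a_mem, a_sw, h, a_logic):
    """Width of a memory tile of height h logic tiles (same length units as sqrt(area))."""
    tile_length = math.sqrt(a_logic)   # square logic tile assumed
    tile_height = h * tile_length      # memory height fixed ahead of time
    return (a_mem + a_sw) / tile_height


if __name__ == "__main__":
    # Illustrative areas in arbitrary units, not values from CACTI or VPR.
    print(memory_segment_length(a_mem=300.0, a_sw=100.0, h=4, a_logic=100.0))
```

Under this reading, a wider routed channel increases both the switch area and the logic tile area, so the memory segment length must be recomputed once the actual channel width is known.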
Benchmarks
To explore the impact of memory architecture, we use the VTR 7 Verilog benchmarks [15] (except LU32PEEng and LU64PEEng, which are similar to LU8PEEng, on which VPR routing did not complete after 10 days) and a set of tunable benchmarks that allow us to change the parallelism level, P, in Section 6. Tab. 1 summarizes the benchmarks that have memories. We expect future FPGA applications to use more memory than the VTR 7 benchmarks. Some of them, such as stereovision, only model the compute part of the application and assume off-chip memory. We expect this memory to move on chip in future FPGAs. The tunable benchmarks provide better coverage of the large memory applications we think will be more typical of future FPGA applications. For this reason, we do not expect a simple average of the benchmarks, such as the geometric mean, to be the most meaningful metric for the design of future FPGAs: it is weighted too heavily by memory-free and memory-poor applications.
We implemented the tunable benchmarks in Bluespec SystemVerilog [4]; they are the following:
GMM: Gaussian Mixture Modeling [5] for an N×N pixel image, with 16b per pixel and M = 5 models. P pixels are computed every cycle. This operation is embarrassingly parallel, since each PE is independent of the other ones.
WFilter: 5×5 Gaussian Window Filter for an N ×N pixel image, with 16b per pixel and power-of-2 coefficients. P = 1/5 corresponds to a single PE that needs 5 main memory reads per pixel (storing the last 24 values read in registers). P = 1 adds line buffers so that only 1 main memory read and 4 line buffer reads translate into 1 pixel per cycle. P = 2 and P = 4 extend the filter's window, share line buffers, and compute 2 and 4 pixels per cycle, respectively. For P > 4, every time P is doubled, the image is divided into two subimages, similar to the GMM benchmark.
MMul: N × N matrix-multiply (A × B = C), with 32b integer values and datapaths (See Fig. 11) .
FFT: N -point 16b fixed-point complex streaming Radix-2 Fast Fourier Transform, with P × log(N/P )-stage FFTs followed by log(P ) recombining stages.
Sort: N -point 32b streaming mergesort [9] , where each datapoint also has a log(N )-bit index. One value is processed per cycle, and the parallelism comes from implementing the last log(P ) stages spatially.
In Section 5, we use P = 1 for each of these benchmarks.
Limit Study and Mismatch Lower Bound
Section 5 shows the energy consumption for different applications and memory architectures (e.g., Figs. 7, 8, 9). In order to identify bounds on the mismatch ratio, we also set up limit study experiments. Our limit study assumes that each benchmark gets exactly the physical memory depth it needs (the width stays at 32), as if the FPGA were an ASIC. Therefore, there is no overhead for using memories that are too large (no need for internal banking as in Section 2) or too small (no need to combine multiple memory blocks as in Section 4.2). We further assume that the limit study memories have the same height as that of a logic tile, making them widely available and keeping the interconnect energy low for vertical memory crossings. Finally, we place memory blocks every 2 columns (dm = 1), so that place-and-route tools can always find a memory right where they need one. To avoid overcharging for unnecessary memory columns, we modify routing energy calculations, and ignore horizontal memory-column crossings for the limit study (Emseg = 0). Some large-memory benchmarks drop slightly
[Figure: horizontal axis is memory block size (Kb) with the memory spacing in square brackets: 16 [2], 32 [2], 64 [3], 128 [3], 256 [7]; the other axis runs from 0 to 400.]
When the large memories are decomposed, they can see benefits similar to internal banking, where memory references to component memory blocks close to the output require less energy than references to the full, application-sized memory bank.
EXPERIMENTS
Memory Block Size Sweep
We start with the simplest memory organization that uses a single memory block size and no internal banking (Fig. 7) , with the dm values from Fig. 3a . For comparison, energy is normalized to the lower-bound obtained using the limit study. Most of the curves have an energy-minimizing memory size between the two extreme ends (1Kb and 256Kb). Benchmarks with little memory have an energy-minimizing point at the smallest memory size (1Kb). Benchmarks with no memory have a close-to-flat curve, paying only to route over memories, but not for reads from large memories. The 16Kb memory architecture minimizes the geometric mean energy overhead of all the benchmarks at 37%. As noted (Section 4.6), the geometric mean is weighted heavily by the many benchmarks with little or no memory, so may not be the ideal optimization target for future FPGA applications. spree and mkPktMerge define the maximum energy overhead curve and suggest that a 4Kb memory minimizes worst-energy overhead at 130% the lower bound. Fig. 8 shows the detailed breakdown of energy components for three benchmarks as a single memory block size is varied, both for the normal case (top row) and for the internallybanked memories (bottom row) described in Section 3.3.
[Fig. 8 panels: MMul128 (P=1), mkSMAdapter4B, GMM128 (P=1).]
We also show a blue line highlighting the optimistic lower bound obtained from the limit study. Most benchmarks have the shape of MMul128, with an energy-minimizing memory size between 1Kb and 256Kb. Small benchmarks with small memories have the shape of mkSMAdapter4B, with large increases in memory energy with increasing physical memory size. Fig. 8 shows how internal banking reduces this effect. Sometimes this allows the minimum energy point to shift. For example, in GMM128 the minimum shifts from 64Kb to 256Kb, reducing total energy at the energy-minimizing block size from 553 pJ to 495 pJ, a reduction of 10%.
Full Parameter Sweep
In Section 2 we showed analytically why the spacing between memory columns, dm, should be chosen to balance logic and memory in order to minimize worst-case energy consumption. For simplicity, we limited Fig. 7 to only use the dm values from our mismatch experiment (Fig. 3a) . Since the optimal values of dm may vary among benchmarks, Fig. 9 shows geomean (a) and worst-case (b) energy overheads when varying both memory block size and dm.
In Tab. 2, we identify the energy-minimizing architectures for each of the four architectural approaches (1 or 2 memory sizes, internal banking or not). We also compare to the lower-bound energy ratios for our closest approximation to the Cyclone (C with dm = 9 and 8Kb blocks) and Stratix (S with dm = 9 and 16Kb blocks) architectures.
The heatmap (Fig. 9) gives a broader picture than the specific energy-minimizing points, showing how energy increases as we move away from the identified points. We see broad ranges of values that achieve near the lowest geometric mean point, with narrower regions, often single points, that minimize the worst-case overhead. The heatmap shows that overhead has a stronger dependence on memory block size than on memory spacing. The commercial designs are appropriately on the broad energy-minimizing valley for geometric mean, but the large spacing, dm, leaves them away from the worst-case energy-minimizing valley. Multiple memories and internal banking both reduce energy, and their combination achieves the lowest energies. Compared to the commercial architectures, we identify points that reduce the worst case by 47% ((2.47-1.31)/2.47) while reducing the geomean by 13%. In all the energy-minimizing cases, the memories are placed more frequently than in the commercial architectures, closer to the balanced point identified analytically in Section 2.
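The two percentages can be reproduced from energy ratios relative to the limit-study lower bound. The worst-case ratios 2.47 and 1.31 are given above; the geomean ratios 1.24 and 1.08 used below are taken from the 24% and 8% geomean overheads quoted in the abstract, and pairing them with this 13% figure is an assumption of this sketch. The function name is hypothetical.

```python
def relative_reduction(baseline_ratio, improved_ratio):
    """Fractional energy reduction when moving from baseline to improved."""
    return (baseline_ratio - improved_ratio) / baseline_ratio


if __name__ == "__main__":
    worst = relative_reduction(2.47, 1.31)   # worst-case energy ratios
    geo = relative_reduction(1.24, 1.08)     # geomean energy ratios
    print(f"worst-case reduction: {worst:.1%}")
    print(f"geomean reduction:    {geo:.1%}")
```

Both results round to the figures quoted in the text, 47% and 13%.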
Area-Energy Trade-off
Since there are valleys with many energy points at or close to the minimum energy, parameter selection merits some attention to area. Furthermore, it is useful to understand how much area we trade off for various energy gains. Fig. 10 shows the energy-area trade-off points when varying memory size and organization (1 vs. 2 memories, internal banking vs. not). To simplify the figure, we only show the Pareto-optimal points of each set. Energy is normalized to the limit study, while area is normalized to the smallest area achieved.
The architectures with two memory sizes are particularly effective at keeping both worst-case area and worst-case energy overhead low. The architecture that minimizes worst-case area overhead is a single 16Kb memory design with dm = 5, requiring 80% more energy than the design that minimizes worst-case energy (which requires 33% more area). The designs that minimize geomean form a tight cluster spanning 28% area and 16% energy. Overall, the Stratix and Cyclone architectures fit into this geomean area- and energy-minimizing cluster, but are far from the Pareto-optimal values in the worst-case energy-area graph. This suggests the commercial architectures are well optimized for the logic-rich design mix captured in the VTR benchmark set. However, as FPGAs see more computing tasks with greater memory use and a larger range of logic-memory balance, our results suggest there are architectural options that provide tighter guarantees of low energy and area overhead.
Sensitivity
The best memory sizes and the magnitude of benefits achievable are sensitive to the relative cost of memory energy compared to interconnect energy. PowerPlay [2] estimates that the Altera memories cost about 3× the energy of the energy-delay-squared-optimized memories that CACTI predicts are possible, perhaps because the Altera memories are optimized for delay and robustness rather than energy. It is therefore useful to understand how this difference might change the selection of architecture. We perform a sensitivity analysis where we multiply the energy numbers reported by CACTI by factors of 2×, 3×, and 4×. The results for a single memory block size are shown in Fig. 9c. Without internal banking, the relative overhead cost of using an oversized memory is increased, shifting the energy-minimizing bank size down to 4Kb or 2Kb. At 2× the CACTI energy, the benefit of internal banking is roughly the same at 30%, but drops to 19% by 4×.
[Figure: energy-overhead heatmaps; axes: Mem Size (Kb), 1 to 256, and dm, 1 to 11.]
PARALLELISM TUNING
For many designs, we can choose either to serialize the computation on a single processing element (PE), requiring a large memory for the PE, or to parallelize the computation with many PEs, each with smaller memories. For the Stratix IV with two memory levels (9Kb and 144Kb), we previously showed [8] that parallel designs with many PEs improved the energy-efficiency over sequential designs. In this section, we ask: what is the optimal memory architecture, and minimum energy achievable, when we can vary the parallelism in the design to find the energy-minimizing configuration for each memory architecture?
Issues
When we increase the number of PEs, we can reduce the size of the dataset that must be processed by each PE, decomposing the memory needed by the application into multiple, smaller memories (smaller Mapp) and thus lowering the energy per memory access. Specifically, doubling the number of PEs often halves Mapp and hence reduces memory energy by a factor of √2. For most designs, increasing the number of PEs also increases the data that must be routed among PEs and hence increases inter-PE routing energy. As long as the fraction of energy in memory remains larger than the inter-PE routing energy, increasing the number of PEs results in a net reduction in total energy.
For example, Fig. 11 shows the shape of an N × N by N × N matrix-multiply A × B = C for different parallelism levels P (N = 4 is shown). The computation is decomposed by columns, with each PE performing the computation for N/P columns of the matrix. The B data is streamed in first and stored in P memories of size N²/P, then A is streamed in row-major order. Each A datapoint (A[i, k]) is stored in a register, data for each column (j) is read from each B memory, a multiply-accumulate is computed, and the result is stored in a C memory of size N/P. Once all the A datapoints of a row have been processed, the results of the multiply-accumulates can be streamed out, and the C memories can be reused for the next row. When P = N, C does not need memories. Either way, increasing P keeps the total number of multiply-accumulates and memory operations constant. However, since the memories are organized in smaller banks, each memory access now costs less, and energy is reduced, as long as the interconnect-per-PE does not increase too much.
Fig. 12 shows how energy-efficiency changes with PE count for three tunable benchmarks. It shows how energy is reduced with additional parallelism up to an energy-minimizing point.
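The column decomposition of the matrix-multiply can be sketched in software to check the two claims made above: the number of multiply-accumulates and B-memory reads stays constant as P changes, while each B memory shrinks to N²/P words. This is only a behavioral model, not the hardware design. The function names are illustrative, the sketch assumes P divides N, and the energy function assumes that energy per access scales with the square root of the memory size, which is the scaling implied by the √2 statement.

```python
import math


def matmul_column_decomposed(A, B, P):
    """Compute C = A x B with the columns of B split across P PEs.

    Returns (C, stats); stats counts work summed over all PEs.
    """
    N = len(A)
    assert N % P == 0, "this sketch assumes P divides N"
    cols = N // P  # columns handled by each PE

    # B is streamed in first: each PE stores N rows of its N/P columns,
    # i.e. N * (N/P) = N^2/P words per PE.
    b_mem = [[[B[k][pe * cols + c] for c in range(cols)] for k in range(N)]
             for pe in range(P)]

    C = [[0] * N for _ in range(N)]
    macs = 0
    b_reads = 0
    for i in range(N):  # A is streamed in row-major order
        # One C memory of N/P words per PE, reused for every row of A.
        c_mem = [[0] * cols for _ in range(P)]
        for k in range(N):
            a_reg = A[i][k]  # A[i, k] is held in a register
            for pe in range(P):
                for c in range(cols):
                    c_mem[pe][c] += a_reg * b_mem[pe][k][c]
                    macs += 1
                    b_reads += 1
        for pe in range(P):  # stream out the finished row of C
            for c in range(cols):
                C[i][pe * cols + c] = c_mem[pe][c]

    stats = {"macs": macs, "b_reads": b_reads,
             "b_words_per_pe": N * cols, "c_words_per_pe": cols}
    return C, stats


def relative_b_memory_energy(N, P):
    """Toy model (assumption): energy per access grows as sqrt(memory size)."""
    accesses = N ** 3  # one B read per multiply-accumulate
    return accesses * math.sqrt(N * N / P)


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
for P in (1, 2, 4):
    _, stats = matmul_column_decomposed(A, B, P)
    print(P, stats, relative_b_memory_energy(4, P))
```

Under this toy model, the B-memory energy drops by a factor of √2 each time P doubles. Inter-PE routing energy is not modeled, so the sketch shows only the memory side of the trade-off.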
Experiments
Once the memory architecture is fixed (with a given memory block size), being able to tune the parallelism level of the benchmarks as in Section 6.1 allows us to reduce potential mismatches between the application's memory requirements and the memory architecture. For example, consider Fig. 13, showing a sweep of both memory block size and parallelism for MMul128. The curves are normalized to the same limit study point as in Section 5 (P = 1), hence the curves can go below 0% overhead: the more parallel versions are different designs, and they can be more energy-efficient.
Figure 13: Energy versus P for MMul128 and varying mem sizes (int-bank) (normalized to the limit study for P = 1)
For example, within the space of internally-banked, 1-memory-size architectures, Section 5 concluded that a memory size of 32Kb minimizes energy. In the case of MMul128, this gave an overhead of 10%. However, tuning the benchmark to P = 8 brings it down to -15%, a 23% reduction. This shows that parallelism can be a powerful optimization to reduce energy even when we do not have control over the memory architecture. In fact, MMul128 has a point at -17% overhead (P = 4 and a 128Kb memory), suggesting that the ability to tune P may shift the optimum architecture.
Fig. 14 shows the effect of tuning to optimum P for different memory sizes and dm values. Due to space constraints, we show only the 2-memory, internally banked designs, which contain the lowest energy points. We observe three effects of parallelism tuning, all anticipated by the previous sections: it (1) reduces the absolute energies, and hence overheads, achievable; (2) shifts the energy-minimizing parameter selections to smaller memories (e.g., 1Kb+64Kb vs. 8Kb+256Kb for worst-case); and (3) creates broader near-minimum-energy valleys, making energy overhead less sensitive to the selection of memory block size.
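The MMul128 numbers quoted above can be used to illustrate how tuning P is applied to a sweep: for a fixed memory size we pick the P with the lowest overhead, and across the whole sweep we pick the best (memory size, P) pair. The sketch below contains only the three data points stated in the text, not the full sweep of Fig. 13, and the function names are illustrative.

```python
# (memory size in Kb, P) -> energy overhead in percent, relative to the
# P = 1 limit study. Only the three MMul128 points quoted in the text.
overhead_pct = {
    (32, 1): 10.0,
    (32, 8): -15.0,
    (128, 4): -17.0,
}


def best_p_for(mem_kb, sweep):
    """Return (P, overhead) with the lowest overhead for a fixed memory size."""
    candidates = {p: o for (m, p), o in sweep.items() if m == mem_kb}
    best_p = min(candidates, key=candidates.get)
    return best_p, candidates[best_p]


def best_overall(sweep):
    """Return (memory size, P, overhead) with the lowest overhead in the sweep."""
    mem_kb, p = min(sweep, key=sweep.get)
    return mem_kb, p, sweep[(mem_kb, p)]


def energy_reduction(before_pct, after_pct):
    """Fractional energy saved when the overhead moves from before to after."""
    before = 1.0 + before_pct / 100.0
    after = 1.0 + after_pct / 100.0
    return (before - after) / before


p, tuned = best_p_for(32, overhead_pct)
print(p, tuned)                                                  # prints 8 -15.0
print(f"{energy_reduction(overhead_pct[(32, 1)], tuned):.0%}")   # prints 23%
print(best_overall(overhead_pct))                                # prints (128, 4, -17.0)
```

The 23% figure is the ratio (1.10 - 0.85) / 1.10, that is, the saving relative to the untuned design rather than the 25-point difference in overhead.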
CONCLUSIONS
We have shown how to size and place embedded memory blocks to guarantee that energy is within a factor of two of the optimal organization for the application. We focused on energy-balanced FPGA design points, which may be different from the logic-rich design points for current commercial architectures. On the benchmark set, we have seen that a two-memory design with 8Kb and 256Kb banks with internal banking and dm = 4 keeps the worst-case mismatch energy overhead below 31% compared to an optimistic limit-study lower bound. Internal banking provided 19% of the energy savings. The energy-balanced memory architecture that minimizes energy differs from the logic-rich, area-minimizing points: we are driven to multiple memory sizes (8Kb and 256Kb vs. a single 16Kb) and more frequent memories (dm = 4 vs. dm = 5-9), spending 33% more worst-case area (28% more geomean area on logic-rich benchmarks) to avoid the 80% higher worst-case energy (16% higher geomean energy) of the area-minimizing points. Finally, tuning parallelism in the application can reshape the memory use, reducing the energy overhead by avoiding memory size mismatches. Joint optimization further reduces the worst-case energy overhead by 13%, and the geomean by 25%.
[Figure panel labels: Geomean (% overhead); Worst-case (% overhead); base case: P = 1.]
