As the logic capacity of Field-Programmable Gate Arrays (FPGAs) increases, they are being increasingly used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components, called bit-slices, it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA logic block architecture, called a multi-bit logic block, which employs configuration memory sharing to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA logic blocks, the multi-bit logic blocks can achieve 18% to 26% logic block area reduction for implementing datapath circuits, which represents an overall FPGA area saving of 5% to 13%. A packing algorithm for the multi-bit logic block architecture is also proposed in this paper, and it is used to empirically find the best values for several important architectural parameters of the new architecture, including the most area-efficient granularity values and the most area-efficient amount of configuration memory sharing.
Introduction
Field-Programmable Gate Arrays (FPGAs) that process multiple bits of data at a time represent a new architectural approach for implementing datapath circuits on reconfigurable hardware that can significantly reduce the amount of programming information required to configure an FPGA. The main benefit of this reduction in programming information is the subsequent reduction of configuration memory bits, which can lead to increases in logic density. In these devices, called multi-bit FPGAs, the detailed implementation often consists of multiple-bit wide logic blocks and routing resources that take advantage of datapath regularity by sharing a single set of configuration memory across multiple sets of programmable resources. This sharing results in a denser FPGA that is especially efficient at implementing large arithmetic-intensive datapath circuits, including computer graphics, multimedia, digital signal processing, and Internet routing applications.
Several multi-bit FPGA architectures have been proposed in the past [1]-[12] with a wide range of logic block designs. In this work, we focus on the study of logic cluster-based multi-bit logic blocks. In particular, we propose a specific logic block architecture along with its packing algorithm (packing is the step in the CAD flow that chooses which logic elements to group together in a cluster). The area efficiency of the proposed logic block architecture is then empirically evaluated. Logic cluster-based logic blocks are chosen primarily because logic clusters are the building blocks of many state-of-the-art commercial FPGAs (including the Altera Flex, Stratix, and Cyclone series [17] and the Xilinx 5200, Virtex, and Spartan families [18] of FPGAs), and with their ever-increasing logic capacity, commercial FPGAs are being increasingly used to implement large datapath-intensive applications.
For multi-bit FPGAs, it is essential to have a set of automated design tools in order to make effective use of their multi-bit architectures. As a result, a set of datapath-oriented CAD tools, including synthesis, packing, placement, and routing tools, has been developed at the University of Toronto; in this paper, we focus on the particular problem of automated packing. Packing for multi-bit FPGAs is more difficult than classical packing [14] [15] [16] because, to effectively utilize configuration memory sharing, the packer has to preserve the regularity of datapath circuits on top of the conventional packing objectives of achieving the smallest possible implementation area and the shortest possible critical path delay.
To investigate the area efficiency of the logic block architecture, we experimentally determine the best granularity values and the best amount of configuration memory sharing for the proposed logic blocks.
Extensive research [19] [20] [15] [21] has been conducted in the past in order to determine the best sizes and structures for conventional FPGA logic blocks. These studies have shown the importance of logic block architecture to the overall area efficiency of FPGAs. None of the studies, however, considers the problem of configuration memory sharing, which requires the preservation of datapath regularity (all these studies use conventional synthesis and packing algorithms, which destroy the regularity of datapath circuits and essentially turn datapaths into finite state machine-like netlists of "randomly" connected logic gates). In this study, the datapath-oriented synthesis algorithm of [22] is used, which preserves a great amount of user-specified regularity. The preserved regularity, in turn, is used by the packing algorithm to investigate the area efficiency of the proposed logic blocks.
The rest of this paper is organized as follows: Section 2 presents the multi-bit logic block architecture; Section 3 describes the packing algorithm; Section 4 presents the experimental results on the area efficiency of the proposed logic blocks; and Section 5 gives concluding remarks.
The Multi-Bit Logic Block Architecture
The basic building blocks of the multi-bit logic blocks are logic clusters, which were first introduced in [20] as a generalized form of the logic array blocks used in the Altera FLEX8K and FLEX10K series of FPGAs. As shown in Figure 1, each logic cluster is constructed out of Basic Logic Elements (BLEs), each of which contains a 4-input Look-Up Table (LUT).
Table 1 lists the total active area consumed by logic clusters of various sizes and the total area consumed by the SRAM bits in columns 4 and 5, respectively. The SRAM area as a percentage of the total cluster area is shown in column 6. As shown, unlike the SRAM count, the total SRAM area as a percentage of the total cluster area decreases nearly monotonically with increasing N.
For small cluster sizes, the SRAM area accounts for nearly 50% of the total cluster area; for extremely large cluster sizes, on the other hand, the SRAM cells consume less than 10% of the total cluster area. Most importantly, however, for the cluster sizes of 4 to 10, which were determined to be the most efficient cluster sizes by previous studies [20], the SRAM cells consume a substantial amount (between 39% and 48%) of the total cluster area. (Note that for the active area calculations, all transistors in a logic cluster are properly sized using the methodology outlined in [23].)
The large amount of area consumed by the SRAM cells motivates the multi-bit logic block design, which shares the configuration memory across the logic clusters. Figure 2 shows the structure of a multi-bit logic block. Here, each logic block contains M logic clusters, where M is called the granularity of the logic block.
Note that each cluster is designed to implement a single bit-slice of a datapath circuit and the clusters from a single logic block are used to implement adjacent bit-slices.
As shown in Figure 2, the configuration memory is shared among M corresponding resources from distinct logic clusters. It is assumed that when the configuration memory of a BLE is shared, the configuration memory of all of its input multiplexers must also be shared. It is also assumed that not all BLEs in a logic cluster must be controlled by shared configuration memory; the degree of configuration memory sharing, N,, is defined to be the actual number of BLEs in each logic cluster that are controlled by shared configuration memory. Table 2 shows the average active area per logic cluster for a cluster size (N) of 4 and a cluster input count (I) of 10.
Although multi-bit logic blocks can consume much less area per cluster due to configuration memory sharing, they might also have a lower rate of utilization if they are used to implement irregular circuits. The rest of this paper proposes an automated packing algorithm that preserves as much datapath regularity as possible; the algorithm is then used to investigate the appropriate granularity values and degrees of configuration memory sharing for multi-bit logic blocks.
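To make the effect of sharing concrete, the following is a minimal sketch of how the average number of BLE-related configuration bits per cluster could be counted under the sharing scheme described above. The function name and the `bits_per_ble` figure are illustrative assumptions, not values taken from Table 1 or Table 2, and the model deliberately ignores transistor sizing and any configuration memory outside the BLEs and their input multiplexers.

```python
def config_bits_per_cluster(bits_per_ble, n_bles, n_shared, granularity):
    """Average BLE configuration bits per cluster in a multi-bit logic block.

    bits_per_ble: bits configuring one BLE and its input multiplexers
                  (hypothetical figure supplied by the caller).
    n_bles:       BLEs per cluster (N in the text).
    n_shared:     BLEs per cluster controlled by shared memory
                  (the degree of configuration memory sharing).
    granularity:  clusters per logic block (M in the text).
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if not 0 <= n_shared <= n_bles:
        raise ValueError("n_shared must lie between 0 and n_bles")
    # Unshared BLEs each keep a private copy of their configuration bits.
    private_bits = (n_bles - n_shared) * bits_per_ble
    # One shared copy serves the corresponding BLE in all M clusters,
    # so each cluster is charged 1/M of it.
    shared_bits = n_shared * bits_per_ble / granularity
    return private_bits + shared_bits


# Illustrative numbers only: 30 bits per BLE is an assumption.
print(config_bits_per_cluster(30, n_bles=4, n_shared=0, granularity=4))
print(config_bits_per_cluster(30, n_bles=4, n_shared=3, granularity=4))
```

The sketch only captures the memory count. Whether the saving is realized in practice depends on cluster utilization, which is what the packing experiments measure.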
Packing for Multi-Bit Logic Blocks
The overall flow of the packing algorithm consists of two major steps. In step 1, initialization, the algorithm adjusts the granularity of a graph that represents the input datapath circuit and performs timing analysis. In step 2, packing, the algorithm groups nodes of the graph into multi-bit logic blocks. The graph and the two packing steps are described in turn.
Datapath Circuit Representation
Since the primary purpose of the packing algorithm is to preserve datapath regularity, an appropriate format for specifying datapath regularity must be used. In the coarse-grain node graph used here, each node contains one or more BLEs. A node containing only one BLE is called a fine-grain node; it represents a BLE that does not belong to any datapath. A node containing more than one BLE, on the other hand, is called a coarse-grain node; each BLE in the coarse-grain node is from a unique bit-slice of a datapath circuit.
An example of the coarse-grain node graph is shown in Figure 3, which represents the datapath circuit shown in Figure 4. The graph consists of 11 interconnected nodes representing the 25 BLEs in the circuit. Nodes A through F are 3-bit wide coarse-grain nodes; along with the 2-bit wide nodes E' and F', they represent the eight bit-slices of the datapath. Nodes G, H, and I, on the other hand, are fine-grain nodes, which represent BLEs with the corresponding labels in the irregular logic part of the circuit.
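The node and BLE counts in this example can be checked with a short sketch. It assumes the two 2-bit nodes are the ones labeled E' and F', and it stores only the width of each node; the dictionary layout and the helper name are illustrative, not the paper's data structure.

```python
# Width (number of BLEs) of each node in the example graph.
node_widths = {name: 3 for name in "ABCDEF"}      # 3-bit coarse-grain nodes
node_widths.update({"E'": 2, "F'": 2})            # 2-bit coarse-grain nodes
node_widths.update({name: 1 for name in "GHI"})   # fine-grain nodes


def is_fine_grain(width):
    """A fine-grain node holds exactly one BLE."""
    return width == 1


num_nodes = len(node_widths)
num_bles = sum(node_widths.values())
fine_grain = sorted(n for n, w in node_widths.items() if is_fine_grain(w))

print(num_nodes, num_bles, fine_grain)  # 11 25 ['G', 'H', 'I']
```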
Step 1: Initialization
The initialization step consists of two sub-steps. First, each coarse-grain node whose granularity value is greater than the granularity value of the target architecture (M) is transformed into a set of nodes. Each node in the set has a granularity value that is smaller than or equal to the granularity of the multi-bit logic blocks. In particular, given a coarse-grain node that is more than M bits wide, starting at the most significant bit of the node, the packing algorithm repeatedly groups M neighboring BLEs into new coarse-grain nodes. If there are fewer than M BLEs remaining at the least significant end, these remaining BLEs are grouped by themselves into a node that is less than M bits wide. These newly formed nodes are then used to substitute the original node in the coarse-grain node graph.
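The node-splitting sub-step can be sketched as follows. This is a minimal illustration that assumes a node is simply a list of BLE names ordered from the most significant bit to the least significant bit; the function name is hypothetical, and rewiring the graph edges after the split is omitted.

```python
def split_node(bles_msb_first, m):
    """Split a coarse-grain node into nodes at most m bits wide.

    bles_msb_first: BLEs of the node, most significant bit first.
    m:              granularity of the target architecture (M).
    Grouping starts at the most significant end, so any narrower
    leftover node ends up at the least significant end.
    """
    if m < 1:
        raise ValueError("granularity must be at least 1")
    return [bles_msb_first[i:i + m] for i in range(0, len(bles_msb_first), m)]


# An 8-bit wide node packed for a target granularity of 3.
node = [f"b{bit}" for bit in range(7, -1, -1)]   # b7 (MSB) ... b0 (LSB)
print(split_node(node, 3))
# [['b7', 'b6', 'b5'], ['b4', 'b3', 'b2'], ['b1', 'b0']]
```

Nodes that are already no wider than M come back as a single unchanged group, so the same call can be applied to every node in the graph.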
Timing analysis is then performed on the input circuit. During timing analysis, the propagation delay and the expected arrival time of each BLE input or output pin are calculated. The slack of each net is then derived from the delay and the expected arrival time. Finally, the criticality value [15] of each net is calculated from its slack and max_slack, where max_slack is the maximum slack of the input circuit.
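As an illustration of the last sub-step, the sketch below computes net criticalities from slacks. It assumes the definition commonly used in timing-driven packing, criticality = 1 - slack / max_slack, so that a net with zero slack has criticality 1 and the net with the largest slack has criticality 0. This definition is an assumption here and may differ in detail from the formula used in this work; the function name and the dictionary-based interface are also illustrative.

```python
def net_criticalities(slacks):
    """Map each net to a criticality in [0, 1] from its slack.

    slacks: dict of net name -> non-negative slack.
    Assumed definition: criticality = 1 - slack / max_slack.
    """
    if not slacks:
        return {}
    max_slack = max(slacks.values())
    if max_slack <= 0:
        # Every net has zero slack, so every net is fully critical.
        return {net: 1.0 for net in slacks}
    return {net: 1.0 - slack / max_slack for net, slack in slacks.items()}


print(net_criticalities({"n1": 0.0, "n2": 2.0, "n3": 4.0}))
# {'n1': 1.0, 'n2': 0.5, 'n3': 0.0}
```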
Step 2: Packing
During step 2, new multi-bit logic blocks are created one at a time and each logic block is filled with nodes from the coarse-grain node graph. Nodes are added to a logic block in a predetermined order of BLE positions, where the ith BLE in the jth cluster is denoted by position (i, j). It is also assumed that each multi-bit logic block contains a carry network such as the one shown in Figure 6.
Because of the carry network, not all BLE positions in a cluster are logically equivalent. This lack of equivalency is the reason why the packing algorithm must select nodes for each specific position in a logic block. An example is shown in Figure 6. Here there are three BLEs, A, B, and C, in a logic block. These BLEs are connected by a carry chain through the carry network. In the figure, the BLE position (1, 1) is equivalent to the BLE position (3, 1); therefore, BLE A can be moved to position (3, 1) provided that BLEs B and C are also moved to positions (3, 2) and (3, 3), respectively. However, BLE A cannot be moved to position (2, 1) or (4, 1), since these two positions are not equivalent to BLE position (1, 1) due to the difference in their carry connections.
The remainder of this section describes the two criticality functions, including the seed criticality function and the attraction criticality function, which are used in the packing process.
Seed Criticality
The first node added to a logic block is called a seed. It is selected using a metric called the seed criticality. Adding a seed to an empty logic block by itself does not necessarily improve the performance of a circuit; however, when a subsequent node, A, is added to the same logic block, many two-terminal connections that connect the seed node and node A can then be implemented in the local routing networks or the carry network of the logic block, which are inherently much faster than global routing. Consequently, the performance of the circuit is improved. The seed criticality measures the maximum possible performance improvement, and each two-terminal connection that can be implemented inside the logic block is called a potential local connection.
Potential local connections can be identified using a pattern matching process against one of the four topologies shown in Figure 7. Here topologies A and B contain connections that can be implemented in the carry network of the logic block, and topologies C and D contain connections that can be implemented in the local routing networks of the logic clusters.
The seed criticality of a node n is calculated from three terms: max(s(n)), cnt(s(n)), and the distance to source, d(n).
The function max(s(n)) corresponds to the maximum speed improvement achievable by implementing n as a seed node. cnt(s(n)) is a tie breaker; it counts the number of potential local connections that can achieve the maximum speed improvement. Note that max(s(n)) and cnt(s(n)) are analogous to the base seed criticality and the number of paths affected metrics used in [15], respectively. These functions, however, are more general in nature and are applicable to a wider range of FPGA clustering architectures than the fully connected topology assumed by [15].
The metric distance to source, d(n), on the other hand, is an unmodified version of the same metric defined in [15]. Nodes with the same max(s(n)) values usually are connected together by a single critical path. d(n) measures the order of these nodes along the critical path. Everything else being equal, the node that is the furthest from the source of the critical path is given the highest priority for implementation as a seed node.
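The priority order described above can be sketched as a lexicographic comparison. This is only one way to realize the tie-breaking behavior: how the three terms are actually combined into a single seed criticality value is not reproduced here, and the class, its field names, and the sample values are all hypothetical.

```python
from typing import NamedTuple, Sequence


class SeedCandidate(NamedTuple):
    name: str
    max_gain: float    # stands in for max(s(n))
    tie_count: int     # stands in for cnt(s(n))
    distance: int      # stands in for d(n), distance to source


def select_seed(candidates: Sequence[SeedCandidate]) -> SeedCandidate:
    """Pick the seed with the largest (max_gain, tie_count, distance).

    Tuples compare element by element, so tie_count only matters when
    max_gain is equal, and distance only when both are equal.
    """
    if not candidates:
        raise ValueError("no candidate nodes left to pack")
    return max(candidates, key=lambda c: (c.max_gain, c.tie_count, c.distance))


candidates = [
    SeedCandidate("n1", max_gain=0.9, tie_count=2, distance=1),
    SeedCandidate("n2", max_gain=0.9, tie_count=2, distance=3),
    SeedCandidate("n3", max_gain=0.8, tie_count=5, distance=9),
]
print(select_seed(candidates).name)  # n2
```

Here n1 and n2 tie on the first two terms, so the node further from the source of the critical path wins, while n3 loses on the first term despite its larger tie-breaker values.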
Attraction Criticality
Once a seed is added to a logic block, the logic block is then filled based on the attraction criticality metric. Here, each node in the coarse-grain node graph is assigned an attraction criticality value according to Equation 6. The metric consists of four parts: the base attraction criticality accounts for the performance improvement of implementing the node in the logic block; the shared I/O count accounts for the number of additional cluster I/Os that are needed to implement the node; and finally the secondary attraction criticality and the common I/O count account for the closeness of the placement resulting from adding the node to the logic block. These four parts are weighted and summed into the attraction criticality. Each part is described in turn.
Base Attraction Criticality
As shown in Figure 8, for logic blocks containing at least one node, the connections between the node and the logic block can be classified into two types. The first type consists of connections that can be implemented in the local routing networks of the clusters or the carry network that connects the clusters together. The second type consists of connections that have to be routed through global routing. The implementation of the first type of connections often results in increased performance, and this increase is measured by the base attraction criticality. It is equal to the maximum criticality among all type-one connections in addition to all the internal connections of the node that can be implemented in the carry network.
Secondary Attraction Criticality
The secondary attraction criticality is defined over the type-two connections in addition to all internal connections of the node that must be routed through the global routing network.
Shared I/O Count
Since cluster inputs are limited routing resources, it is important to minimize their usage when adding nodes to logic blocks. As in [15], it is preferable to choose BLEs with the following three types of I/Os for a cluster:
1. a BLE input that is connected to the same net as one of the cluster inputs
2. a BLE input that is connected to one of the cluster outputs
3. a BLE output that is connected to a cluster input
The shared I/O count metric measures the I/O commonalities between a node and a logic block. It is equal to the total number of the three types of BLE I/Os in a node when each BLE is matched with its corresponding cluster. Note that, in Equation 6, P_max is defined to be the maximum possible value of the shared I/O count metric. It is used in the equation to normalize the shared I/O count to a value between 0 and 1.
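The counting rule above can be sketched as follows. The `Ble` and `Cluster` classes and their fields are hypothetical stand-ins for whatever netlist representation a packer uses, and the maximum value `p_max` is taken as a caller-supplied parameter because its exact definition is not spelled out here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ble:
    inputs: tuple   # names of the nets driving the BLE inputs
    output: str     # name of the net driven by the BLE output


@dataclass
class Cluster:
    input_nets: set = field(default_factory=set)    # nets on cluster inputs
    output_nets: set = field(default_factory=set)   # nets on cluster outputs


def shared_io_count(node_bles, clusters):
    """Count the three preferred I/O types, matching BLE k with cluster k."""
    total = 0
    for ble, cluster in zip(node_bles, clusters):
        for net in ble.inputs:
            # Type 1: the input shares a net with a cluster input.
            # Type 2: the input is driven by a cluster output.
            if net in cluster.input_nets or net in cluster.output_nets:
                total += 1
        # Type 3: the BLE output drives a net already on a cluster input.
        if ble.output in cluster.input_nets:
            total += 1
    return total


node = [Ble(inputs=("a", "b", "x"), output="y"), Ble(inputs=("c",), output="z")]
block = [Cluster({"a", "y"}, {"x"}), Cluster({"d"}, set())]
count = shared_io_count(node, block)
p_max = 8  # assumed upper bound, used only to show the normalization
print(count, count / p_max)  # 3 0.375
```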
Experimental Results
The packing algorithm has been used to pack several benchmark circuits into multi-bit logic blocks with various granularity values and degrees of configuration memory sharing. The packing results shown in this section are based on the fifteen datapath circuits from the Pico-Java Processor from Sun Microsystems [24]. Each circuit is first synthesized into several granularity values using a datapath-oriented synthesis algorithm [22], and Table 3 gives the name and size (number of BLEs) of each circuit for a given synthesis granularity value (here the synthesis granularity is defined as the maximum datapath width that is preserved by the synthesis process).
Table 3: Benchmark circuit names and sizes in BLEs for each synthesis granularity (surviving entries: dm_dpath, 958; ex_dpath, 2823).
The granularity values of the target architecture, M, are the same as the ones shown in Table 3, namely 1, 2, 4, 8, 12, and 16; and for each value of M, the degree of configuration memory sharing, N,, is also varied from 0 to 4. Note that each cluster is assumed to contain 10 (I) input pins and 4 (N) BLEs. The experimental results on regularity, cluster count, and area are presented in turn.
Regularity Results
Two yardsticks are used to measure the amount of regularity contained in the benchmark circuits based on the concept of a datapath component. Here, a datapath component is defined to be a group of identically con- 
Logic Cluster Count
As discussed in Section 2, the area savings of configuration memory sharing depend on two parameters: the cluster size and the cluster utilization. The cluster utilization can be easily measured by counting the total number of clusters required to implement the fifteen benchmark circuits; this cluster count is shown in Figure 10. In the figure, the granularity value is shown on the x-axis and the total number of clusters required to implement the fifteen benchmark circuits is shown on the y-axis. There are five lines in the figure, each representing one of the five possible degrees of configuration memory sharing (0, 1, 2, 3, and 4). As expected, when there is no configuration memory sharing, the cluster count is the lowest for a given granularity value; and as the degree of configuration memory sharing increases, so does the cluster count. More interestingly, concurring with the regularity results, for the granularity values of 2 and 4, the increase in the degree of configuration memory sharing from 0 to 3 only results in small increases in cluster count (less than 5% for M = 2 and 11% for M = 4); and for the granularity value of 2, when N, is increased from 0 to 4, the cluster count is increased by only 8%. For all other granularity values, substantial increases in cluster count are observed.
Area Results
The area consumed by the multi-bit logic blocks is plotted in Figure 11. In the figure, the x-axis represents the granularity of the architecture and the y-axis represents the total logic block area required to implement the fifteen benchmark circuits. The best area is achieved when M = 4 and N, = 3, where the logic block area is 18% smaller than that of conventional logic blocks. Assuming that the total logic block area accounts for 30% to 50% of the total FPGA area, this logic block area saving represents an overall area saving of 5% to 9%. Finally, Figure 12 shows that the area savings also depend on the size of the SRAM cells. In the figure, it is assumed that each SRAM cell is 1.5 times the standard size (larger SRAM cell sizes can be used to improve fault tolerance). This increase in SRAM size results in larger area savings. The best area is achieved when M = 4 and N, = 4, and the area saving is 26%, which represents a total FPGA area saving of 8% (assuming 30% of FPGA area is logic block area) to 13% (assuming 50% of FPGA area is logic block area).
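The conversion from logic block area savings to overall FPGA area savings is a simple product, sketched below for the 18% and 26% figures reported in this paper. The function name is illustrative, and the calculation assumes that the area outside the logic blocks (for example, routing) is unchanged by configuration memory sharing.

```python
def overall_fpga_saving(logic_block_saving, logic_area_fraction):
    """Overall FPGA area saving when only the logic block area shrinks.

    logic_block_saving:  fractional reduction in logic block area.
    logic_area_fraction: fraction of total FPGA area taken by logic blocks.
    """
    return logic_block_saving * logic_area_fraction


for saving in (0.18, 0.26):
    low = overall_fpga_saving(saving, 0.30)
    high = overall_fpga_saving(saving, 0.50)
    print(f"{saving:.0%} logic block saving -> {low:.1%} to {high:.1%} overall")
# 18% logic block saving -> 5.4% to 9.0% overall
# 26% logic block saving -> 7.8% to 13.0% overall
```

Rounded to whole percentages, these match the 5% to 9% and 8% to 13% ranges quoted in the text.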
Conclusions
This paper has described a new multi-bit logic block architecture for FPGAs and its associated packing algorithm. Using the packing algorithm, it is empirically shown that, for logic clusters containing 4 BLEs and 10 cluster inputs, the most area-efficient variant of the multi-bit logic block architecture contains four clusters per logic block and has three BLEs per logic cluster that are controlled by shared configuration memory. In this configuration, the multi-bit logic block area is 18% smaller than the conventional FPGA logic block area. This represents a 5% to 9% reduction in the total FPGA area.
Figure 12: Area vs. Granularity (Large SRAM); x-axis: Granularity (M).
