When relying on module generators to implement regular datapaths on FPGAs, the coarse granularity of FPGA cells can lead to area and delay inefficiencies. We present a method to alleviate these problems by compacting adjacent modules using structure extraction, local logic synthesis, and cell replacement. The regular datapath structure is exploited and preserved, achieving faster layouts after shorter tool run-times.
Introduction
Regular datapaths are the core of many CPU and DSP architectures. The application of generator programs to create their constituent modules has a long history in VLSI design ([2], [6], [10], [13], [14], and many others). With growing FPGA die sizes, such datapath architectures are also implementable on FPGAs. However, current module generation techniques for FPGAs ([5], [19], [1]) do not address the area and delay inefficiencies caused by the coarse-grain architecture of FPGAs as compared to semi-custom or gate-array chips. Furthermore, misfeatures of current module generators include limited layout topology options [1] and the inability to regularly place simple non-FPGA-specific logic [19].
This paper presents a method that mitigates these inadequacies: A linear placement of generated modules with regular layouts is compacted without disrupting the efficient structure, regardless of whether the modules are FPGA-specific or simple. The method actively exploits the datapath regularity of horizontal data flow and vertical control flow, and has been implemented in the framework of SDI [12]. SDI consists of a complete suite of tools (a comprehensive library of parametric modules, module generators, a floorplanner, and the compactor) and a strategy for their application to implement an efficient datapath combined with an irregular controller. The tools currently target Xilinx XC4000 FPGAs. However, the general procedure can be applied to all FPGAs with a matrix architecture. This paper describes only the compaction step, which processes just the regular part of the circuit.
Problem Description
A strictly module-based layout consists of a regular (often linear) placement of regularly generated modules. Since a module is always at least one logic block wide, partially utilized blocks waste area and speed. The size of the wasted area and the loss in speed increase with the logic capacity of a single FPGA logic block and the number of modules in the datapath. Figure 1 is an example for such a scenario: The 3-bit datapath contains three regular modules AND2, OR2, and AND2B1, implementing the functionality of a 3-bit wide 2-1 multiplexer. However, even assuming relatively fine-grained logic blocks on the FPGA (e.g., Actel ACT logic modules, Atmel AT6000, or Xilinx XC6200 cells), the function MUX21 can be implemented in a single logic block per bit. Thus, the sample datapath wastes 2/3 of its area and only runs at 1/2 the speed of the single block solution. This situation becomes worse with coarser-grained blocks such as the N-LUTs found, e.g., in Xilinx XC3000/XC4000, AT&T ORCA, and Altera FLEX FPGAs. The compaction process breaks module boundaries in a strictly module-based layout and merges adjacent modules to better utilize the logic blocks.
Fig. 1: Wasted space in a strictly module-based layout
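The 2/3 and 1/2 figures in the example above can be reproduced with a small sketch. The function names, the assignment of the select signal to the gate inputs, and the choice of which AND2B1 input is inverted are assumptions made for illustration; the speed ratio further assumes a unit-delay model in which delay is proportional to the number of logic levels.

```python
from itertools import product

def and2(a, b):
    return a & b

def and2b1(a, b):
    # Assumption: the second input is the inverted one.
    return a & (1 - b)

def or2(a, b):
    return a | b

def mux21_modules(d0, d1, sel):
    # Module-based version: three blocks per bit, two logic levels
    # (the AND gates first, then the OR gate).
    return or2(and2(d1, sel), and2b1(d0, sel))

def mux21_single_block(d0, d1, sel):
    # Merged version: one block per bit, one logic level.
    return d1 if sel else d0

def wasted_fraction(blocks_used, blocks_needed):
    return 1 - blocks_needed / blocks_used

def relative_speed(levels_used, levels_needed):
    # Unit-delay assumption: speed is inversely proportional to levels.
    return levels_needed / levels_used

# The three modules together compute exactly the MUX21 function.
same_function = all(
    mux21_modules(*v) == mux21_single_block(*v)
    for v in product((0, 1), repeat=3)
)
print(same_function, wasted_fraction(3, 1), relative_speed(2, 1))
```

Under these assumptions, three blocks where one suffices gives a wasted fraction of 2/3, and two logic levels where one suffices gives half the speed.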
Fig. 2: Steps of the compaction process
[...] technology mapping to the zone functions (Section 7). Since the placement information provided by the module generators is lost afterwards, the mapped FPGA blocks of the merged module have to be placed again in the context of the original floorplan. The specialized two-phase placement algorithm is timing-driven (Section 8) and takes the regular datapath structure and FPGA-specific routing topologies into account. During the first phase, blocks are placed horizontally, observing the alignment of adjacent zones, and vertical control signals are globally routed (Section 9.1). The second phase assigns row locations to the blocks (Section 9.2). Since vertical placement occurs separately for each zone, it has a smaller problem size and can thus be more detailed, allowing the use of a finer representation of the routing structure of the target FPGA.
Finally, the placed netlist of the sub-datapath is assembled by duplicating and vertically stacking the zones according to the original width requirements.
The result is a new regular module fitting within the initial floorplan, but with reduced area and number of logic levels. Pin assignment and routing still have to be performed using conventional tools. Currently, the PPR program of the Xilinx XACT suite is employed to handle these tasks (Section 10).
The compaction process in Figure 2 will be explained in detail in the next sections.
Definitions
A circuit consists of cells (nodes or ports) that can be placed at $(x, y)$ inside or adjacent to a placement area with height $H$ and width $W$. A datapath D is a sequence of modules describing a linear placement from left to right. An FPGA matrix is composed of a grid of blocks (e.g., XC4000 CLBs).
Selecting Sub-Datapaths for Compaction
Prior to compaction, the floorplanner determines parts of the original datapath to be compacted (top of Figure 2). Although this selection is not part of the compaction operation itself, it significantly influences the quality of the resulting layout. Because the complete datapath might contain modules not amenable to compaction, the floorplanner has to determine the largest sets of suitable modules. Each of these sets is considered a sub-datapath of the whole datapath. The sub-datapaths are then handled independently, allowing the parallel compaction of each set of modules. Figure 4 shows an example: The floorplanner has calculated a linear placement of modules (case a). $H_1$ and $H_2$ are hardmacros, and thus mark the boundaries of the three compactable sub-datapaths $\{M_1\}$, $\{M_2, \ldots, M_5\}$, and $\{M_6, M_7\}$. An even tighter packing might be obtained if the boundaries were ignored (in Figure 4.b, the duplicated function f is removed), but this would risk a degradation of wire lengths α and β over their pre-compaction levels. When the boundaries are respected, area is traded for speed: The compacted modules $M_1$, $M_{2345}$, and $M_{67}$ are larger than $M_{1234567}$, but the wire lengths remain unaffected (Figure 4.c). Since the compactor is primarily performance-oriented, it follows the approach in Figure 4.c.
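The boundary-respecting selection of Figure 4.c can be sketched as a split of the linear placement at every module that is not amenable to compaction. This is only an illustration: the function name `split_sub_datapaths`, the string module names, and the predicate that recognizes hardmacros by an "H" prefix are hypothetical, and the actual floorplanner may apply further criteria.

```python
def split_sub_datapaths(placement, is_boundary):
    """Split a left-to-right placement into maximal runs of compactable modules.

    placement   -- list of module names in placement order
    is_boundary -- predicate marking modules that must not be compacted
    """
    groups, current = [], []
    for module in placement:
        if is_boundary(module):
            # A boundary module closes the current sub-datapath.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(module)
    if current:
        groups.append(current)
    return groups

# Placement of Figure 4 (case a), with hardmacros H1 and H2 as boundaries.
placement = ["M1", "H1", "M2", "M3", "M4", "M5", "H2", "M6", "M7"]
sub_datapaths = split_sub_datapaths(placement, lambda m: m.startswith("H"))
print(sub_datapaths)
# [['M1'], ['M2', 'M3', 'M4', 'M5'], ['M6', 'M7']]
```

Each resulting list can then be handed to an independent compaction run.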
Zone Analysis and Merging
Once the extent of the sub-datapaths has been determined, each sub-datapath D is processed separately. Since the sub-datapaths are independent of each other, all steps of the complete compaction process in Figure 2 can be performed in parallel for all such D. The regular structure of D is exploited to reduce the run-times of the following compaction steps.
D is searched for zones of recurring logic as the first compaction step.
Table 1 (excerpt):
S := {}; /* initially, we don't know any zones */
row := 1; /* start at the bottom of the datapath to compact ... */
/* ... and work your way upwards */
[...]
The algorithm in Table 1 analyzes a given D. Applied to the example of Figure 3, it proceeds as follows: The initial bottom scanline collects ALU4/0 and DWN2/0 into s. Thus, h becomes 4. The inner loop hits ALU4/0 and DWN2/0 twice, then it advances upwards to hit ALU4/0 and DWN2/1 twice. Each instance hit, however, is added only once per module to temp (e.g., we don't add ALU4/0 twice). Since we have now reached the upper border of ALU4/0, the inner loop terminates, and we add the new zone to S with an iteration count of 1. We now repeat the process for the next row up and acquire a second zone of one slice containing ALU4/1, DWN2/2 and TOPDWN/0. This results in a stack $S_M = \{(\{\text{ALU4}, \text{DWN2}, \text{DWN2}\}, 1), (\{\text{ALU4}, \text{DWN2}, \text{TOPDWN}\}, 1)\}$.
The networks in the master slices of the zones are now merged by following their intra-zone (but inter-module!) connections. Thus, D is being merged into a single M with the two slices ALU4-DWN2-DWN2 and ALU4-DWN2-TOPDWN.
In the current example, we have not saved any work, because each slice occurs just once in D. Nevertheless, if we assume a 12-bit datapath similar to the one in Figure 3, the slice {ALU4, DWN2, DWN2} would occur twice in D as zone ({ALU4, DWN2, DWN2}, 2). It would only be processed once during compaction, the results being duplicated to build the 8-bit bottom zone of the 12-bit M. The gains are even more pronounced with wider datapaths, such as 32 bits. In addition, the zone analysis and merging operation can be performed in parallel on each of the compaction areas specified by the floorplanner. The algorithm in Table 1 is a simplified version. The full implementation also considers special cases like vertically overlapping slices and changing port locations between slices.
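The effect of the iteration count can be illustrated with a sketch that is much simpler than the scanline algorithm of Table 1: it assumes the slices have already been identified and are listed bottom to top, and it merely collapses vertically adjacent identical slices into (slice, count) zones. The function names `group_zones` and `expand_zones` are hypothetical, and none of the special cases mentioned above are handled.

```python
def group_zones(slices):
    """Collapse vertically adjacent identical slices into (slice, count) zones."""
    zones = []
    for s in slices:  # bottom to top
        key = tuple(s)
        if zones and zones[-1][0] == key:
            zones[-1] = (key, zones[-1][1] + 1)
        else:
            zones.append((key, 1))
    return zones

def expand_zones(zones):
    """Rebuild the full stack by duplicating each master slice count times."""
    return [master for master, count in zones for _ in range(count)]

# 8-bit example: every slice occurs once, so nothing is saved.
dp8 = [("ALU4", "DWN2", "DWN2"), ("ALU4", "DWN2", "TOPDWN")]
# 12-bit variant: the bottom slice occurs twice.
dp12 = [("ALU4", "DWN2", "DWN2")] + dp8

zones8 = group_zones(dp8)
zones12 = group_zones(dp12)
print(zones12)
# [(('ALU4', 'DWN2', 'DWN2'), 2), (('ALU4', 'DWN2', 'TOPDWN'), 1)]
```

In the 12-bit case only two master slices need to be optimized, mapped, and placed, although the assembled module contains three slices.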
Logic Optimization
After obtaining the master slices of M , we now reduce area and delay. The regularity extraction performed on D in Figure 2 
Since the compaction process is generally independent of the subalgorithms employed, it can easily take advantage of any new advances in the fields of optimization and mapping. For example, the initial version of the compactor supported only the "xl" (MIS-PGA) commands in SIS 1.3 [17] to perform technology mapping to N-LUTs. The current compactor can also employ the more recent FlowMap package [8] that emphasizes delay over area minimization, allowing the user to make a trade-off by choosing the algorithm. Minimization and mapping transform the networks of FPGA blocks (CLBs for XC4000, Figure 5.a) separately for each master slice of M into optimized networks of cells. All placement information is lost and has to be recreated by the following steps. For the XC4000, a cell consists of a 4-LUT, optionally combined with a flip-flop (Figure 5.b). The current compactor implementation does not attempt to handle irregularities in the FPGA logic blocks (e.g., the H-block in XC4000 CLBs). Thus, two cells fit inside the regular part of a CLB. With recent FPGA architectures striving to avoid irregular structures (e.g., Altera FLEX and Xilinx XC5000/XC6200 chips), this restriction seems less severe and could even be removed by the integration of the appropriate CLB packing algorithms.
Pre-Placement Activities
Since the minimization and mapping steps change the circuits in the master slices, the initially generated module layouts are no longer valid and the cells of the slices have to be re-placed.
In order to execute a timing-driven cell placement, a critical path analysis of the complete M has to be performed (Figure 2.3). To do so, M is assembled by interpreting the topology in $S_M$ and instantiating the slices accordingly. Next, the cells are interconnected with vertical inter-slice nets and control nets. The delay trace can then be executed using either the unit delay or unit-fanout delay models of SIS. Afterwards, the arrival and required times of inter-slice nets are back-annotated to their master slices. For input ports, the arrival time becomes the latest time at which their signal arrives at an instance of the slice. For output ports, the required time becomes the earliest time the signal is required in an instance.
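The back-annotation rule can be sketched as a worst-case reduction over all instances of a master slice: the maximum arrival time per input port and the minimum required time per output port. The data layout (one dictionary of port times per instance), the function name `back_annotate`, and the port names `cin` and `cout` are assumptions for illustration only.

```python
def back_annotate(instance_arrivals, instance_required):
    """Fold per-instance timing into constraints on the master slice.

    instance_arrivals -- one {input_port: arrival_time} dict per instance
    instance_required -- one {output_port: required_time} dict per instance
    """
    arrival = {}
    for times in instance_arrivals:
        for port, t in times.items():
            # Latest arrival over all instances.
            arrival[port] = max(arrival.get(port, t), t)
    required = {}
    for times in instance_required:
        for port, t in times.items():
            # Earliest required time over all instances.
            required[port] = min(required.get(port, t), t)
    return arrival, required

# Two stacked instances of the same slice, unit-delay time steps.
arrival, required = back_annotate(
    [{"cin": 0}, {"cin": 3}],
    [{"cout": 6}, {"cout": 9}],
)
print(arrival, required)  # {'cin': 3} {'cout': 6}
```

Taking the worst case over all instances is conservative, which matches the remark that these constraints identify critical paths rather than estimate exact inter-slice delays.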
While these timing constraints are not accurate enough to estimate the delay of a real inter-slice path through its master slices, they suffice to determine which paths are critical in the first place (those having slacks ≤ 0). Multi-terminal nets are decomposed into one or more paths of two-terminal nets (TTNs).
The result is a list of critical paths for each master slice, sorted by ascending length. The timing-driven placement uses these lists to minimize wire lengths on critical paths.
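A minimal sketch of how such a list might be built for one master slice: keep the paths whose slack is at most 0 and sort them by ascending length. The tuple layout `(name, length, slack)` and the function name `critical_paths` are hypothetical, and the unit in which length is measured is left open here.

```python
def critical_paths(paths):
    """Return the critical paths (slack <= 0), sorted by ascending length.

    paths -- list of (name, length, slack) tuples for one master slice
    """
    critical = [p for p in paths if p[2] <= 0]
    return sorted(critical, key=lambda p: p[1])

paths = [
    ("p1", 5, -1.0),  # critical
    ("p2", 2, 0.0),   # critical (slack exactly 0)
    ("p3", 7, 2.5),   # not critical
]
result = critical_paths(paths)
print(result)  # [('p2', 2, 0.0), ('p1', 5, -1.0)]
```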
The floorplanner is responsible for determining the placement [...]
Since the two phases have different scopes (module in the horizontal vs. slice in the vertical phase), the placer uses different length metrics in each phase. Due to its more limited scope, the metric used in the vertical phase (Section 9.2) can be more precise than in the horizontal phase (Section 9.1).
Except for the allocation of vertical long lines (VLL) in the horizontal phase, no effort is made to balance congestion in routing channels. This seems feasible, because the pins on a CLB are interchangeable to a large degree. Thus, the pin assignment and routing steps can relieve congestion by swapping pins to less dense channels.
Both phases handle multi-locked ports identically (Figure 6.a). Assuming that a port P, sourced by node X, is multi-locked to all sides of the placement area, the distance used as the length of TTN (X, P) for critical path calculations will be modeled by taking the maximum distance of all TTNs connecting its source node X with the corresponding port location of P.
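The worst-case treatment of a multi-locked port can be sketched as follows. The Manhattan metric is only a stand-in, since each placement phase uses its own length metric; the coordinates and the function name `multilocked_length` are likewise illustrative assumptions.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def multilocked_length(source, port_locations, dist=manhattan):
    """Length of TTN (X, P) when port P is locked to several locations:
    the maximum distance from the source node to any of them."""
    return max(dist(source, loc) for loc in port_locations)

# Node X at (1, 1) in a placement area with W = 3, H = 2; port P is
# locked to one location on each side (left, right, top, bottom).
x_node = (1, 1)
p_locations = [(0, 1), (4, 1), (2, 3), (2, 0)]
length = multilocked_length(x_node, p_locations)
print(length)  # 3
```

Using the maximum keeps the critical path calculation pessimistic regardless of which port location is finally chosen.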
Horizontal Placement
During horizontal placement (Figure 2.4), the placer strives to: (1) Assign cells to the columns of the placement area in order to minimize the number of VLLs used for control signal routing. The underlying ILP is based on the model shown in Figure 6.b. The placement areas for Slice0, Slice1, and Slice2 in the example each consist of a (2,3) grid of cells. Data ports of M can be placed adjacent to the areas in columns 0 (for left ports) and 4 (right ports). Each column also has an associated control routing channel with 10 vertical long lines (VLLs) for control routing (a maximum of 2 VLLs per channel is used in the example). This channel is assumed to lie left of the cell column. A control signal in channel n is available to cells in columns n and n−1 (e.g., control b in channel 2 reaches cells in columns 1 and 2). Note that for control routing, the channel W+1 directly to the right of the placement area (H, W) is also considered available.
If necessary, control signals can be replicated and routed in multiple channels (not shown in the example). Thus, the number c of VLLs used for control routing can be greater than the number of control signals.
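The channel reachability rule and the effect of signal replication on the VLL count c can be sketched as follows. The function names and the dictionary layout are assumptions; cell columns are numbered 1 to W as in the example, with columns 0 and W+1 reserved for data ports.

```python
def columns_reached(channel, width):
    """Cell columns (1..width) that can tap a control signal routed in
    the given channel: columns channel-1 and channel, if they exist."""
    return [c for c in (channel - 1, channel) if 1 <= c <= width]

def vlls_used(assignment):
    """Count VLLs for an assignment {signal: set of channels}.
    A replicated signal occupies one VLL in every channel it uses."""
    return sum(len(channels) for channels in assignment.values())

W = 3
print(columns_reached(2, W))      # [1, 2]
print(columns_reached(W + 1, W))  # [3]  (channel right of the area)

# Signal b is replicated into channels 2 and 4 to reach all columns.
assignment = {"a": {1}, "b": {2, 4}}
print(vlls_used(assignment))      # 3
```

With replication, two control signals occupy three VLLs, which is why c can exceed the number of control signals.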
The wiring delay of intra-slice TTNs, such as (A,B), is also modeled as $|x_A - x_B|$. However, this metric becomes increasingly inaccurate with growing $H$. Since the vertical distance is not known during this phase, it is currently approximated as $H/4$. This assumption is based on the XC4000 topology of a maximum of one switch matrix for 4 cells (4-LUTs) in a (4,1) area. Thus, $|A - B|$ becomes $|x_A - x_B| + H/4$. Without an estimation, the model would try to minimize the wiring delays by mistakenly preferring the vertical over the horizontal direction. Layouts produced without the estimate are measurably worse in terms of delay than those produced with it. The imprecision of this approximation is acceptable given the compactor's intent to process flat bit-slices instead of tall modules. Should this assumption fail, a more accurate assessment would be necessary.
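The following sketch evaluates the estimate for an intra-slice TTN. It treats H/4 as a real-valued quotient, which is an assumption, since the text does not state whether the term is rounded; the function name `estimated_ttn_length` is hypothetical.

```python
def estimated_ttn_length(x_a, x_b, height):
    """Horizontal-phase length estimate for an intra-slice TTN (A, B):
    known column distance plus a fixed allowance for the still
    unknown vertical distance."""
    return abs(x_a - x_b) + height / 4

H = 2
print(estimated_ttn_length(1, 3, H))  # 2.5
print(estimated_ttn_length(2, 2, H))  # 0.5
```

Without the H/4 term, the second net (both cells in the same column) would have length 0, so the model would treat stacking cells vertically as free.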
Given the three quantities introduced in the preceding paragraphs, the objective function for the horizontal placement phase becomes $\min(w_d\, d_{\max} + w_c\, c + w_a\, a_{\max})$. $w_d$, $w_c$, and $w_a$ are user-definable weights. E.g., a user might increase $w_d$ over $w_c$ when a faster circuit at the cost of an increased number of control lines is desired.
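The trade-off steered by the weights can be illustrated by evaluating the weighted sum for two candidate placements. The candidate values are invented for illustration, and the three quantities are treated as opaque numbers here; in the compactor they are variables of the ILP, not precomputed constants.

```python
def objective(d_max, c, a_max, w_d=1.0, w_c=1.0, w_a=1.0):
    """Weighted sum minimized in the horizontal placement phase."""
    return w_d * d_max + w_c * c + w_a * a_max

# Hypothetical candidates: A uses fewer control lines, B is faster.
cand_a = {"d_max": 4, "c": 2, "a_max": 1}
cand_b = {"d_max": 3, "c": 4, "a_max": 1}

def best(w_d, w_c, w_a):
    scores = {
        "A": objective(w_d=w_d, w_c=w_c, w_a=w_a, **cand_a),
        "B": objective(w_d=w_d, w_c=w_c, w_a=w_a, **cand_b),
    }
    return min(scores, key=scores.get)

print(best(1, 1, 1))  # A  (7 vs. 8)
print(best(3, 1, 1))  # B  (15 vs. 14)
```

Raising w_d over w_c flips the preference to the faster candidate even though it needs more control lines.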
Vertical Placement
In contrast to the horizontal phase, the vertical placement phase (Figure 2 .5) concentrates solely on wiring delay minimization on the critical paths. Since it is not concerned with inter-slice dependencies, its scope can be limited to a single master slice.
With the reduced problem size, it becomes possible to use a more precise model of the FPGA routing architecture that better reflects the non-continuous distance relations. This more detailed model generates measurably better layouts over those obtained using simple manhattan distances, especially for more complex slices. Figure 7 shows the model, which is a simplified view of the XC4000 routing network. Cells A to I have been labeled to serve as example TTN nodes in further explanations. The model encompasses direct connections (no switch matrices passed) and general single-length connections (one switch matrix per segment). Vertical long lines were handled in the horizontal placement phase. Horizontal long lines were allocated during floorplanning to create chip-wide busses or to route long-range inter-module signals. To limit the complexity of the model, double-length lines are presently not included.
The horizontal phase was concerned only with placing cells. The vertical phase, however, has to take FPGA block boundaries into account and thus operates on a CLB matrix with the same width, but half the height of its underlying cell matrix. The upper cell of a CLB will be placed in the G-LUT and thus use the Y and YQ outputs, the lower cell will be located in the F-LUT with its output being routed through the X and XQ pins (Figure 5). Y/YQ and X/XQ output pins are assumed equivalent for routing purposes: The Y/YQ pins reach above and to the right of their CLB, the X/XQ pins below and to the left. The location of input pins is not modeled because they are located at all four sides of the CLB. A signal is assumed to be available at the inputs of all cells within a CLB when it reaches the CLB boundary.
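The mapping from the cell matrix to the CLB matrix can be sketched as follows. The row numbering (0-based, counted from the bottom, so that even rows hold the lower cell of a CLB) and the function name `cell_to_clb` are assumptions made for illustration.

```python
def cell_to_clb(col, row):
    """Map a cell position to (clb_col, clb_row, lut, output_pins).

    Assumes rows are 0-based from the bottom: even rows are the lower
    cell of a CLB (F-LUT), odd rows the upper cell (G-LUT).
    """
    clb_row = row // 2  # two cells per CLB, so half the height
    if row % 2 == 1:
        return (col, clb_row, "G", ("Y", "YQ"))
    return (col, clb_row, "F", ("X", "XQ"))

print(cell_to_clb(2, 0))  # (2, 0, 'F', ('X', 'XQ'))
print(cell_to_clb(2, 3))  # (2, 1, 'G', ('Y', 'YQ'))
```

The column index is unchanged, reflecting that the CLB matrix has the same width as the cell matrix.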
The metric employed in this phase is not based on simple Manhattan distances, but on an actual count of the switch matrices (SM) in a signal path. To compute this count, three major cases based on the horizontal distance of the cells (a, b) of a TTN have to be considered. For each case and sub-case, the corresponding TTNs in Figure 7 will be pointed out.
If the horizontal distance is 0, the SM-distance is the simple CLB Manhattan distance [...]
If the horizontal distance is greater than 1, another effect be[...]
Experimental Results
The compactor has been implemented as part of the SDI strategy [12]. It consists of 6000 lines of C that extend SIS 1.3 [17]. The models are formulated as pure 0-1 problems to allow pre-processing by OPBDP [4], which performs "logic optimization" on the ILPs and quickly generates an upper bound using constructive enumeration techniques. CPLEX [9] solves the resulting models. Due to the lack of an established benchmark suite for datapath structures, two non-standard circuits were selected as examples. In the context of the compactor, only regular datapaths are examined. Controller processing is left to other SDI components. To evaluate the quality of our regular approach and avoid inaccuracies due to different module generator libraries, all test circuits were entered manually. We compare the performance of our regularly compacted circuits against those obtained by the standard design implementation procedure using the Xilinx XACT PPR tool (irregular placement of the flattened design).
UFC-A is part of an address generator for fast DES encryption. It was entered initially as 26 16-bit combinational modules. Regular compaction using MIS-PGA reduced the size from 368 to 96 LUTs. Irregular optimization and mapping of the flattened circuit by PPR yielded a reduction to 112 LUTs.
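The LUT counts for UFC-A translate into relative savings as follows. This is plain arithmetic on the numbers quoted above; the helper name `reduction` is hypothetical.

```python
def reduction(before, after):
    """Fractional reduction in LUT count."""
    return 1 - after / before

sdi = reduction(368, 96)    # regular compaction with MIS-PGA
ppr = reduction(368, 112)   # irregular flow with PPR
sdi_vs_ppr = reduction(112, 96)

print(f"SDI: {sdi:.1%}, PPR: {ppr:.1%}, SDI vs. PPR: {sdi_vs_ppr:.1%}")
# SDI: 73.9%, PPR: 69.6%, SDI vs. PPR: 14.3%
```

The regularly compacted circuit thus needs about one seventh fewer LUTs than the result of the flattened flow.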
T16 is a 16-bit datapath consisting of two instances of a sample combinational module with a structure common to many bit-slices (shared control lines, vertical inter-slice signals). It is composed by stacking a single slice of sixteen 4-LUTs 4 times per module. For this benchmark, logic optimization or technology mapping was performed neither by SDI nor by PPR in order to directly compare the regular (SDI) to the conventional irregular placement (PPR).
TALU32 is a 32-bit ALU with registered inputs built by stacking eight 74181 [11] 4-bit ALU slices. The 74181 slice has been minimized and mapped for area efficiency from 65 nodes to 24 4-LUTs by MIS-PGA commands.
PPR was always run with maximum optimization (placer effort = 5) in performance-driven mode (dp2p, dc2p) with all pads floating. Both SDI and PPR placements were routed by PPR, also using maximum optimization (router effort = 4). The run-times in Table 2 were measured on an unloaded Sparc 20/71 workstation with 64MB RAM. Since the simulated annealing in PPR is non-deterministic, measurements are listed for the best and worst cases over a number of runs.
The two resulting layouts of T16 are shown in Figures 8(a) and 8(b). Even at first glance, the SDI-placed solution is obviously more regular, since the natural structure of the datapath has been
