Large circuits, whether they are arithmetic, digital signal processing, switching, or processors, typically contain a greater portion of highly regular datapath logic. Datapath synthesis algorithms preserve these regular structures, so they can be exploited by packing, placement, and routing tools for speed or density. Typical datapath synthesis algorithms, however, sacrifice area to gain regularity. Current algorithms can have as much as 30% to 40% area inflation when compared with traditional flat synthesis algorithms. This paper describes a datapath synthesis algorithm with very low area overhead, which is an enhancement to the module compaction algorithm proposed in [8]. We propose two word-level optimizations: multiplexer tree collapsing and operation reordering. They reduce the area inflation to 3% to 4% as compared with flat synthesis. Our synthesis results also retain a significant amount of regularity from the original designs.
Introduction
As FPGAs are used to implement ever-larger applications, it has become compelling to complement the traditional flat synthesis technology with more advanced datapath synthesis techniques. Although flat synthesis is ideal for small control-logic type circuits, it is not efficient for larger circuits, which typically contain a greater portion of datapath logic [11]. Whether it is arithmetic, digital signal processing, switching, or processors, datapath logic has highly regular structures. These structures are usually destroyed during flat synthesis. Datapath synthesis algorithms, on the other hand, preserve the regularity, so it can be exploited by packing, placement, and routing tools to achieve greater speed or density. By preserving regularity, datapath synthesis also preserves carry chains, which are specially supported by many commercial FPGAs. From a user perspective, FPGA users are accustomed to fully automated design flows. Since datapath synthesis significantly increases the level of automation for datapath design, it is particularly suitable for FPGAs. For these reasons, there has been increased interest in implementing efficient datapath synthesis for FPGAs.
0-7803-7574-2/02/$17.00 ©2002 IEEE
Previous studies [4] [7] [8] [9] [12] [13] [16] have shown that the logic density of FPGAs can be substantially increased by exploiting regularity at the placement, routing, and architecture levels. However, there are no extensive studies focusing on the effects of datapath synthesis on FPGA area. Existing datapath synthesis techniques can be roughly classified into four categories, one of which is regularity preserving logic transformations. Among these four synthesis techniques, only regularity preserving logic transformations do not incur significant area overhead [10]. However, their goal is to extract regularity from flattened datapath logic, rather than to preserve a given hierarchy. As a result, their effectiveness is limited by the amount of regularity that can be discovered by the extraction process.
In this paper, we present an enhanced module compaction algorithm, which is augmented with two word-level transformations (by word-level we mean operations that optimize across multiple bits of the datapath): multiplexer tree collapsing and operation reordering. Currently, these two word-level transformations are performed manually, but the algorithms presented here can be easily automated.
Unlike Koch's algorithm, which uses placement information to selectively merge datapath modules, our module compaction algorithm does not require placement information. Instead, we merge modules together based on inter-module connectivity. As a result, our algorithm can be more easily integrated into existing CAD flows. Our enhanced algorithm is shown, empirically, to be able to preserve regularity while incurring an average area overhead of only 3% to 4% versus flat synthesis.
In the next section, we describe our synthesis flow in detail. Section 3 presents the experimental results on a series of 15 benchmark circuits, comparing flat, hard-boundary hierarchical, and our enhanced module compaction synthesis. We also show that our synthesis maintains various regular structures from the input netlists. The effect of synthesis granularity on LUT count inflation is also presented. We conclude in Section 4.
Enhanced Module Compaction Algorithm
This section describes our datapath synthesis algorithm in detail. It first gives an overview of the input representation and the overall flow of our algorithm. It then discusses each synthesis/optimization step in detail.
Datapath Circuit Representation
The input to the synthesis algorithm is a netlist of datapath components, described in VHDL or Verilog, which we call the top-level netlist. All datapath components used in the netlist are instantiated from a predefined datapath component library. This library contains fundamental datapath building blocks such as multiplexers, adders/subtracters, shifters, comparators, and registers.
These datapath components are in turn composed of bit-level structures that we call bit-slice netlists. A bit-slice netlist is instantiated multiple times, and all its instantiations are interconnected into another netlist that describes the function and structure of the datapath component. We call this netlist the datapath component level netlist.
The number of bit-slice netlist instantiations corresponds to the width of the datapath. All instantiations are assigned a unique bit-slice number from one to the width of the datapath with the least significant bit-slice labeled one.
An example of a datapath component is shown in Figure 1. This datapath component is a 4-bit ripple carry adder.
The bit-slice netlist of this datapath component is a netlist of logic gates defining a full adder. This design is instantiated four times to form the 4-bit adder.
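The relationship between a bit-slice netlist and its datapath component can be illustrated with a minimal Python sketch. It is not the representation used in our implementation; the class name `DatapathComponent` and the function `full_adder` are hypothetical, and the bit-slice is modeled as a function rather than as a netlist of gates.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Stand-in for the bit-slice netlist: one full adder, returns (sum, cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


@dataclass
class DatapathComponent:
    """Hypothetical container: one bit-slice instantiated `width` times."""
    name: str
    width: int
    bit_slice: Callable[[int, int, int], Tuple[int, int]]

    def evaluate(self, a: int, b: int, cin: int = 0) -> Tuple[int, int]:
        """Chain the slices through the carry; slice 1 is the least significant bit."""
        result = 0
        carry = cin
        for slice_no in range(1, self.width + 1):
            bit = slice_no - 1
            s, carry = self.bit_slice((a >> bit) & 1, (b >> bit) & 1, carry)
            result |= s << bit
        return result, carry


# Four instantiations of the full-adder slice form the 4-bit ripple carry adder.
adder4 = DatapathComponent("adder4", 4, full_adder)
```

Calling `adder4.evaluate(5, 6)` returns the 4-bit sum together with the carry out of the most significant slice.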
Synthesis Overview
The overall synthesis flow is shown in Figure 2. The flow consists of four major stages. First, the top-level netlist is passed through a three-stage optimization process where new datapath components are created by transforming and merging bit-slice netlists. During the optimization process, some logic will be created which does not belong to specific bit-slices, for example, logic generating signals that fan out to several bit-slices. This is called irregular logic (to distinguish it from logic that fits nicely into a datapath) and is represented directly as logic gates in the top-level netlist. Each distinct optimization type is discussed in the sections below.
After the three-stage optimization, each bit-slice netlist is synthesized and mapped into 4-input lookup tables (4-LUTs) and D-type Flip-Flops without set and reset signals using a traditional flat synthesis algorithm. The irregular logic gates are also synthesized and mapped into LUTs independently from datapath components using the same flat synthesis algorithm.
Word-Level Optimization
The first set of optimizations that we perform are word-level optimizations. Two types of word-level transformations are performed. One is used to extract common sub-expressions across bit-slice boundaries. The other uses operation reordering to reduce area. Currently, these two optimizations are performed manually. Their algorithms, which are suitable for automation, are presented here.
Each datapath component represents a set of arithmetic operations. In a top-level netlist, datapath components are connected together to form mathematical functions. Each of these functions has multiple bit outputs, where the output bits can be individually described using logic expressions. Often, common sub-expressions exist across these logic expressions.
More precisely, let both [...]. We call $g(x)$ the common sub-expression of $f_0(x), f_1(x), \ldots, f_n(x)$. The implementation area of mathematical functions can be reduced by properly discovering and extracting these common sub-expressions so that they are only implemented once.
In a flat synthesis process, common sub-expressions are extracted through logic transformations. This extraction process usually destroys the regularity of datapath circuits, since flat synthesis independently transforms logic expressions one bit at a time. We have found that many of these common sub-expressions can be discovered at the word-level. Furthermore, datapath regularity can easily be preserved by extracting these common sub-expressions at the word-level, where datapath structures remain clearly identifiable.
For our benchmarks, the most effective word-level transformation that extracts common sub-expressions is multiplexer tree collapsing. In a multiplexer tree, the multiplexers, their data inputs, outputs, and the interconnection signals form a tree topology. Each node of the tree, which has multiple inputs and a single output, represents a multiplexer. Each input of a node corresponds to a multiplexer data input. The output of a node corresponds to a multiplexer output. An edge in the graph represents a net connecting a multiplexer output to a multiplexer data input, a primary input, or the primary output of the multiplexer tree.
A multiplexer tree can sometimes be substituted by a single multiplexer, which requires much less logic to implement. An example is shown in Figure 3. Here the multiplexer tree in the left circuit is substituted by a single multiplexer in the right circuit. To implement the two multiplexers and the AND gate in the left circuit, we need two 4-input LUTs for every bit-slice, as indicated by the shaded regions in the figure. To implement the multiplexer and the AND gate in the right circuit, we need only one 4-input LUT for every bit-slice. The extra random logic in the right circuit is the common sub-expression extracted by the transformation. It is usually shared by several bit-slices, so its area cost is small in wide datapath circuits.
The algorithm used to collapse multiplexer trees is as follows: First we identify multiplexer trees in the top-level netlist. This is easy to perform since the functionality of each datapath component is known. We then identify the total number of unique data inputs to each tree. We replace each tree by a single multiplexer whose width is equal to the number of unique data inputs of the tree. Each input of the new multiplexer is connected to a unique multiplexer tree primary data input. The output of the new multiplexer is connected to the primary output of the tree. Finally, the select signal of the new multiplexer is generated using the select signals of the original multiplexer tree. If the replacement reduces the area cost, it is retained. Otherwise, the replacement is rejected.
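The collapsing steps above can be sketched in Python as follows. This is an illustrative sketch only: the `Mux` class, `toy_cost`, and the helper names are hypothetical, generation of the new select signal is not modeled, and the cost function simply counts equivalent 2:1 multiplexers rather than LUTs after mapping.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Mux:
    # Each input is either another Mux (tree edge) or a primary data input name.
    inputs: List[Union["Mux", str]]


def unique_data_inputs(node: Mux) -> List[str]:
    """Collect the distinct primary data inputs of a multiplexer tree, in order."""
    seen: List[str] = []
    for child in node.inputs:
        leaves = unique_data_inputs(child) if isinstance(child, Mux) else [child]
        for leaf in leaves:
            if leaf not in seen:
                seen.append(leaf)
    return seen


def toy_cost(node: Mux) -> int:
    """Assumed cost model: number of equivalent 2:1 multiplexers in the tree."""
    inner = sum(toy_cost(c) for c in node.inputs if isinstance(c, Mux))
    return inner + len(node.inputs) - 1


def collapse(tree: Mux, cost: Callable[[Mux], int]) -> Mux:
    """Replace the tree by one multiplexer if, and only if, that lowers the cost."""
    candidate = Mux(inputs=unique_data_inputs(tree))
    return candidate if cost(candidate) < cost(tree) else tree
```

With this toy cost, a tree in which the same data input appears under two different multiplexers is collapsed, while a tree with no repeated inputs is left unchanged, mirroring the accept/reject step of the algorithm.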
A second word-level transformation that we perform uses operation reordering to reduce area. In particular, the optimization reorders result selections into operand selections. Arithmetic operators such as multiplication are, in general, much more expensive than multiplexers. In the event that several identical operations are performed on independent data sets and only one result is used, it is usually much cheaper to preselect the input data than to perform all operations in parallel and select the final result.
An example is shown in Figure 4. Here the result of two addition operations is selected by a 2:1 mux. The operation can be more efficiently performed by preselecting adder inputs and using a single adder instead of two. Before optimization, five 4-input LUTs are needed to implement the function. After optimization, only four 4-input LUTs are needed to implement the same function. This optimization is not obvious at the bit-slice level, since coutOa and coutOb appear to be two independent signals at this level. However, when viewed from the top-level netlist, the optimization is clearly identifiable.
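The functional equivalence behind operation reordering can be checked with a short Python sketch. The function names are hypothetical, and the sketch models only the word-level behavior of the two circuits (select after add versus add after select), not their LUT counts.

```python
from itertools import product


def mux2(sel: int, x0: int, x1: int) -> int:
    """2:1 multiplexer: returns x1 when sel is 1, otherwise x0."""
    return x1 if sel else x0


def select_after_add(a, b, c, d, sel, width=4):
    """Before reordering: two adders, then a mux selects one result."""
    mask = (1 << width) - 1
    return mux2(sel, (a + b) & mask, (c + d) & mask)


def add_after_select(a, b, c, d, sel, width=4):
    """After reordering: two muxes preselect the operands of a single adder."""
    mask = (1 << width) - 1
    return (mux2(sel, a, c) + mux2(sel, b, d)) & mask


def equivalent(width: int = 4) -> bool:
    """Exhaustively compare both forms over all operand values of the given width."""
    values = range(1 << width)
    return all(
        select_after_add(a, b, c, d, sel, width) == add_after_select(a, b, c, d, sel, width)
        for a, b, c, d in product(values, repeat=4)
        for sel in (0, 1)
    )
```

Because both forms compute the same word-level function, the choice between them can be made purely on implementation area.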
Module Compaction
In the second stage of optimization, we perform module compaction. Here we iteratively merge two connected bit-slice netlists together to form a larger bit-slice netlist. By creating larger bit-slice netlists, we also create more optimization opportunities for the flat synthesis stage shown in Figure 2, where synthesis is restricted to the boundaries of bit-slice netlists. This merging process is similar to the module compaction algorithm proposed by Koch in [8]. Our algorithm differs from Koch's algorithm in its merging criteria; unlike Koch's algorithm, our algorithm does not depend on any placement information.
The basic merging operation is a pattern identification process. Two groups of bit-slices from two datapath modules are merged if the following conditions are met:
1. These two groups contain equal numbers of bit-slices.
2. All bit-slices in each group have consecutive bit-slice numbers as defined in Section 2.1.
3. All bit-slices in one group are identically connected to their corresponding bit-slices in the other group. Here we define two corresponding bit-slices to be bit-slices from two distinct groups, each with the same offset from the lowest bit-slice number in its group.
If a merging group does not include all the bit-slices of its datapath module, the remaining slices in the module are split into two modules: one module with all the bit-slices whose bit-slice numbers are smaller than the bit-slice numbers of the merging group, the other module with all the bit-slices whose bit-slice numbers are larger than the bit-slice numbers of the merging group.
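The merging conditions and the splitting of leftover slices can be sketched in Python as follows. The connectivity encoding (a dictionary from pairs of slices to sets of port pairs) and the function names are assumptions made for illustration; they are not the data structures of our implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

Slice = Tuple[str, int]                      # (module name, bit-slice number)
Ports = FrozenSet[Tuple[str, str]]           # {(output port, input port), ...}
Connectivity = Dict[Tuple[Slice, Slice], Ports]


def can_merge(conn: Connectivity, mod_a: str, start_a: int,
              mod_b: str, start_b: int, length: int) -> bool:
    """Conditions 1 and 2 hold by construction (two runs of `length` consecutive
    slices); condition 3 requires every corresponding pair to be wired alike."""
    first = conn.get(((mod_a, start_a), (mod_b, start_b)))
    if not first:
        return False  # the lowest pair is not connected at all
    return all(
        conn.get(((mod_a, start_a + k), (mod_b, start_b + k))) == first
        for k in range(length)
    )


def split_remainder(width: int, start: int, length: int) -> Tuple[List[int], List[int]]:
    """Slice numbers left below and above a merging group inside its module."""
    low = list(range(1, start))
    high = list(range(start + length, width + 1))
    return low, high
```

For an 8-bit module whose slices 3 to 6 take part in a merge, `split_remainder(8, 3, 4)` yields the two leftover modules made of slices 1 to 2 and slices 7 to 8.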
An example of module compaction is shown in Figure 5. Here, we start with two modules. We impose two extra conditions to prevent a carry-type signal from causing all bit-slices connected to it to be merged into a single module. For example, consider a second merging iteration on the circuit of Figure 5 after the initial merging described above. Module A will qualify to be merged with the first slice of module B, since they are connected by the carry signal. Then in the third merging iteration, modules A and B will be completely merged into a single bit-slice. After two more iterations, the carry chain will cause A, B, and D in the figure to be merged into a single bit-slice, which completely destroys the regularity of our datapath.
To prevent this, first, we order merging operations, so that operations that will create the widest datapath components are performed first. Second, for every bit-slice netlist we define an ancestors field, which is a set of bit-slice netlists. Initially each bit-slice netlist has only itself in its ancestors set. When two bit-slice netlists are merged, the ancestors set of the new bit-slice netlist is the union of the ancestors sets of the two merging bit-slice netlists. If the intersection of the ancestors of two bit-slice netlists is not empty, these two bit-slice netlists cannot be merged together.
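The ancestors rule can be sketched in Python as follows. The class `BitSliceNetlist` and the functions `may_merge` and `merge` are hypothetical names; the sketch tracks only names and ancestors sets, and leaves out the ordering of merging operations by datapath width.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class BitSliceNetlist:
    name: str
    ancestors: FrozenSet[str] = frozenset()

    def __post_init__(self):
        # Initially a bit-slice netlist has only itself in its ancestors set.
        if not self.ancestors:
            object.__setattr__(self, "ancestors", frozenset({self.name}))


def may_merge(x: BitSliceNetlist, y: BitSliceNetlist) -> bool:
    """Merging is allowed only when the two ancestors sets do not intersect."""
    return x.ancestors.isdisjoint(y.ancestors)


def merge(x: BitSliceNetlist, y: BitSliceNetlist) -> BitSliceNetlist:
    """The merged netlist inherits the union of both ancestors sets."""
    if not may_merge(x, y):
        raise ValueError("netlists share an ancestor and cannot be merged")
    return BitSliceNetlist(f"{x.name}+{y.name}", x.ancestors | y.ancestors)
```

After `ab = merge(a, b)`, the netlist `ab` can no longer be merged with anything derived from `a` or `b`, which is what stops a carry chain from pulling a whole module into a single bit-slice.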
With the ancestors field, nothing can be merged during the second merging iteration in Figure 5, since all mergeable [...]
Bit-Slice Netlist I/O Optimization
Each bit-slice netlist has a set of predefined I/O signals that enter and exit the netlist. Depending on the usage of these signals, some of them can be eliminated and converted into internal signals of the netlist. Since each bit-slice netlist is flat synthesized in our synthesis flow, converting I/O signals into internal signals can reduce the implementation area of bit-slices by providing extra information to the flat synthesizer. In our optimization process, four types of bit-slice I/O signals are converted into internal signals of bit-slices. Each type is discussed below.
Before any I/O optimization is performed, each datapath component in the top-level netlist is first divided into m-bit wide subcomponents, where m is specified by the user. Each subcomponent is a self-contained datapath component with its own bit-slice netlist definition and a netlist of m instantiations of the bit-slice netlist. The division starts from the least significant bit of each datapath component and groups adjacent m bit-slices into a subcomponent. If the width of the datapath component is not an integer multiple of m, the subcomponent containing the most significant bits will be less than m bits wide. We call m the granularity of the synthesis flow. A larger m preserves more datapath regularity at the expense of increased area, while a smaller m decreases area at the expense of preserving less datapath regularity. After division, each original datapath component in the top-level netlist is substituted by its corresponding subcomponents.
The first type of I/O optimization is constant absorption. When an input of a bit-slice netlist is always connected to the same constant value (either zero or one) for all instantiations of the netlist in a datapath component, we convert this input signal into a constant internal signal of the netlist.
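The division of a datapath component into subcomponents of granularity m can be sketched in Python as follows. The function name `divide` and its return format are hypothetical.

```python
from typing import List, Tuple


def divide(width: int, m: int) -> List[Tuple[int, int]]:
    """Return (lowest bit-slice number, slice count) for each subcomponent,
    starting from the least significant bit-slice (numbered one)."""
    if width < 1 or m < 1:
        raise ValueError("width and m must be positive")
    return [(start, min(m, width - start + 1)) for start in range(1, width + 1, m)]
```

For example, `divide(10, 4)` produces two 4-bit subcomponents followed by one 2-bit subcomponent that holds the most significant bits.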
The second type of I/O optimization is feedback absorption. When a connection exists between a bit-slice netlist input and a bit-slice netlist output for all instantiations of the netlist, we convert this input signal into an internal signal and reconnect it to the corresponding output inside the netlist.
An example of feedback absorption is shown in Figure 6. Here Datapath Component A consists of four bit-slices, which are all instances of the same bit-slice netlist. Since each of the slice inputs labeled Ai1 is connected to a corresponding slice output labeled Ao from the same slice, Ai1 is eliminated as an input of the bit-slice netlist and is converted to an internal signal. Ai1 is reconnected to Ao inside the netlist.
The third type of I/O optimization is duplicated input absorption. When two bit-slice netlist inputs are connected together for all instantiations of the design, we convert one of the input signals into an internal signal and reconnect it to the other input signal inside the netlist.
An example of duplicated input absorption is shown in Figure 7. As before, Datapath Component A consists of four bit-slices, which are all instances of the same bit-slice netlist. Since each of the slice inputs labeled Ai1 is always connected to a corresponding slice input labeled Ai2 in the same slice, Ai2 is eliminated as an input of the bit-slice netlist and is converted to an internal signal. Ai2 is reconnected to Ai1 inside the netlist.
The last type of I/O optimization that we perform is unused output elimination. When a bit-slice netlist output does not connect to any other signals in all instantiations of the bit-slice netlist, this output signal is converted into an internal signal of the bit-slice netlist.
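The detection side of the four I/O optimizations can be sketched in Python as follows. The encoding is an assumption made for illustration: each instantiation is a dictionary with an `"in"` map and an `"out"` map from port names to net names, constants are the nets `"0"` and `"1"`, and `loaded_nets` is the set of nets that drive something outside the slice output itself. The rewriting of the bit-slice netlist is not shown.

```python
from typing import Dict, List, Set, Tuple

Instance = Dict[str, Dict[str, str]]  # {"in": {port: net}, "out": {port: net}}
CONSTANTS = {"0", "1"}


def constant_inputs(insts: List[Instance]) -> Dict[str, str]:
    """Inputs tied to the same constant in every instantiation."""
    ports = insts[0]["in"]
    return {p: ports[p] for p in ports
            if ports[p] in CONSTANTS and all(i["in"][p] == ports[p] for i in insts)}


def feedback_inputs(insts: List[Instance]) -> List[Tuple[str, str]]:
    """(input, output) pairs connected to each other in every instantiation."""
    return [(p, q) for p in insts[0]["in"] for q in insts[0]["out"]
            if all(i["in"][p] == i["out"][q] for i in insts)]


def duplicated_inputs(insts: List[Instance]) -> List[Tuple[str, str]]:
    """Pairs of inputs driven by the same net in every instantiation."""
    ports = sorted(insts[0]["in"])
    return [(p, q) for n, p in enumerate(ports) for q in ports[n + 1:]
            if all(i["in"][p] == i["in"][q] for i in insts)]


def unused_outputs(insts: List[Instance], loaded_nets: Set[str]) -> List[str]:
    """Outputs whose nets have no load in any instantiation."""
    return [q for q in insts[0]["out"]
            if all(i["out"][q] not in loaded_nets for i in insts)]
```

Each function reports a signal only when the condition holds for all instantiations, matching the requirement that the conversion must be valid for the shared bit-slice netlist.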
Experimental Results
In this section, we present experimental results of applying the enhanced module compaction synthesis to fifteen datapath benchmarks. These fifteen circuits are from the Pico-Java processor [1]. Note that the word-level optimizations described in Section 2.3 were performed manually. The other optimizations were done by automated algorithms implemented in the C language. We used the Synopsys Design Compiler and FPGA Compiler [2] to perform flat synthesis. Unless specified otherwise, all the data presented here are synthesized using a granularity value (m), as defined in Section 2.5, of 4. For every benchmark circuit, we compared the final LUT and flip-flop counts of our enhanced module compaction synthesis with the counts achieved by Synopsys flat synthesis. To ensure that the best achievable flat synthesis results are used in the comparison, we use the best flat synthesis result from two flows: the flat synthesized input netlist, and the flat synthesized output netlist of our enhanced module compaction synthesis. In some cases, one flat synthesis flow offers slightly better results than the other.
Table 1 summarizes the LUT and flip-flop inflation of each benchmark for flat synthesis, hard-boundary hierarchical synthesis, and our new enhanced module compaction synthesis. Each inflation figure is calculated by comparing the enhanced module compaction synthesis with the best flat synthesis. The formula
$$\text{inflation} = \frac{DA}{FA} - 1$$
is used to calculate the inflation for both LUTs and flip-flops. In the formula, DA represents the datapath-oriented synthesis area; FA represents the flat synthesis area.
[Table fragment recovered from extraction: rsadd-dp: 722, 52%, 21%; Total: 37152, 45%, 35%]
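The inflation formula can be written as a short Python function. The function name is hypothetical, and the counts used in the usage note are made-up values chosen only to illustrate the arithmetic; they are not measurements from our benchmarks.

```python
def inflation(da: float, fa: float) -> float:
    """Area inflation of datapath-oriented synthesis (DA) over flat synthesis (FA)."""
    if fa <= 0:
        raise ValueError("flat synthesis area must be positive")
    return da / fa - 1.0
```

For instance, a hypothetical circuit that needs 1032 LUTs after datapath-oriented synthesis and 1000 LUTs after flat synthesis has an inflation of about 0.032, that is, 3.2%. A negative value means the datapath-oriented result is smaller than the flat one.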
Column one of the table lists the name of each benchmark circuit. Columns two and three give the LUT and flip-flop count of each circuit from the best flat synthesis.
Columns four and five give the inflation figures of hard-boundary hierarchical synthesis. Here, synthesis is performed without any of the optimizations described in Section 2. The inflation figures of enhanced module compaction with full optimization are listed in columns six and seven. The average LUT inflation without optimization is 38% and the average flip-flop inflation is 0.73%. With the optimizations, the average LUT inflation is reduced to 3.2% and the flip-flop inflation is zero. These numbers show that our algorithm does not significantly increase the LUT and flip-flop count for these benchmarks and is much more area efficient than hard-boundary hierarchical synthesis. For the circuit rsadd-dp, our synthesis even discovered more optimizations than flat synthesis, resulting in much smaller area.
We now present measurements of various aspects of the datapath regularity of the circuits after enhanced module compaction synthesis. After synthesis, 94% of the LUTs remain in datapath components, while only 6% of the logic resides in irregular logic. This shows that our synthesis flow preserves regularity for logic blocks.
We also measured the regularity of nets after synthesis. A two-terminal bus is defined as an m-bit wide bus (4 in this table) that connects one datapath component to another and obeys the following two conditions: First, each bit of the bus must be generated by a distinct bit-slice in the source datapath component and absorbed by a distinct bit-slice in the sink datapath component. Second, the source bit-slice and the sink bit-slice must have the same bit-slice number. The topology of a 4-bit wide bus is shown in Figure 8. On average, 48% of two-terminal connections in these benchmarks can be grouped into 4-bit wide busses. The percentage for each benchmark is summarized in column three of Table 2.
A control net is a single net that enters a datapath component and fans out to all m bit-slices (4 in this table). The topology of a 4-bit control net is shown in Figure 9. Control nets account for 35% of the total two-terminal connections in these benchmarks on average. The detailed percentage for each benchmark is shown in column four of Table 2.
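The two net regularity definitions can be sketched in Python as follows. The connection encoding, a tuple of net name, source component, source bit-slice number, sink component, and sink bit-slice number, is an assumption made for illustration, as are the function names. As a simplification, all bits running between the same pair of components are treated as one candidate bus.

```python
from collections import defaultdict
from typing import Iterable, Optional, Set, Tuple

# (net, source component, source slice, sink component, sink slice)
Conn = Tuple[str, str, Optional[int], str, int]


def bus_connections(conns: Iterable[Conn], m: int) -> Set[Tuple[str, str]]:
    """Component pairs linked by an m-bit bus: every bit goes from slice k of
    the source to slice k of the sink, and m distinct slices are covered."""
    groups = defaultdict(set)
    for net, src, s_slice, dst, d_slice in conns:
        if s_slice == d_slice:
            groups[(src, dst)].add(s_slice)
    return {pair for pair, slices in groups.items() if len(slices) == m}


def control_nets(conns: Iterable[Conn], m: int) -> Set[Tuple[str, str]]:
    """(net, component) pairs where a single net fans out to all m bit-slices."""
    fanout = defaultdict(set)
    for net, src, s_slice, dst, d_slice in conns:
        fanout[(net, dst)].add(d_slice)
    return {key for key, slices in fanout.items() if len(slices) == m}
```

A source slice of `None` stands for a driver in irregular logic, which can feed a control net but never forms a bus under the first function.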
Overall, 83% of two-terminal connections belong to either a bus or a control net. Few two-terminal connections belong to both a bus and a control net at the same time.
Finally, Table 3 presents LUT count inflation as a function of m. DFF count did not increase with increasing m.
[Table 3 values recovered from extraction, LUT inflation (%) in order of increasing m: 3.5, 4.6, 6.3, 6.7, 6.5, 6.7, 6.8, 7.4]
Here we see that the LUT inflation increases from 3.5% to 7.4% as m, the granularity of synthesis, is increased from 4 to 32. The cause of this increase is the less efficient I/O optimization described in Section 2.5.
Conclusion
This paper presented an enhanced module compaction synthesis algorithm targeting FPGAs. We empirically demonstrated that our datapath-oriented synthesis is nearly as efficient as regular flat synthesis. In terms of LUT count, our algorithm produces circuits with, on average, only 3% to 4% LUT count inflation and no increase in register count. We also measured the regularity of the fifteen benchmark circuits. We found that there is a high degree of regularity in these synthesized benchmarks, with 48% of two-terminal connections that can be grouped into 4-bit wide busses and 35% of two-terminal connections coming from highly regular control signals with at least 4-bit fan-out.
