sidebar briefly reviews earlier data path synthesis methodologies.) Lacking synthesis tools for data paths, microprocessor designers often design the entire data path manually. 1 Consequently, data path circuit design takes considerable time and effort in the design cycle of a high-performance microprocessor. The increasing size and complexity of VLSI processors make automated synthesis of data path logic crucial to meeting design productivity goals.
We have developed a new methodology for fast, efficient synthesis of data path circuits. Our methodology automates various steps in the design process yet allows the designer to control the overall process. The methodology has several features unique to data path synthesis. First, it includes algorithms that extract and preserve regularity during the synthesis process, 2 unlike existing circuit synthesis techniques, which largely ignore data path regularity.
Second, our methodology efficiently incorporates floorplanning information for data path blocks and uses the resulting accurate interconnect-parasitic information to drive the gate-sizing algorithms for data path circuits. This feature is significant: Unlike random logic, data path circuits can have large interconnect parasitics because some signals cross buses up to 64 bits wide. Incorporating the layout parasitics during circuit synthesis drastically reduces the time spent alternating between the circuit and layout design steps, an extremely time-consuming task in manual design.
Finally, our synthesis methodology allows the use of a wide range of circuit technologies, such as the static CMOS, domino, pass-gate, and tristate-gate technologies. This makes our approach very suitable for high-performance processor designs, which often require the use of these technologies to achieve performance goals.
Proposed methodology
In manual data path design, the designer preserves regularity to facilitate incremental changes and improve the productivity of layout steps. Typically, the designer
• analyzes the HDL description to understand the data flow and regularity of the data path and then partitions the description into regular subblocks;
• often replaces the arithmetic operators, such as incrementers and adders, with their logic descriptions at the Boolean level;
• draws a hierarchical schematic in terms of the regular subblocks, using a schematic editor;
• maps the logic in the subblocks to appropriate cells from a data path cell library;
• builds a floorplan and estimates the interconnect resistance and capacitance values; and
• picks the appropriate sizes of the selected cells to optimize area, timing, and regularity, while accounting for the interconnect parasitics estimated from the floorplan.
Our synthesis methodology closely mimics the steps of manual design. Figure 1 diagrams the flow of our methodology. Each step speeds up the design process, but the designer can interact at each step and control the overall process. The input to our data path synthesis methodology is a behavioral or structural HDL description of a function block. The output is a synthesized data path circuit with a visual schematic and an associated floorplan for subsequent detailed layout. Using our methodology, designers iterate the following steps to meet design objectives: 
Previous work
Circuit synthesis for random control logic has been studied extensively. 1,2 Control circuit synthesis usually consists of two steps: technology-independent logic minimization followed by technology mapping to a cell library. A typical textbook approach to logic synthesis and technology mapping could destroy the data path's regularity and make it difficult for the designer to interact with the synthesis flow.
Some systems have attempted to exploit regularity in constructing macroblocks (adders and shifters, for example) to be used in high-level or RTL synthesis. For instance, the Synopsys Module Compiler 3 lets a designer construct and import a library of predesigned data path blocks (such as adders, multipliers, comparators, and shifters) into a high-level synthesis flow. High-level data path synthesis involves transforming the data path from behavioral to structural level. Various steps in high-level data path synthesis, such as scheduling and resource allocation, have been widely studied. 4 Although the Module Compiler neatly partitions the synthesis problem into macroblock design and high-level synthesis, this approach is inherently coarse-grained, using larger data path blocks instead of library cells. Therefore, depending on how closely the macroblock library matches the synthesized circuit, such solutions generally are worse than manual design in terms of area and performance.
On the other extreme, there are systems for layout design of data path circuits. For example, Arcadia Design Systems' Mustang and Sycon Designs' Tempest place and route a mapped and sized data path circuit schematic. However, we know of no system for synthesizing data path circuits from their structural hardware description language (HDL) descriptions that preserves data path regularity, adheres to a circuit floorplan, and generates circuits with near-manual quality.
hierarchy, which in turn generates a schematic view of the data path. The data path hierarchy also generates a floorplan view of the data path. Figure 2 shows an HDL description that could serve as input to our synthesis system; Figure 3 shows its logic diagram. We use this example in the following detailed description of the methodology.
Arithmetic operator expansion
To take advantage of the fine granularity of operations in data path circuits, our methodology expands arithmetic operations in terms of their logical equivalents. We refer to arithmetic operators as data path macros, or simply macros. If our methodology used predesigned circuits for data path macros (as does Module Compiler), we might end up with a circuit of inferior area and performance. Therefore, we replace the macros with equivalent implementations using Boolean operators. These implementations let the designer map logic from inside and outside the macro to the same data path library cell to optimize area and performance.
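To make the idea of macro expansion concrete, the following minimal sketch rewrites an n-bit incrementer as a list of NOT, XOR, and AND operators. The ripple-carry topology, the function names `expand_incrementer` and `evaluate`, the signal names, and the tuple-based gate format are all assumptions chosen for illustration; they are not taken from the methodology's macro library.

```python
def expand_incrementer(width):
    """Return Boolean gates implementing out = in + 1 (ripple-carry topology).

    Each gate is a tuple (output, operator, inputs). Signal names such as
    "in[0]" and "c[2]" are hypothetical.
    """
    gates = []
    carry = None  # the carry into bit 0 is the constant 1
    for i in range(width):
        a = f"in[{i}]"
        if carry is None:
            gates.append((f"out[{i}]", "NOT", (a,)))  # a XOR 1 == NOT a
            carry = a                                  # a AND 1 == a
        else:
            gates.append((f"out[{i}]", "XOR", (a, carry)))
            if i < width - 1:
                nxt = f"c[{i + 1}]"
                gates.append((nxt, "AND", (a, carry)))
                carry = nxt
    return gates


def evaluate(gates, value, width):
    """Simulate the gate list on an integer input and return the integer output."""
    sig = {f"in[{i}]": (value >> i) & 1 for i in range(width)}
    ops = {
        "NOT": lambda x: 1 - x,
        "XOR": lambda x, y: x ^ y,
        "AND": lambda x, y: x & y,
    }
    for out, op, ins in gates:  # gates are already in topological order
        sig[out] = ops[op](*(sig[s] for s in ins))
    return sum(sig[f"out[{i}]"] << i for i in range(width))


gates = expand_incrementer(4)
print(evaluate(gates, 0b0111, 4))  # 7 + 1
```

Once the macro is flattened this way, the XOR of each bit is an ordinary Boolean operator that a mapper can combine with logic outside the incrementer.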
As Figure 1 shows, a library of commonly used data path macro topologies is available to the designer. A macro topology is the macro's implementation using Boolean operators. Designers can either select topologies from the library or provide their own topologies. They can change the topology at a later stage to improve circuit quality; however, judicious selection of macro topologies at this stage sig- 
Functional-regularity extraction
The most important feature of our data path synthesis approach is that it extracts and preserves regularity throughout the design process. We classify data path regularity as functional or structural. Two blocks are functionally regular if they contain identical logic, regardless of their connectivity to other logic in the data path. For example, the data path in Figure 3 has 16 functionally regular blocks, each composed of a 4-to-1 multiplexer, a 2-to-1 mux, and a latch. We restrict our definition of functionally regular blocks to those with identical HDL descriptions because checking for functional equivalence of two logic blocks is computationally expensive. We found that functionally regular data path blocks are typically defined in the same way in the HDL description, justifying our assumption. On the other hand, two blocks are structurally regular if they are controlled by the same control signals, or if they are connected to the same data bus, regardless of their internal logic. We extract data path regularity at the functional or Boolean level in terms of master cells, which we call templates. A template is a data path subblock with multiple instances. We have developed efficient algorithms that generate a complete set of templates of a data path from its structural HDL description and then generate a "cover" of the data path using instances of a small set of templates. A cover is a set of template instances such that every logic block is included in one and only one template instance. 2
Figure 5 illustrates a cover of InstrPtr using instances of nine templates. Template 1 has the most instances (16), and templates 8 and 9 (three- and two-input AND gates) have just one instance each. The designer can control the template selection process by directing the algorithm to select large templates with fewer instances or smaller templates with more instances. For example, a template containing just the XOR logic has 16 instances, whereas larger templates, such as templates 2, 3, 4, and 5, have more logic but only four instances each. The designer can provide one or more templates, and the algorithm generates the rest; this feature is useful for incremental updating of the data path circuit.
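The notions of template and cover can be illustrated with a small sketch. It is far simpler than the template-generation algorithms referred to above: it only groups blocks whose descriptions are identical and checks the one-and-only-one property of a cover. The function names and the use of a tuple of cell types as a block "description" are assumptions made for this example.

```python
from collections import defaultdict


def extract_templates(blocks):
    """Group blocks with identical descriptions into templates.

    `blocks` maps a block name to a hashable description (here, a tuple of
    logic element names). Returns {description: [block names]}.
    """
    templates = defaultdict(list)
    for name, description in blocks.items():
        templates[description].append(name)
    return dict(templates)


def is_cover(blocks, instances):
    """True if every block is included in one and only one template instance.

    `instances` is a list of sets of block names, one set per instance.
    """
    seen = [name for inst in instances for name in inst]
    return len(seen) == len(set(seen)) and set(seen) == set(blocks)


# Sixteen identical bit slices (4-to-1 mux, 2-to-1 mux, latch) plus one AND gate.
blocks = {f"bit{i}": ("MUX4", "MUX2", "LATCH") for i in range(16)}
blocks["and3"] = ("AND3",)

templates = extract_templates(blocks)
instances = [{name} for names in templates.values() for name in names]
print(len(templates), is_cover(blocks, instances))
```

Comparing descriptions rather than Boolean functions mirrors the restriction stated above: two blocks count as functionally regular only if they are described identically.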
Structural-regularity extraction
Instances from templates 2 through 5 generate the 16 bits of the SeqAddr bus in Figure 5. Although these template instances are not functionally regular, they feed the same data path bus. Therefore, by our definition, these 16 instances are structurally regular blocks. We call a group of structurally regular blocks a data path vector, or simply a vector. A vector has important layout implications because all instances grouped in a vector are placed as a block in the layout. As a result, the layout maintains data path regularity.
We use a set of heuristics based on signal orientation (control versus data), bus naming conventions, and circuit connectivity to identify structural regularity. This step's output is a set of vectors that define a schematic and a floorplan of the data path. Figure 6 shows the nine templates in Figure 5 grouped into five vectors using our concept of structural regularity. The two forms of regularity improve design productivity at different levels. Identification of functional regularity reduces the effort of synthesizing the templates first at circuit level in terms of library cells and then at layout level in terms of transistors. Identification of structural regularity aids in the placement and routing of the entire data path circuit.
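As a rough illustration of one such heuristic, the sketch below groups template instances into vectors purely by bus naming convention: instances whose outputs drive bits of the same bus fall into the same vector. The `name[bit]` signal syntax, the function name `group_into_vectors`, and the instance names are assumptions; the signal-orientation and connectivity heuristics are not modeled.

```python
import re
from collections import defaultdict

# Assumed naming convention: a bus bit is written as "BusName[index]".
BUS_BIT = re.compile(r"^(?P<bus>\w+)\[(?P<bit>\d+)\]$")


def group_into_vectors(instance_outputs):
    """Group instances by the bus their output drives.

    `instance_outputs` maps an instance name to its output signal name.
    An instance driving a scalar signal forms a vector keyed by that signal.
    Returns {bus name: [instance names ordered by bit index]}.
    """
    by_bus = defaultdict(list)
    for inst, signal in instance_outputs.items():
        m = BUS_BIT.match(signal)
        if m:
            by_bus[m.group("bus")].append((int(m.group("bit")), inst))
        else:
            by_bus[signal].append((0, inst))
    return {bus: [inst for _, inst in sorted(members)]
            for bus, members in by_bus.items()}


outputs = {
    "t3_a": "SeqAddr[1]",  # instances of different templates...
    "t2_a": "SeqAddr[0]",  # ...that feed the same bus
    "t4_a": "SeqAddr[2]",
    "and3": "carry_hi",    # scalar signal, hypothetical name
}
print(group_into_vectors(outputs))
```

The three SeqAddr drivers end up in one vector even though they come from different templates, which is the distinction between structural and functional regularity.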
Schematic and floorplan editing
A visual representation of data path circuits facilitates design changes and assists in design implementation consisting of multiple circuit styles. Our methodology uses the two views illustrated by Figure 6. The first view is the schematic, which supports circuit design and performance verification. The schematic illustrates the data path hierarchy and the mapping of logic at the lowest hierarchical level (the template level) to cells from the data path library. The second view is the floorplan, which shows the data path in terms of the vectors formed after regularity extraction. The designer can edit the floorplan before estimating the interconnect parasitics of the signals going from one vector to another. The interconnect resistance and capacitance values are back-annotated to the schematic view and later used to select the appropriate library cell sizes.

November-December 2002
The problem of generating an optimal floorplan using a data path's vectors is equivalent to the problem of achieving an optimal linear arrangement of graph vertices, an NP-complete problem. Our objective is to optimize the overall interconnect length; an alternate objective would be to minimize the interconnect length of critical paths. We use two floorplanning algorithms. If the number of vectors is small (say, within 7 or 8), we use an exhaustive enumeration approach to find the optimal floorplan. Otherwise, we apply a heuristic of successive vector partitioning to a network-flow technique based on the max-flow-min-cut theorem. 3 Because linear floorplanning of data path vectors is more restricted than the traditional two-dimensional floorplanning problem, faster, more efficient techniques might optimize overall interconnect length, critical-path length, or both.
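The exhaustive case can be sketched in a few lines. The cost model here is an assumption made for illustration: every vector occupies one slot in a row, and a net contributes its bit width times the distance between its leftmost and rightmost vectors. The function names and the example nets are hypothetical.

```python
from itertools import permutations


def wire_length(order, nets):
    """Total interconnect length of a linear arrangement.

    `order` is a sequence of vector names, left to right.
    `nets` is a list of (vector names, bit width) pairs.
    """
    slot = {vector: i for i, vector in enumerate(order)}
    total = 0
    for vectors, width in nets:
        positions = [slot[v] for v in vectors]
        total += width * (max(positions) - min(positions))
    return total


def best_linear_floorplan(vectors, nets):
    """Try every ordering; practical only for a handful of vectors."""
    return min(permutations(vectors), key=lambda order: wire_length(order, nets))


vectors = ["A", "B", "C"]
nets = [(("A", "C"), 16), (("C", "B"), 16)]  # two 16-bit buses
order = best_linear_floorplan(vectors, nets)
print(order, wire_length(order, nets))
```

With n vectors the search visits n! orderings (40,320 for n = 8), which is why enumeration is reserved for small cases and a partitioning heuristic takes over beyond that.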
Library cell and cell size selection
The next step is to synthesize the templates using cells from the specified data path library, completing the data path circuit's schematic hierarchy. The data path library is a collection of cells of various circuit technologies, including static CMOS, domino, tristate gate, and pass gate. Each cell realizes a small Boolean function at the circuit level and is available in a wide range of sizes (in other words, a wide range of drive capabilities). Because of the many technology choices, the task of selecting appropriate cells with appropriate sizes requires more design expertise than any other task we've described so far. We break the task into two steps: First, the designer synthesizes each template by selecting cells from the library; second, the designer chooses sizes for the selected cells to meet design objectives. The motivation for the two-step approach is that an expert designer can modify the selected cells before choosing their sizes and thus can control the data path synthesis process. The designer selects cells for a data path template according to the logic it implements and the load it drives. For example, as Figure 7a shows, we realized the XOR logic and the latch in template 2 in Figure 5 using a mux-latch library cell. This cell's function is Q = if EN then (if S0 then D0 else if S1 then D1). The ability to extract regular templates across the incrementer boundary lets us merge the XOR logic and the latch into a single cell (which we could not do in Module Compiler, which has a predesigned incrementer stored in a library), thus justifying the conversion of arithmetic logic to Boolean logic before synthesis. Similarly, Figure 7b shows the circuit for template 1 in Figure 5 using an active-low mux-latch cell and a static AND/OR/INVERT cell with function O = a.b + c.d + e.f + g.h. Figure 7c shows two circuits for template 7 in Figure 5.
The circuit with INV at the output is suited for the two instances of template 7 that drive a large load, and the NAND/NOR circuit works for the remaining instance (we can use the same cells for all the instances to maximize regularity). Thus, a single template in a data path can have multiple circuit realizations to optimize area or performance as opposed to regularity.
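The two cell functions quoted above can be written as small behavioral models. This is only a sketch: the text does not say what the mux-latch does when EN is low or when neither select is high, so the hold behavior below is an assumption, and the function and argument names are hypothetical.

```python
def mux_latch(q_prev, en, s0, d0, s1, d1):
    """Q = if EN then (if S0 then D0 else if S1 then D1).

    Assumption: when EN is low, or when neither select is high, the latch
    holds its previous value q_prev.
    """
    if en:
        if s0:
            return d0
        if s1:
            return d1
    return q_prev


def and_or(a, b, c, d, e, f, g, h):
    """O = a.b + c.d + e.f + g.h, as the function is stated in the text."""
    return (a & b) | (c & d) | (e & f) | (g & h)


# Selecting D0 while enabled, then holding once EN drops.
q = mux_latch(q_prev=0, en=1, s0=1, d0=1, s1=0, d1=0)
q = mux_latch(q_prev=q, en=0, s0=1, d0=0, s1=0, d1=0)
print(q, and_or(1, 1, 0, 0, 0, 0, 0, 0))
```

Modeling the merged cell as a single function reflects why the merge saves a stage: the select logic and the storage element are evaluated together rather than as two chained cells.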
The final step is to choose appropriate sizes of the cells selected for each template. The objective is to optimize overall area, performance, power, and regularity. Circuit area is usually estimated as the sum of the transistor widths of all cells in the circuit. The dynamic power dissipation also depends on the total transistor width of all cells. The objective of the circuit-sizing problem addressed in the literature is to optimize area or power within the performance constraint on the circuit. The performance constraint is typically described as signals arriving at the required time at the primary outputs. Signal arrival time at a cell's output is defined in terms of signal arrival time at the cell's inputs and the resistance-capacitance load at the cell's output. (A more accurate model would consider signal transition times as well.) The RC load at a cell's output is the sum of source and diffusion capacitances of the cell itself, gate capacitances of the cells being driven, and interconnect capacitances (estimated using the data path floorplan).
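The load and arrival-time definitions above can be sketched with a first-order model. The lumped delay of drive resistance times load capacitance is an assumption made here for illustration (the text notes that a more accurate model would also consider transition times), as are the dictionary field names and the numeric values. With resistance in kilohms and capacitance in femtofarads, the product is in picoseconds.

```python
def output_load(cell, fanout_cells, wire_cap):
    """Capacitive load at a cell's output, in fF.

    Sum of the cell's own diffusion capacitance, the gate capacitances of
    the cells it drives, and the interconnect capacitance from the floorplan.
    """
    return cell["c_diff"] + sum(f["c_gate"] for f in fanout_cells) + wire_cap


def arrival_time(input_arrivals, cell, fanout_cells, wire_cap):
    """Latest input arrival plus an assumed lumped R*C delay, in ps."""
    delay = cell["r_drive"] * output_load(cell, fanout_cells, wire_cap)
    return max(input_arrivals) + delay


def total_width(cells):
    """Area estimate: the sum of transistor widths of all cells."""
    return sum(c["width"] for c in cells)


# Hypothetical values: 2 kOhm driver, two 5 fF receivers, 20 fF of wire.
driver = {"r_drive": 2.0, "c_diff": 3.0, "c_gate": 5.0, "width": 4.0}
receiver = {"r_drive": 2.0, "c_diff": 3.0, "c_gate": 5.0, "width": 4.0}

load = output_load(driver, [receiver, receiver], wire_cap=20.0)
print(load, arrival_time([100.0, 120.0], driver, [receiver, receiver], 20.0))
```

In this example the interconnect term is larger than the gate and diffusion terms combined, which is the situation the floorplan-driven parasitic estimates are meant to capture.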
Previous work exists on the problem of sizing a circuit's cells to optimize area, performance, and power, but none exists on optimizing regularity. A circuit optimized for area, performance, and power could end up with different cell sizes for instances of the same template. Optimizing for regularity might lose in terms of area, performance, or power but will gain incremental-update ease and, because a template is laid out only once for all its instances, layout productivity. 
Timing analysis and incremental changes
Designers can analyze the timing of the synthesized data path circuit with static timing-analysis tools such as the Synopsys PathMill. We found timing analysis to be the most time-consuming part of data path synthesis; the process discussed so far took about 10 minutes for a small to medium data path containing 5,000 schematic transistors, whereas timing analysis alone took about an hour. Thus, there is a critical need for faster timing-analysis techniques for data path circuits. Researchers could develop such techniques by exploiting regularity. For example, Yalcin, Hayes, and Sakallah developed an approximate timing-analysis technique based on the observation that the propagation delays through a data path circuit are determined mainly by its set of control inputs. 4 If the data path circuit needs improvement, the designer must analyze the timing reports and make appropriate changes to the circuit hierarchy, cell selection, or floorplan. A major design change, such as an alternate topology for an HDL macro or a new set of templates, requires reiterating the entire synthesis flow. However, the designer can make minor changes to the schematic or floorplan, using the corresponding editors. We classify these incremental changes as cell selection, hierarchical, and floorplan changes.
Cell selection changes. To improve the circuit's performance, the designer might change the cell selected for the templates. For example, if the circuit going from the first set of latches in Figure 5 to the next set is slow, the designer can replace the AND/OR/INVERT gate used in template 1 with a much faster pass-gate mux. If the circuit for the incrementer's higher bits is slow, the designer can replace the NAND gate of template 7 with a domino gate.
Hierarchical changes. The schematic view of a data path circuit consists of a set of interconnected vectors, which recursively consist of smaller vectors or template instances. The designer can change the schematic hierarchy generated during the regularity extraction steps. The schematic editor facilitates hierarchical changes by providing commands, such as a command to merge two blocks and a command to decompose a block. For example, vectors 1 and 2 of the schematic in Figure 6a can be merged to form a single vector. The designer can alter the schematic hierarchy by repeatedly using the merge and decompose commands. Any changes in the schematic hierarchy go from the schematic editor to the floorplan editor, where the floorplan is updated accordingly.
Floorplan changes. The designer can change the floorplan, either to reduce interconnect parasitics on critical signals or to reduce overall interconnect length. For example, in the floorplan in Figure 6b, the designer could move vector 4 to the far left, since three 16-bit input buses are feeding this vector. New estimates of the interconnect parasitics, based on the updated floorplan, are back-annotated in the schematic view.
Experimental results
We used our methodology to synthesize four data path blocks of a recently introduced microprocessor. The four blocks are representative of data path blocks present in high-performance microprocessors. These circuits contain data path macros including a shifter, an adder, and an incrementer. The smallest block has close to 300 logic nodes, and the largest block has slightly more than 1,700. The number of transistors in the synthesized circuits varies from 4,700 to 8,600. Table 1 compares the synthesized data path blocks with the microprocessor's manually designed blocks in terms of area, performance, and regularity. Lacking a circuit layout, we estimated a circuit's area as the sum total of the transistor widths of all library cells used in the circuit. These blocks use a 0.18-micron CMOS process technology. The operating clock period of the manually designed blocks is 1.9 ns, which is the microprocessor's clock period. The performance figures of the synthesized blocks are within 10% of the desired clock period of 1.9 ns. The performance figures of blocks 1 and 2 are closer to the desired clock period than those of blocks 3 and 4 because we performed more design iterations for blocks 1 and 2.
The areas of blocks 3 and 4 are far less than those of the corresponding manual designs; however, we expect further design iterations on these blocks to improve the clock period at the expense of area. The total number of transistors differs greatly between the synthesized and manually designed circuits because the latter used predesigned circuits for incrementers and other macros (similar to the Module Compiler approach). Our methodology flattens these macros and identifies the templates, which are then mapped to the library cells.
The regularity index in Table 1 characterizes each block's regularity and the level of design and layout effort it requires. A circuit's regularity index is the area of all its templates as a percentage of the total circuit area. 2 Here, we assume the area of a template is the sum of the transistor widths of all its cells. A highly regular data path circuit has a small regularity index; a less regular circuit has a high regularity index. The regularity index correlates with a reduction in design effort. A small regularity index implies that a small effort is required for layout of the circuit's templates, assuming that a template is laid out only once for all its instances.
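The definition of the regularity index translates directly into a short calculation. In the sketch below each template is laid out once, so its area counts once in the numerator, while the circuit area counts every instance. The function name and the example areas are hypothetical and do not reproduce any entry of Table 1.

```python
def regularity_index(templates):
    """Area of all templates as a percentage of the total circuit area.

    `templates` is a list of (template_area, instance_count) pairs, where
    template_area is the sum of transistor widths of the template's cells.
    """
    template_area = sum(area for area, _ in templates)
    circuit_area = sum(area * count for area, count in templates)
    return 100.0 * template_area / circuit_area


# One small template used 16 times, one larger template used 4 times,
# and one cell that appears only once.
regular = [(10.0, 16), (40.0, 4), (6.0, 1)]
irregular = [(10.0, 1), (40.0, 1), (6.0, 1)]

print(round(regularity_index(regular), 1), regularity_index(irregular))
```

A circuit in which nothing repeats scores 100, and the index falls as more of the area is covered by repeated instances, matching the statement that a highly regular circuit has a small index.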
As we expected, the synthesized circuits' regularity indexes are smaller than those of the manual circuits because the manual designers had to compromise regularity to meet the tight 1.9-ns performance target. We infer that the regularity index is inversely related to the clock period. The synthesized circuits are within 10% of the manually designed circuits in terms of clock period. It appears that a few more design improvements would bring the synthesized circuits' performance close to that of the manually designed circuits, although some regularity and density would be lost in the process.
• Changing cell selection requires design expertise to analyze the timing report and to choose cells from the best-suited technology. Such expertise could be incorporated in a knowledge-based system that would choose an alternate set of cells. An automated sizing tool that would incrementally choose appropriate cell sizes is also needed.
• We have introduced regularity as a design metric for data path circuit design, but the problem of selecting cell sizes to optimize regularity still must be addressed.
• Finally, we need fast, yet exact, timing-analysis methods for data path circuits. These methods could exploit regularity to reduce the duplication in effort of standard timing-analysis engines.
Our results show the feasibility of our data path synthesis methodology. Bolstered by these additional tools and techniques, a system based on this methodology can completely replace the time-consuming and tedious task of manual data path design.
