In this paper, we present a complete chip design method which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. We develop a timing-driven design flow to exploit the interaction between HDL synthesis and physical design tasks. During each design iteration, we resynthesize soft macros with either a relaxed or a tightened timing constraint which is guided by the post-layout timing information. The goal is to produce area-efficient designs while satisfying the timing constraints. Experiments on a number of industrial designs have demonstrated that by effectively relaxing the timing constraint of the non-critical modules and tightening the timing constraint of the critical modules, a design can achieve 13% to 30% timing improvements with little to no increase in chip area.
Introduction
Over the past decades, academia and industry have invested much effort in physical-design research, including floorplanning, partitioning, placement, and routing. Several excellent reviews of physical design techniques are given in [1, 2, 3]. By integrating various techniques, many design methods and software systems have been developed for chip design. One of the most popular design methods uses schematics as the design entry, followed by floorplanning, placement, and routing to produce the final chip layout. This design method is very effective and efficient for small to medium-scale designs. However, with the advent of deep-submicron technology, more and more devices can be packed into a single, very complex chip. Due to the time-to-market pressure of designing complex chips and the maturity of synthesis tools, more and more integrated-circuit designers use an HDL-based synthesis approach to develop and manage large designs. Furthermore, as device geometries shrink, integrated-circuit designers face a new set of design challenges, especially in the electrical characteristics of circuits. This has led to a new research direction in design automation at the synthesis and physical levels.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 99, New Orleans, Louisiana. ©1999 ACM 1-58113-092-9/99/0006..$5.00
A typical HDL-based design flow involves multi-level design tasks. Over the years, much effort has been invested to improve the quality of design tasks at each design level. Few studies have been conducted to investigate the interaction between different design tasks.
Pedram and Bhat [4] presented technology mapping techniques that consider net lengths for area and delay optimization. Liu et al. [5] presented a resynthesis technique that resynthesizes the most congested region of the chip to reduce routing area. Stenz et al. [6] proposed a timing-driven placement method in interaction with netlist transformations. The netlist transformation procedure is integrated into the placement process so that accurate delay models are available to guide the transformation process. Their results showed that delay reduction is achieved with almost no increase in chip area. Holt and Tyagi [7] proposed an integrated approach that incrementally develops a placement during the logic synthesis process for power minimization.
In this paper, we present a complete chip design method which incorporates a floorplanning-guided soft-macro resynthesis method for area and timing improvement. The main objective is to develop a timing-driven design flow by exploiting the interaction between HDL synthesis and floorplanning design tasks. Experiments on a number of industrial designs have been conducted to demonstrate the effectiveness of the proposed method.
Problem Description
Figure 1(a) shows a typical HDL-based chip design flow. It consists of five steps: (1) HDL synthesis, (2) floorplanning, (3) place and route, (4) back annotation, and (5) post-layout timing analysis. The inputs to the design flow are a mixed RTL and gate-level HDL description in Verilog or VHDL, and a timing constraint. In the first step, a synthesizer converts the HDL design description into a hierarchical gate-level netlist by performing HDL compilation and a series of RTL and logic synthesis tasks. In the second step, a floorplanning procedure is invoked to determine the location of each macro on the layout plane. In the third step, a placement-and-routing procedure is used to perform detailed gate-level placement and routing. In the fourth step, the layout parasitic information is extracted. Finally, a post-layout timing analysis procedure is performed to determine the most critical paths and their delays. If the timing does not satisfy the design requirement, refinement iterations proceed until the timing requirement is satisfied. The refinement procedure can be applied at different design levels. This motivates us to investigate how to develop a complete chip design methodology by integrating multi-level design tasks and exploiting the interaction between them.
In this study, we focus on developing a complete chip design methodology which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. The main objective of this research is to develop a timing-driven design flow by exploiting the interaction between HDL synthesis and physical design tasks, as depicted in Figure 1(a). Consider an example which consists of five macros: two hard macros and three soft macros. Initially, each soft macro is synthesized into a gate-level netlist. After floorplanning, place-and-route, and post-layout timing analysis, there are two possible cases. Figure 1 [...] resynthesize SM2 with a tightened timing constraint, which can produce a timing-violation-free design at the cost of some area overhead, as depicted in Figure 1(e). The goal is to produce the most area-efficient design while satisfying the timing constraints.

Figure 2 depicts the proposed design flow, which consists of seven steps: (1) HDL synthesis, (2) pre-layout timing analysis, (3) soft-macro formation, (4) floorplanning, (5) place and route, (6) post-layout timing analysis, and (7) resynthesis. The input to the design flow is an RTL design description in Verilog. In the first step, an HDL-based synthesizer converts the Verilog design description into a hierarchical gate-level netlist. In the second step, a timing analysis procedure is applied to perform pre-layout timing analysis of the design. A set of critical paths is identified and used to guide the subsequent macro-clustering, floorplanning, and placement-and-routing procedures. In the third step, the system groups soft macros connected to the same clock source into the same cluster. It also groups small subcircuits to form large macros and decomposes extremely large macros into smaller ones. In the fourth step, we use a commercial floorplanner to perform macro floorplanning to determine the locations of the macros. In the fifth step, we use a commercial tool to perform the placement and routing tasks.
In the sixth step, a post-layout timing analysis procedure is invoked to compute the final timing of the design. If there exists a timing violation or there is a chance for area reduction, a soft-macro resynthesis procedure is invoked as the seventh step. The system iterates from step four through the final step until all the timing constraints are satisfied and no more area improvement can be achieved.
The Proposed Method

Overview
In the following sections, we describe the soft-macro formation (step 3) and the soft-macro floorplanning and resynthesis loop (steps 4-7) in detail.
Soft-Macro Formation
There are two main considerations in soft-macro formation. First, in many of today's applications, such as multimedia chips, designs usually have multiple clock sources with different rates. It is beneficial to group soft macros associated with the same clock source into the same cluster. Second, using an HDL-based synthesis method, the synthesized subcircuit of each leaf module is naturally a closely-connected cluster. However, a design may also contain extremely large modules containing tens of thousands of gates. This is undesirable because a large cluster is too rigid for macro placement and may often result in poor placement results. Furthermore, a design may also contain a large number of small subcircuits. This is also undesirable because a large number of macros will increase the computational complexity of the macro-cell placement process.
The soft-macro formation procedure consists of three steps: (1) clock-based clustering, (2) large-macro decomposition, and (3) small-macro clustering. In our approach, we first use a commercial synthesis system to convert a Verilog design description into a hierarchical gate-level netlist. We then construct an HDL-based structural tree to represent the structural hierarchy of the Verilog design description. In an HDL structural tree, the root node represents the top design, and each intermediate node represents a module construct. Each leaf node represents a circuit block generated from a leaf module.
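The structural tree described above can be pictured with a small data structure. The following is only a minimal sketch: the class name `ModuleNode`, its fields, and the example module names are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ModuleNode:
    """One node of an HDL structural tree (illustrative names only)."""
    name: str
    cell_count: int = 0  # only meaningful for leaf circuit blocks
    children: List["ModuleNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaf_blocks(node: ModuleNode) -> Iterator[ModuleNode]:
    """Yield the circuit blocks (leaf nodes) in depth-first order."""
    if node.is_leaf():
        yield node
    else:
        for child in node.children:
            yield from leaf_blocks(child)


# Root = top design, intermediate node = module construct, leaves = circuit blocks.
top = ModuleNode("top", children=[
    ModuleNode("cpu", children=[ModuleNode("alu", 1200), ModuleNode("regfile", 800)]),
    ModuleNode("uart", 300),
])
leaf_names = [n.name for n in leaf_blocks(top)]
total_cells = sum(n.cell_count for n in leaf_blocks(top))
```

Walking the leaves in this way gives the list of candidate soft macros and their sizes, which is the information the decomposition and clustering steps operate on.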
After constructing the HDL structural tree of a design, we first group the macros connected to the same clock source into the same cluster. Then we determine the large-macro candidates which need to be decomposed into smaller ones. The selection of large-macro candidates is based on the size of the macros. We define the threshold value $M_{th}$ of a large-macro candidate as

$$M_{th} = k \times S_{avg}, \qquad S_{avg} = \frac{\#\text{Total cells}}{\#\text{Macros}},$$

where $S_{avg}$ is the average macro size, #Total cells and #Macros are the total number of cells and the number of soft macros in the design, respectively, and $k$ is a user-defined threshold parameter for controlling the size of a large macro. If a macro is larger than $M_{th}$, then it is selected as a large-macro candidate. For each large macro, we use the FM partitioning method [8] to recursively decompose it into smaller clusters. Finally, we use a clustering algorithm [9] to group small macros into large ones based on the size constraint and on the criticality and connectivity between macros. Let $G = \{V, E\}$ be the connected graph where $V$ is the set of macro nodes and $E$ the set of edges. An edge $e_{ij}$ exists if there is at least one signal flow between macros $v_i$ and $v_j$. A weight is associated with each edge indicating the number of connections between the two corresponding macros. The connectivity $Conn_{ij}$, the criticality $Crit_{ij}$, and the closeness $C_{ij}$ of two macros $v_i$ and $v_j$ are defined in terms of the following quantities: $w_i$ denotes the total connection weight of $v_i$, $w_{ij}$ the total connection weight between $v_i$ and $v_j$, $s_i$ the size of $v_i$, $S_{th}$ the upper bound on the size of a cluster set by the user, Crit-Path($v_i \leftrightarrow v_j$) denotes that there is a critical path traveling across $v_i$ and $v_j$, and $\alpha$ and $\beta$ are two coefficients set by the user. In order to eliminate small macros and prevent the formation of large clusters, the user can set the upper bound on the size of a cluster. When the size of a new macro formed by merging two macros is larger than the upper bound, the closeness value between these two macros is 0.
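The size-based rules in this paragraph can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the threshold is the user parameter k times the average macro size, and the weighted-sum form of `closeness` is purely an assumption, since only its inputs and the size-cap rule are stated in the text. All function names are hypothetical.

```python
from typing import Dict, List


def large_macro_candidates(macro_sizes: Dict[str, int], k: float) -> List[str]:
    """Return macros whose size exceeds M_th = k * S_avg (assumed form)."""
    s_avg = sum(macro_sizes.values()) / len(macro_sizes)  # #Total cells / #Macros
    m_th = k * s_avg
    return sorted(name for name, size in macro_sizes.items() if size > m_th)


def closeness(conn: float, crit: float, size_i: int, size_j: int,
              s_th: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Closeness of two macros; 0 when the merged macro would exceed S_th.

    ASSUMPTION: the combination alpha*conn + beta*crit is illustrative only.
    """
    if size_i + size_j > s_th:
        return 0.0  # merging would create an oversized cluster
    return alpha * conn + beta * crit


sizes = {"a": 100, "b": 200, "c": 300, "d": 3400}  # average size is 1000 cells
candidates = large_macro_candidates(sizes, k=2)    # threshold is 2000 cells
blocked = closeness(0.5, 1.0, 100, 200, s_th=250)
allowed = closeness(0.5, 1.0, 100, 200, s_th=400, alpha=2.0, beta=1.0)
```

With these toy sizes only macro `d` exceeds the threshold, and the pair of macros `a` and `b` can be merged only when the cluster size bound is at least 300 cells.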
Soft-Macro Resynthesis in Interaction
After forming soft-macro clusters, a timing-driven floorplanning procedure [10] is invoked to determine the relative location of each macro (hard and soft macros) on the layout plane. We then use a commercial tool to perform placement and routing design tasks.
Subsequently, we back-annotate RC parasitic values of the layout and perform post-layout timing analysis. Finally, we determine whether some soft macros can be resynthesized to achieve timing and/or area improvements.
The key issues for the resynthesis process are twofold. First, how to determine which soft macro should be resynthesized. Second, if a soft macro needs to be resynthesized, to what extent its timing constraint can be relaxed or tightened. Our resynthesis procedure consists of two steps: (1) slack computation and (2) soft-macro resynthesis candidate selection.
In the first step, we start by back-annotating the delay information for each I/O port of the soft macros. The delay information is extracted from a post-layout timing report. We then compute the slack value for each inter-macro signal path. Finally, we assign a slack value to each I/O port of the soft macros. This value is computed using the following formula:

$$Slack(p_i) = \min\{\, Slack(e(p_i, p_j)),\; p_i \in SM_k \text{ and } p_j \in SM_l \,\},$$

where $e(p_i, p_j)$ denotes the interconnection between ports $p_i$ and $p_j$, and $SM_k$ and $SM_l$ denote two soft macros. The slack of an I/O port is defined as the minimum slack value of all the signal paths associated with this I/O port.
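The port-slack definition above amounts to a minimum over all inter-macro paths touching a port. A minimal sketch, assuming the path slacks have already been read from a timing report into tuples (the tuple layout and port names are hypothetical):

```python
from typing import Dict, Iterable, Tuple

# (source port, sink port, path slack in ns); an assumed in-memory format
PathSlack = Tuple[str, str, float]


def port_slacks(paths: Iterable[PathSlack]) -> Dict[str, float]:
    """Assign each I/O port the minimum slack of all paths that touch it."""
    slack: Dict[str, float] = {}
    for src, dst, s in paths:
        for port in (src, dst):
            slack[port] = min(s, slack.get(port, s))
    return slack


paths = [
    ("SM1.out0", "SM2.in0", 0.4),
    ("SM1.out0", "SM3.in1", -0.2),
    ("SM2.out1", "SM3.in1", 1.1),
]
slacks = port_slacks(paths)
```

Here `SM1.out0` and `SM3.in1` both inherit the negative slack of the path that connects them, even though each also lies on a path with positive slack.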
In the second step, we use two cost functions, $NEG(SM_i)$ and $POS(SM_i)$, to determine which soft macro should be resynthesized next so that maximal area and/or timing improvement can be achieved. If a negative slack value is associated with any soft macro, a timing violation has occurred. In this case, we select the soft macro with the highest $NEG(SM_i)$ as the candidate for resynthesis because it should be the most critical one. If all paths satisfy the timing constraints, we select the soft macro with the highest $POS(SM_i)$ as the candidate for resynthesis because resynthesizing it with relaxed timing constraints should result in a maximal area reduction. After selecting a candidate, we use a commercial synthesis tool to resynthesize the soft macro by specifying the timing constraints of its I/O ports according to their slack values. Subsequently, we invoke a floorplanning procedure to adjust the chip floorplan while preserving the original relative locations of all soft and hard macros.
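The selection rule can be sketched as below. The exact definitions of the two cost functions are not recoverable from this text, so the sketch assumes, purely for illustration, that NEG sums the magnitudes of a macro's negative port slacks and POS sums its positive port slacks. The function name and return format are hypothetical.

```python
from typing import Dict, List, Tuple


def select_candidate(port_slack_by_macro: Dict[str, List[float]]) -> Tuple[str, str]:
    """Pick the next soft macro to resynthesize and the constraint direction.

    ASSUMPTION: NEG = sum of |negative port slacks|, POS = sum of positive
    port slacks. These stand in for the paper's cost functions.
    """
    neg = {m: sum(-s for s in sl if s < 0) for m, sl in port_slack_by_macro.items()}
    pos = {m: sum(s for s in sl if s > 0) for m, sl in port_slack_by_macro.items()}
    if any(v > 0 for v in neg.values()):
        # A timing violation exists: tighten the most critical macro.
        return max(neg, key=neg.get), "tighten"
    # Timing is met everywhere: relax the macro with the most slack to save area.
    return max(pos, key=pos.get), "relax"


violating = {"SM1": [0.4, -0.2], "SM2": [-0.5, -0.1], "SM3": [1.0]}
clean = {"SM1": [0.4, 0.2], "SM2": [1.5], "SM3": [0.1]}
choice_violating = select_candidate(violating)
choice_clean = select_candidate(clean)
```

In the first case SM2 carries the largest total violation and is tightened; in the second case no port has negative slack, so SM2, which has the most positive slack, is relaxed.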
The proposed timing-driven soft-macro resynthesis method (TDSR) is described below:
Procedure-TDSR(D_hdl, T_const)
    ...
        Postlayout-timing-analysis(D_gate);
        Slack-computation(S_tree, D_gate);
    end-of-while
end-of-procedure
The inputs to the system include an HDL design description ($D_{hdl}$) and timing constraints ($T_{const}$). Let $D_{gate}$ and $S_{tree}$ denote the gate-level design and the structural tree, respectively. Initially, the system performs HDL synthesis, pre-layout timing analysis, structural-tree construction, soft-macro formation, floorplanning, place-and-route, and post-layout timing analysis to produce an initial chip layout. During each resynthesis iteration, the system first computes the slack value for each soft-macro I/O port, and then computes the cost functions $NEG(SM_i)$ and $POS(SM_i)$ for each soft macro. Next, the system selects the soft-macro candidate which contributes the most to timing or area improvement. After resynthesizing the soft macro, the system performs floorplanning adjustment, followed by RC parasitic extraction and post-layout timing analysis. Finally, if there is an improvement, the resynthesis iteration continues. Otherwise, the system stops and reports the final chip layout.
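The control structure of the iteration described above can be sketched as a loop that stops as soon as a pass brings no improvement. This is a simplified sketch under stated assumptions: the whole resynthesis, floorplan adjustment, ECO place-and-route, and timing analysis pass is abstracted into one hypothetical callable, and "improvement" is assumed to mean lower worst delay while timing is violated, or lower area once timing is met.

```python
from typing import Callable, List, Tuple

Metrics = Tuple[float, float]  # (worst path delay in ns, chip area)


def improves(new: Metrics, old: Metrics, t_const: float) -> bool:
    """ASSUMED improvement test: fix timing first, then reduce area."""
    if old[0] > t_const:                              # timing currently violated
        return new[0] < old[0]
    return new[0] <= t_const and new[1] < old[1]      # keep timing, shrink area


def tdsr_loop(initial: Metrics, one_pass: Callable[[Metrics], Metrics],
              t_const: float, max_iters: int = 10) -> List[Metrics]:
    """Iterate resynthesis passes until a pass yields no improvement."""
    history = [initial]
    best = initial
    for _ in range(max_iters):
        candidate = one_pass(best)   # stand-in for resynthesis + ECO P&R + timing
        if not improves(candidate, best, t_const):
            break
        best = candidate
        history.append(best)
    return history


# Scripted pass results stand in for real tool runs.
scripted = iter([(9.0, 101.0), (7.9, 102.0), (7.9, 100.0), (7.9, 100.5)])
history = tdsr_loop((10.0, 100.0), lambda _m: next(scripted), t_const=8.0)
```

In the scripted run, two passes remove the timing violation, a third recovers area while still meeting the constraint, and the fourth brings no gain, so the loop stops and keeps the third result.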
Experiments
We have tested the proposed method on three industrial designs. All three designs are described as hierarchical, mixed RTL and gate-level netlists in Verilog. Table 1 shows the characteristics of the designs, in which Nets, IOs, HMs, SMs(Before/After), SMs(Cells/Gates), and Total Gates denote the number of nets, IO pins, hard macros, soft macros before and after performing clustering, cells/gates of soft macros, and total gate count of the design. Of the three designs, ind1 contains three clock sources, and ind2 and ind3 contain two clock sources each. In all experiments, we set the threshold values $k = 2$ for large-macro decomposition and $S_{th} = 0.1\,S_{avg}$ for small-macro clustering.

Figure 3 shows the experimental flow. In the first step, we used Synopsys' Design Compiler [11] to convert the input Verilog design description into a hierarchical gate-level netlist and then performed timing analysis to report the 200 most critical paths. In the second step, we used our proposed soft-macro formation method as a pre-processing step to generate soft-macro clusters for the floorplanner. In the third step, we used Cadence's Silicon Ensemble [12] to perform chip floorplanning and determine the locations of the hard macros. In the fourth step, we used the performance-driven soft-macro placement algorithm [10] to perform soft-macro placement. In the fifth step, we used AVANT!'s Aquarious XO [13] to perform detailed placement and routing. In the sixth step, AVANT!'s STAR-RC [14] was used to extract layout parasitic information. In the seventh step, we used AVANT!'s STAR-DC tool [15] to perform delay calculations and generate an SDF file. In the eighth step, we used Synopsys' Design Time [11] to perform post-layout timing analysis. Finally, we applied the proposed soft-macro resynthesis iteration to incrementally improve the area and timing of the layout. During each resynthesis iteration, we first used Synopsys' Design Compiler to perform logic resynthesis by supplying a relaxed or a tightened timing constraint to the soft macros. We then applied an ECO function supported by AVANT!'s Aquarious XO to perform placement and routing. For all experiments, we provided the floorplanner (the third step) and the placer-and-router (the fifth step) with the 200 most critical paths (generated in the first step) as the timing constraints.

Figure 4: The initial critical path of ind2 using the 0.5µm library.
We have conducted two sets of experiments. In the first experiment, we used the TSMC 0.5µm cell library [16]. In the second experiment, we used the TSMC 0.25µm cell library [17]. Note that ind1 and ind3 contain a PLL module. Unfortunately, a 0.25µm-based PLL module was not available, so we could not perform the experiment on these two designs using the TSMC 0.25µm cell library. Hence, in this paper, we only report the result of ind2 using the TSMC 0.25µm cell library.

Table 2 shows the area-delay comparisons of ind1 using the 0.5µm library, in which IO denotes the number of IO pins, #HM the number of hard macros, #SM(B/A) the number of soft macros before and after applying the soft-macro clustering, Gates_M the total gate count of soft macros, Gate_Tot the total gate count, Area the chip area, Delay the worst path delay, and the two T columns the resynthesis and ECO run times in hours. The results show that by resynthesizing some soft macros, the timing was improved by up to 20% with almost no area penalty. Table 3 shows the area-delay comparisons of ind2 using the 0.5µm library. We obtained a result similar to that of ind1: the timing was improved by up to 13% with almost no area penalty. Table 4 shows the area-delay comparisons of ind3 using the 0.5µm library. The results show that the timing was improved by up to 11% with almost no area penalty. Figure 4 shows the critical path before resynthesis (Iteration 1 in Table 3). After two resynthesis iterations, the new critical path is shown in Figure 5 (Iteration 3 in Table 3).

Figure 5: The new critical path of ind2 using the 0.5µm library after two resynthesis iterations.

Table 5 shows the area-delay comparisons of ind2 using the 0.25µm library. The results show that the timing was improved by up to 30% with an 11% area penalty. We have also compared the average delays contributed by gates and interconnects using the 0.5µm and 0.25µm technologies. Table 6 shows the average gate and interconnect delay comparisons of the most critical paths of ind2. The results show that using the 0.5µm technology, the average gate intrinsic delay and interconnect delay are 0.171ns and 0.277ns, respectively. Using the 0.25µm technology, the average gate intrinsic delay and interconnect delay are 0.107ns and 0.325ns, respectively. From the results, we observed that the average interconnect-delay to gate-delay ratios of the 0.5µm and 0.25µm technologies are 1.62 and 3.04, respectively. This indicates that interconnect delays play an important role in deep-submicron technologies.
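As a quick check, the two interconnect-to-gate delay ratios follow directly from the average delays quoted from Table 6, reading the 0.25µm values as 0.107 ns and 0.325 ns:

```latex
\[
\frac{0.277\ \text{ns}}{0.171\ \text{ns}} \approx 1.62,
\qquad
\frac{0.325\ \text{ns}}{0.107\ \text{ns}} \approx 3.04
\]
```

The ratio nearly doubles from one technology to the next because the average gate delay falls while the average interconnect delay rises.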
From the experiments, the following observations can be made. When using the 0.5µm library for designs ind1, ind2, and ind3, our proposed method can improve timing by 11% to 20% with almost no area penalty. This demonstrates that by effectively relaxing the timing constraints of non-critical modules and tightening the timing constraints of the critical modules, we can achieve significant timing improvements with little to no increase in chip area. When using the 0.25µm library, our method can improve timing by 8%, 22%, and 30% with 2%, 5%, and 11% increase in chip area, respectively. We found that the 0.25µm library supports a large set of components with a wide range of drive capabilities. This feature provides more design alternatives during the synthesis process. The experiments were conducted on an HP-C180 workstation with 750Mb main memory. Tables 3-6 show the run times for the resynthesis and P&R ECO iterations. For example, in the first iteration of the ind1 design (Table 2), it took an average of 6 hours and 4 hours to run the synthesis and P&R ECO tasks, respectively.
Conclusions
In this paper, we have presented a complete chip design method which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. We have conducted a series of experiments on three industrial designs. The results have demonstrated that by effectively relaxing the timing constraints of non-critical modules and tightening the timing constraints of critical modules we can achieve significant timing improvements with very little to no area penalty.
In this study, we have shown that an integrated synthesis, floorplanning, placement, and routing design flow allows designers to perform design resynthesis and ECO-based placement-and-routing guided by accurate layout timing information. This method is very effective for timing improvement with very little to no increase in chip area. One drawback of such a design flow is that it is extremely time-consuming. It takes close to one full day to run one resynthesis iteration. Shortening the iteration time will be a key factor in improving the design exploration process. One possible approach is to move the iteration loop to a higher level, such as the floorplanning level. In order to make this happen, a more accurate delay and area estimation method is required. Another important issue is how to determine the initial timing budget for each module before synthesis. Good initial time-budgeting should reduce the number of resynthesis iterations and thus speed up the entire design process.
