In this paper, we present a complete chip design method which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. We develop a timing-driven design flow to exploit the interaction between HDL synthesis and physical design tasks. During each design iteration, we resynthesize soft macros with either a relaxed or a tightened timing constraint, guided by the post-layout timing information. The goal is to produce area-efficient designs while satisfying the timing constraints. Experiments on a number of industrial designs have demonstrated that by effectively relaxing the timing constraint of the non-critical modules and tightening the timing constraint of the critical modules, a design can achieve 13% to 30% timing improvements with little to no increase in chip area.
Introduction
Over the past decades, academia and industry have invested much effort in physical-design research, including floorplanning, partitioning, placement, and routing. Several excellent reviews of physical design techniques are given in [1], [2], [3]. By integrating various techniques, many design methods and software systems have been developed for chip design. One of the most popular design methods uses schematics as the design entry, followed by floorplanning, placement, and routing to produce the final chip layout. This design method is very effective and efficient on small to medium-scale designs. However, with the advent of deep-submicron technology, more and more devices can be packed into a single, very complex chip. Due to the time-to-market pressure of designing complex chips and the maturity of synthesis tools, more and more integrated-circuit designers use an HDL-based synthesis approach to develop and manage large designs. Furthermore, as device geometries shrink, integrated-circuit designers face a new set of design challenges, especially in the electrical characteristics of circuits. This has led to a new research direction in design automation at the synthesis and physical levels.
A typical HDL-based design flow involves multi-level design tasks. Over the years, much effort has been invested to improve the quality of the design tasks at each design level, but few studies have investigated the interaction between different design tasks. Pedram and Bhat [4] presented technology mapping techniques that consider net lengths for area and delay optimization. Liu et al. [5] presented a technique that resynthesizes the most congested region of the chip to reduce routing area. Stenz et al. [6] proposed a timing-driven placement method in interaction with netlist transformations. The netlist transformation procedure is integrated into the placement process so that accurate delay models are available to guide the transformation process. Their results showed that delay reduction is achieved with almost no increase in chip area. Holt and Tyagi [7] proposed an integrated approach that incrementally develops a placement during the logic synthesis process for power minimization.
In this paper, we present a complete chip design method which incorporates a floorplanning-guided soft-macro resynthesis method for area and timing improvement. The main objective is to develop a timing-driven design flow by exploiting the interaction between HDL synthesis and floorplanning design tasks. Experiments on a number of industrial designs have been conducted to demonstrate the effectiveness of the proposed method.

Permission to make digital/hardcopy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 99, New Orleans, Louisiana (c) 1999 ACM 1-58113-109-7/99/06..$5.00

Problem Description
Figure 1(a) shows a typical HDL-based chip design flow. It consists of five steps: (1) HDL synthesis, (2) floorplanning, (3) place and route, (4) back annotation, and (5) post-layout timing analysis. The inputs to the design flow are a mixed RTL and gate-level HDL description in Verilog or VHDL, and a timing constraint. In the first step, a synthesizer converts the HDL design description into a hierarchical gate-level netlist by performing HDL compilation and a series of RTL and logic synthesis tasks. In the second step, a floorplanning procedure is invoked to determine the location of each macro on the layout plane. In the third step, a placement-and-routing procedure is used to perform detailed gate-level placement and routing. In the fourth step, the layout parasitic information is extracted. Finally, a post-layout timing analysis procedure is performed to determine the most critical paths and their delays. If the timing does not satisfy the design requirement, refinement iterations proceed until the timing requirement is satisfied. The refinement procedure can be applied at different design levels.

This motivates us to investigate how to develop a complete chip design methodology by integrating multi-level design tasks and exploiting the interaction between them. In this study, we focus on a methodology which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. The main objective of this research is to develop a timing-driven design flow by exploiting the interaction between HDL synthesis and physical design tasks, as depicted in Figure 1(a).

Consider an example which consists of five macros: two hard macros and three soft macros. Initially, each soft macro is synthesized into a gate-level netlist. After floorplanning, place-and-route, and post-layout timing analysis, there are two possible cases. Figure 1(b) shows the first case, in which the design satisfies the timing constraint and the critical path occurs between soft macros SM1 and SM2. Suppose that the slack between SM3 and {SM1, SM2} is larger than zero. This indicates that we may have imposed an excessively tight timing constraint on SM3 during the synthesis process. In this case, we can resynthesize SM3 with a relaxed timing constraint, which usually results in a more area-efficient design, as depicted in Figure 1(c). Figure 1(d) shows the second case, in which a timing violation occurs between SM1 and SM2. This indicates that we may have provided an under-estimated timing constraint to either SM1 or SM2 during the synthesis process. In this case, we may have to resynthesize SM2 with a tightened timing constraint, which can produce a timing-violation-free design at the cost of some area overhead, as depicted in Figure 1(e). The goal is to produce the most area-efficient design while satisfying the timing constraints.
The Proposed Method
Overview
Figure 2 depicts the proposed design flow, which consists of seven steps: (1) HDL synthesis, (2) pre-layout timing analysis, (3) soft-macro formation, (4) floorplanning, (5) place and route, (6) post-layout timing analysis, and (7) resynthesis. The input to the design flow is an RTL design description in Verilog. In the first step, an HDL-based synthesizer converts the Verilog design description into a hierarchical gate-level netlist. In the second step, a timing analysis procedure is applied to perform pre-layout timing analysis of the design. A set of critical paths is identified and used to guide the subsequent macro-clustering, floorplanning, and placement-and-routing procedures. In the third step, the system groups soft macros connected to the same clock source into the same cluster. It also groups small subcircuits to form large macros and decomposes extremely large macros into smaller ones. In the fourth step, we use a commercial floorplanner to perform macro floorplanning and determine the locations of the macros. In the fifth step, we use a commercial tool to perform the placement and routing tasks. In the sixth step, a post-layout timing analysis procedure is invoked to compute the final timing of the design. If there exists a timing violation or there is a chance for area reduction, a soft-macro resynthesis procedure is invoked. The system iterates steps four through seven until all the timing constraints are satisfied and no more area improvement can be achieved.
In the following sections, we describe the soft-macro formation (step 3) and the soft-macro floorplanning and resynthesis loop (steps 4-7) in detail.
Soft-Macro Formation
There are two main considerations in soft-macro formation. First, in many of today's applications, such as multimedia chips, designs usually have multiple clock sources with different rates. It is beneficial to group soft macros associated with the same clock source into the same cluster. Second, with an HDL-based synthesis method, the synthesized subcircuit of each leaf module is naturally a closely connected cluster. However, a design may also contain extremely large modules with tens of thousands of gates. This is undesirable because a large cluster is too rigid for macro placement and may often lead to poor placement results. Furthermore, a design may also contain a large number of small subcircuits. This is also undesirable because a large number of macros increases the computational complexity of the macro-cell placement process.
The soft-macro formation procedure consists of three steps: (1) clock-based clustering, (2) large-macro decomposition, and (3) small-macro clustering. In our approach, we first use a commercial synthesis system to convert a Verilog design description into a hierarchical gate-level netlist. We then construct an HDL-based structural tree to represent the structural hierarchy of the Verilog design description. In an HDL structural tree, the root node represents the top design, and each intermediate node represents a module construct. Each leaf node represents a circuit block generated from a leaf module.
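As an illustration of the structural tree and of the clock-based clustering step, the following Python sketch models root, intermediate, and leaf nodes and groups leaf blocks by clock source. The class name `TreeNode`, its fields, the helper `group_by_clock`, and the example module names are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TreeNode:
    """One node of a hypothetical HDL structural tree."""
    name: str
    cells: int = 0                   # gate count; meaningful for leaf blocks only
    clock: Optional[str] = None      # clock source driving a leaf block
    children: List["TreeNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List["TreeNode"]:
        """Return all leaf circuit blocks at or below this node."""
        if self.is_leaf():
            return [self]
        found: List["TreeNode"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def group_by_clock(root: TreeNode) -> Dict[Optional[str], List[str]]:
    """Clock-based clustering: map each clock source to its leaf blocks."""
    clusters: Dict[Optional[str], List[str]] = {}
    for leaf in root.leaves():
        clusters.setdefault(leaf.clock, []).append(leaf.name)
    return clusters


# Made-up design: a top node, one intermediate module, three leaf blocks.
top = TreeNode("top", children=[
    TreeNode("core", children=[
        TreeNode("alu", cells=1200, clock="clk_a"),
        TreeNode("regs", cells=800, clock="clk_a"),
    ]),
    TreeNode("uart", cells=300, clock="clk_b"),
])
clock_clusters = group_by_clock(top)
```

In this toy design the two blocks under `core` share a clock and land in one cluster, while `uart` forms its own.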
After constructing the HDL structural tree of a design, we first group the macros connected to the same clock source into the same cluster. Then we determine the large-macro candidates which need to be decomposed into smaller ones. The selection of large-macro candidates is based on the size of the macros. We define the threshold value $M_{th}$ of a large-macro candidate as:
$$M_{th} = k \times S_{avg} \qquad (1)$$
where $S_{avg} = \mathit{Total\_cells} / \mathit{Macros}$ is the average macro size, $\mathit{Total\_cells}$ and $\mathit{Macros}$ are the total number of cells and the number of soft macros in the design, respectively, and $k$ is a user-defined threshold parameter for controlling the size of a large macro. If a macro is larger than $M_{th}$, it is selected as a large-macro candidate. Each large-macro candidate is recursively decomposed into smaller clusters using the FM partitioning method [8].
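The threshold test of Equation (1) can be sketched in a few lines of Python. The function name `large_macro_candidates` and the example macro sizes are hypothetical; the recursive FM decomposition of the selected candidates is not shown.

```python
from typing import Dict, List


def large_macro_candidates(macro_sizes: Dict[str, int], k: float) -> List[str]:
    """Return the macros whose cell count exceeds M_th = k * S_avg."""
    if not macro_sizes:
        return []
    # S_avg = Total_cells / Macros
    s_avg = sum(macro_sizes.values()) / len(macro_sizes)
    m_th = k * s_avg
    return sorted(name for name, size in macro_sizes.items() if size > m_th)


# Made-up sizes: S_avg = 2000 / 4 = 500, so M_th = 1000 for k = 2.
sizes = {"sm1": 100, "sm2": 200, "sm3": 300, "sm4": 1400}
candidates = large_macro_candidates(sizes, k=2)
```

With these numbers only `sm4` exceeds the threshold; raising `k` to 3 moves the threshold to 1500 and leaves no candidates.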
Finally, we use a clustering algorithm [9] to group small macros into larger ones based on a size constraint and on the criticality and connectivity between macros. Let $G = \{V, E\}$ be the connected graph, where $V$ is the set of macro nodes and $E$ the set of edges. An edge $e_{ij}$ exists if there is at least one signal flow between macros $v_i$ and $v_j$. A weight is associated with each edge, indicating the number of connections between the two corresponding macros. We define the connectivity $Conn_{ij}$, the criticality $Crit_{ij}$, and the closeness $C_{ij}$ of two macros $v_i$ and $v_j$; the definitions involve two coefficients set by the user. In order to eliminate small macros and prevent the formation of overly large clusters, the user can set an upper bound on the size of a cluster. When the size of a new macro formed by merging two macros is larger than the upper bound, the closeness value between these two macros is 0.
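The effect of the size bound on merging can be sketched as follows. This is a minimal greedy illustration, not the clustering algorithm of [9]: the closeness measure is passed in by the caller, and the stand-in used in the example (`connection_count`, which counts connections only and ignores criticality) is an assumption, as are all names and numbers.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List

Cluster = FrozenSet[str]


def cluster_small_macros(
    sizes: Dict[str, int],
    closeness: Callable[[Cluster, Cluster], float],
    max_size: int,
) -> List[Cluster]:
    """Greedily merge the closest pair of clusters until no pair qualifies."""
    clusters: List[Cluster] = [frozenset([name]) for name in sizes]

    def size(c: Cluster) -> int:
        return sum(sizes[m] for m in c)

    def bounded(a: Cluster, b: Cluster) -> float:
        # Closeness is forced to 0 when the merged macro would exceed the bound.
        if size(a) + size(b) > max_size:
            return 0.0
        return closeness(a, b)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: bounded(*p))
        if bounded(a, b) <= 0.0:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Made-up edge weights: number of connections between pairs of macros.
edges = {("a", "b"): 5, ("b", "c"): 2, ("c", "d"): 4}


def connection_count(x: Cluster, y: Cluster) -> float:
    """Assumed stand-in closeness: total connections between two clusters."""
    return float(sum(w for (u, v), w in edges.items()
                     if (u in x and v in y) or (u in y and v in x)))


result = cluster_small_macros({"a": 10, "b": 10, "c": 10, "d": 10},
                              connection_count, max_size=20)
```

Here `a` and `b` merge first, then `c` and `d`; the two resulting clusters are connected but cannot merge because their combined size would exceed the bound.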
Soft-Macro Resynthesis in Interaction with Floorplanning
After forming soft-macro clusters, a timing-driven floorplanning procedure [10] is invoked to determine the relative location of each macro (hard and soft macros) on the layout plane. We then use a commercial tool to perform the placement and routing design tasks. Subsequently, we back-annotate the RC parasitic values of the layout and perform post-layout timing analysis. Finally, we determine whether some soft macros can be resynthesized to achieve timing and/or area improvements.
The key issues for the resynthesis process are twofold. First, which soft macro should be resynthesized? Second, if a soft macro needs to be resynthesized, to what extent can its timing constraint be relaxed or tightened? Our resynthesis procedure consists of two steps: (1) slack computation and (2) soft-macro resynthesis candidate selection.
In the first step, we start by back-annotating the delay information for each I/O port of the soft macros. The delay information is extracted from a post-layout timing report. We then compute the slack value for each inter-macro signal path. Finally, we assign a slack value to each I/O port of the soft macros. This value is computed using the following formula:
$$\mathrm{Slack}(p_i) = \min \{\, \mathrm{Slack}(e(p_i, p_j)) \;:\; p_i \in SM_k \text{ and } p_j \in SM_l \,\}$$
where $e(p_i, p_j)$ denotes the interconnection between ports $p_i$ and $p_j$, and $SM_k$ and $SM_l$ denote two soft macros.
The slack of an I/O port is thus defined as the minimum slack value of all the signal paths associated with this I/O port.
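A minimal sketch of this per-port slack assignment is given below, assuming the inter-macro paths and their slacks have already been read from a timing report. The tuple layout, the port names, and the slack values are made up for illustration.

```python
from typing import Dict, List, Tuple

# Each inter-macro path: (source port, sink port, slack in ns).
Path = Tuple[str, str, float]


def port_slacks(paths: List[Path]) -> Dict[str, float]:
    """Slack of a port = minimum slack over all paths touching that port."""
    slack: Dict[str, float] = {}
    for src, dst, value in paths:
        for port in (src, dst):
            slack[port] = min(slack.get(port, value), value)
    return slack


paths = [
    ("SM1/out0", "SM2/in0", -0.4),
    ("SM1/out0", "SM3/in1", 1.2),
    ("SM3/out2", "SM2/in0", 0.7),
]
slacks = port_slacks(paths)
```

Port `SM1/out0` drives one violating path and one path with positive slack, so it inherits the worse value of the two.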
In the second step, we use two cost functions to determine which soft macro should be resynthesized next so that the maximal area and/or timing improvement can be achieved. If a negative slack value is associated with any soft macro, a timing violation has occurred. In this case, we select the soft macro with the highest $NEG_{SM_i}$ as the candidate for resynthesis because it should be the most critical one. If all paths satisfy the timing constraint, we select the soft macro with the highest $POS_{SM_i}$ as the candidate because resynthesizing it with relaxed timing constraints should yield the maximal area reduction. After selecting a candidate, we use a commercial synthesis tool to resynthesize the soft macro, specifying the timing constraints of its I/O ports according to their slack values. Subsequently, we invoke a floorplanning procedure to adjust the chip floorplan while preserving the original relative locations of all soft and hard macros.
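The selection rule can be sketched as follows. The two cost functions here are stand-ins chosen for illustration, since their exact definitions are not reproduced in this text: `neg_cost` sums the magnitudes of a macro's negative port slacks and `pos_cost` sums its positive port slacks. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def neg_cost(port_slacks: List[float]) -> float:
    # Assumed stand-in for NEG_SMi: total magnitude of negative port slack.
    return sum(-s for s in port_slacks if s < 0)


def pos_cost(port_slacks: List[float]) -> float:
    # Assumed stand-in for POS_SMi: total positive port slack.
    return sum(s for s in port_slacks if s > 0)


def select_candidate(macros: Dict[str, List[float]]) -> Tuple[str, str]:
    """Return (macro name, action), where action is 'tighten' or 'relax'."""
    violated = any(s < 0 for slacks in macros.values() for s in slacks)
    if violated:
        # Timing violation: pick the most critical macro and tighten it.
        name = max(macros, key=lambda m: neg_cost(macros[m]))
        return name, "tighten"
    # Timing met: pick the macro with the most slack to give back as area.
    name = max(macros, key=lambda m: pos_cost(macros[m]))
    return name, "relax"


with_violation = select_candidate(
    {"SM1": [-0.4, 1.2], "SM2": [-0.4, -0.1], "SM3": [1.2, 0.7]})
timing_met = select_candidate(
    {"SM1": [0.1], "SM2": [0.2, 0.3], "SM3": [1.5]})
```

In the first call `SM2` carries the most negative slack and is selected for tightening; in the second call no port is violated and `SM3` is selected for relaxation.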
The proposed timing-driven soft-macro resynthesis method (TDSR) is described below:
Procedure TDSR($D_{hdl}$, $T_{const}$)
...
end of while
end of procedure

The inputs to the system include an HDL design description $D_{hdl}$ and timing constraints $T_{const}$. Let $D_{gate}$ and $S_{tree}$ denote the gate-level design and the structural tree. Initially, the system performs HDL synthesis, pre-layout timing analysis, structural-tree construction, soft-macro formation, floorplanning, place-and-route, and post-layout timing analysis to produce an initial chip layout. During each resynthesis iteration, the system first computes the slack value for each soft-macro I/O port and then computes the two cost functions for each soft macro. Next, the system selects the soft-macro candidate which contributes the most to timing or area improvement. After resynthesizing the soft macro, the system performs floorplanning adjustment, followed by RC parasitic extraction and post-layout timing analysis. Finally, if there is an improvement, the resynthesis iteration continues. Otherwise, the system stops and reports the final chip layout.
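The control structure of the iteration described above can be sketched in Python. This is a hedged outline rather than the TDSR procedure itself: the tool invocations are hidden behind a caller-supplied `step` function, a layout is reduced to a hypothetical `{"area", "delay"}` summary, and the acceptance test in `improved` is an assumption, since the text only states that the loop continues while there is improvement.

```python
from typing import Callable, Dict, List, Tuple

Layout = Dict[str, float]   # hypothetical summary: {"area": ..., "delay": ...}


def improved(old: Layout, new: Layout, t_const: float) -> bool:
    """Assumed acceptance test for one resynthesis iteration."""
    if old["delay"] > t_const:
        # Timing was violated: accept only a shorter worst-path delay.
        return new["delay"] < old["delay"]
    # Timing was met: accept only a smaller area that still meets timing.
    return new["delay"] <= t_const and new["area"] < old["area"]


def tdsr_iterations(
    initial: Layout,
    t_const: float,
    step: Callable[[Layout], Layout],
    max_iterations: int = 20,
) -> Tuple[Layout, List[Layout]]:
    """Repeat resynthesis steps while each one improves the layout."""
    current = initial
    history = [initial]
    for _ in range(max_iterations):
        # `step` stands for: slack computation, candidate selection,
        # resynthesis, floorplan adjustment, extraction, timing analysis.
        candidate = step(current)
        if not improved(current, candidate, t_const):
            break
        current = candidate
        history.append(current)
    return current, history


# Canned results standing in for three real tool runs.
canned = iter([
    {"area": 101.0, "delay": 9.5},   # tightened a critical macro
    {"area": 99.0, "delay": 9.8},    # relaxed a non-critical macro
    {"area": 99.5, "delay": 9.9},    # no further gain, so the loop stops
])
final, history = tdsr_iterations({"area": 100.0, "delay": 11.0}, 10.0,
                                 lambda layout: next(canned))
```

The first step repairs the timing violation at a small area cost, the second recovers area while timing is still met, and the third is rejected, which ends the loop.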
Experiments
We have tested the proposed method on three industrial designs. All three designs are described as hierarchical, mixed RTL and gate-level netlists in Verilog. For each design, we report the number of nets, IO pins, hard macros, soft macros before and after clustering, cells (gates) of soft macros, and the total gate count of the design. Of the three designs, ind1 contains three clock sources, and ind2 and ind3 each contain two clock sources. In all experiments, we set the threshold values $k = 2$ for large-macro decomposition and $S_{th} = 0.1 \times S_{avg}$ for small-macro clustering. Figure 3 shows the experimental flow. In the first step, we used Synopsys' Design Compiler [11] to convert the input Verilog design description into a hierarchical gate-level netlist and then performed timing analysis to report the 200 most critical paths. In the second step, we used our proposed soft-macro formation method as a pre-processing step to generate soft-macro clusters for the floorplanner. In the third step, we used Cadence's Silicon Ensemble [12] to perform chip floorplanning and determine the locations of the hard macros. In the fourth step, we used the performance-driven soft-macro placement algorithm [10] to perform soft-macro placement. In the fifth step, we used AVANT!'s Aquarious XO [13] to perform detailed placement and routing. In the sixth step, AVANT!'s STAR-RC [14] was used to extract layout parasitic information. In the seventh step, we used AVANT!'s STAR-DC tool [15] to perform delay calculations and generate an SDF file. In the eighth step, we used Synopsys' Design Time [11] to perform post-layout timing analysis. Finally, we applied the proposed soft-macro resynthesis iteration to incrementally improve the area and timing of the layout. During each resynthesis iteration, we first used Synopsys' Design Compiler to perform logic resynthesis by supplying a relaxed or a tightened timing constraint to the soft macros.
We then applied an ECO function supported by AVANT!'s Aquarious XO to perform placement and routing. For all experiments, we provided the floorplanner (the third step) and the placer-and-router (the fifth step) with the 200 most critical paths generated in the first step as the timing constraints.

We have conducted two sets of experiments. In the first experiment, we used the TSMC 0.5 µm cell library [16]. In the second experiment, we used the TSMC 0.25 µm cell library [17]. Note that ind1 and ind3 contain a PLL module. Unfortunately, a 0.25 µm PLL module was not available, so we could not run these two designs with the TSMC 0.25 µm cell library. Hence, in this paper, we only report the result of ind2 using the TSMC 0.25 µm cell library.

Table 2 shows the area-delay comparisons of ind1 using the 0.5 µm library, in which IO denotes the number of IO pins, HM the number of hard macros, SM B/A the number of soft macros before and after applying soft-macro clustering, Gate_SM the total gate count of soft macros, Gate_Tot the total gate count, Area the chip area, Delay the worst path delay, T_resyn the resynthesis run time in hours, and T_eco the ECO run time. The results show that by resynthesizing some soft macros, the timing was improved by up to 20% with almost no area penalty. Table 3 shows the area-delay comparisons of ind2 using the 0.5 µm library. We obtained a result similar to that of ind1: the timing was improved by up to 13% with almost no area penalty. Table 4 shows the area-delay comparisons of ind3 using the 0.5 µm library. The results show that the timing was improved by up to 11% with almost no area penalty. Figure 4 shows the critical path before resynthesis (Iteration 1 in Table 3). After two resynthesis iterations, the new critical path is shown in Figure 5 (Iteration 3 in Table 3). Table 5 shows the area-delay comparisons of ind2 using the 0.25 µm library. The results show that the timing was improved by up to 30% with an 11% area penalty.
We have also compared the average delays contributed by gates and interconnects using the 0.5 µm and 0.25 µm technologies. Table 6 shows the average gate and interconnect delay comparisons of the most critical paths of ind2. The results show that with the 0.5 µm technology the average gate intrinsic delay and interconnect delay are 0.171 ns and 0.277 ns, respectively. With the 0.25 µm technology, the average gate intrinsic delay and interconnect delay are 0.107 ns and 0.325 ns, respectively. From the results, we observed that the average interconnect-delay to gate-delay ratios of the 0.5 µm and 0.25 µm technologies are 1.62 and 3.04, respectively. This indicates that interconnect delays play an important role in deep-submicron technologies.
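The two ratios follow directly from the reported averages:

```latex
\[
\frac{0.277\ \text{ns}}{0.171\ \text{ns}} \approx 1.62,
\qquad
\frac{0.325\ \text{ns}}{0.107\ \text{ns}} \approx 3.04 .
\]
```

Read from the same numbers, the average gate delay falls by about 37% from 0.5 µm to 0.25 µm, while the average interconnect delay rises by about 17%, which is why the ratio nearly doubles.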
From the experiments, the following observations can be made. When using the 0.5 µm library for designs ind1, ind2, and ind3, our proposed method can improve timing by 11% to 20% with almost no area penalty. This demonstrates that by effectively relaxing the timing constraints of non-critical modules and tightening the timing constraints of critical modules, we can achieve significant timing improvements with little to no increase in chip area. When using the 0.25 µm library, our method can improve timing by 8%, 22%, and 30% with 2%, 5%, and 11% increases in chip area, respectively. We found that the 0.25 µm library supports a large set of components with a wide range of driving capabilities. This feature provides more design alternatives during the synthesis process. The experiments were conducted on an HP-C180 workstation with 750 MB of main memory. Tables 2-5 show the run times for the resynthesis and P&R ECO iterations. For example, in the first iteration of the ind1 design (Table 2), it took an average of 6 hours and 4 hours to run the synthesis and P&R ECO tasks, respectively.
Conclusions
In this paper, we have presented a complete chip design method which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. We have conducted a series of experiments on three industrial designs. The results have demonstrated that by effectively relaxing the timing constraints of non-critical modules and tightening the timing constraints of critical modules, we can achieve significant timing improvements with very little to no area penalty.
In this study, we have shown that an integrated synthesis, floorplanning, placement, and routing design flow allows designers to perform design resynthesis and ECO-based placement-and-routing guided by accurate layout timing information. This method is very effective for timing improvement with very little to no increase in chip area. One drawback of such a design flow is that it is extremely time-consuming: it takes close to one full day to run one resynthesis iteration. Shortening the iteration time will be a key factor in improving the design exploration process. One possible approach is to move the iteration loop to a higher level, such as the floorplanning level. To make this possible, a more accurate delay and area estimation method is required. Another important issue is how to determine the initial timing budget for each module before synthesis. Good initial time-budgeting should reduce the number of resynthesis iterations and thus speed up the entire design process.
