In this paper, we present an RTL design-space exploration method for high-level applications. We formulate the RTL design-space exploration as a performance-driven module selection problem. We devise a dynamic-programming algorithm to solve the problem. We present an exploration flow by integrating commercial synthesis and layout tools with our proposed method. Experimental results have demonstrated that generating the AT-curves for all modules is the most time-consuming task in the design-space exploration process. Using the proposed 3-point AT projection approach, our method achieves, on average, an 80% speed-up in run time and 90% accuracy in design estimation.
Introduction
Over the past decades, academia and industry have invested much effort in high-level synthesis, RTL/logic synthesis, and physical design. By integrating various techniques, many design methods and software systems have been developed for chip designs.

Due to the advent of deep-submicron technology, the complexity of designs has increased considerably in the past few years. In order to effectively and efficiently develop complex chips and speed up the time-to-market, [...] their design entry to a higher abstraction and use an HDL-[...] designs.

A typical chip design flow involves three levels of synthesis tasks: high-level, RTL, and physical-level synthesis. High-level synthesis deals [...] into an RTL design including scheduling, allocation, and binding [1, 2, 3]. RTL synthesis converts a structural design into a technology-specific gate-level design by applying a series of optimization and technology mapping techniques. Physical synthesis performs floorplanning, placement and routing tasks to generate the final silicon layout.

Design-space exploration at higher levels is indispensable [...] budget to the synthesis and physical design team for final layout generation. These design budgets will be used as the design constraints throughout the entire design process, including RTL/logic synthesis, floorplanning, placement and routing. This process is also called RTL sign-off.

One important issue in the RTL sign-off process is how to achieve high-confidence design estimates in the design-space exploration process. Designers rely on these design estimates to make their RTL sign-off decision. If the design estimates are not reliable, then the sign-off decision may lead to an inferior design. [...] exploration method and flow that is able to provide high-confidence design estimates.

In this paper, we present an RTL design-space exploration method for high-level applications. We formulate the RTL design-space exploration into a performance-driven module selection problem. We present a dynamic programming algorithm to solve the problem. We also present a design flow by integrating our proposed algorithm with commercial RTL/logic synthesis and physical design tools for RTL design-space exploration. Finally, we present experimental results to demonstrate the effectiveness of the proposed method.

[...] also performed a series of experiments to show the effects of physical design characteristics on the area-performance tradeoff curve during the synthesis process [9].
In [10, 11], abstracted layout area and timing models for high-level synthesis were presented. These models considered several layout factors, including layout architectures, placement, and routing. Experiments have shown that the proposed models can accurately and efficiently reflect the effects of the datapath design tradeoffs. In [12], a layout predictive model was proposed to take into account the effects of wiring and floorplanning on the area and performance estimations of RTL designs.
LAST [13] and TELE [14] used a combination of analytical and constructive techniques to estimate the area and delay of a netlist of cells. This approach partitions the circuit repeatedly into a slicing tree whose depth is specified by the user. The shape function of each leaf cell is then estimated using an analytical model. Because the construction level is controlled by the user, this approach permits the user to trade off the accuracy of the prediction against the run time of the predictor.
Jain et al. [4] proposed a mathematical model to predict the area-delay trade-off curve for pipelined and non-pipelined data paths from a data flow graph and a choice of module style. Küçükçakar and Parker [5] proposed estimation techniques to perform design-space exploration and evaluation to support system-level partitioning. In addition, quality measures and estimation techniques for high-level and system-level synthesis have been addressed in [3, 6].
Recently, Srinivasan et al. [15] presented a method to estimate chip area and path delays from an RTL behavioral description. They observed that about 80% of the total design time is spent on technology-dependent area and delay optimization. Therefore, they proposed a method to perform area and delay estimations on technology-independent designs. They first extracted design parameters from different implementations and then applied the best-fit polynomial area and delay models on the resulting technology-independent design to estimate its area and delay.
RTL Design Space Exploration

Problem Definition and Considerations
The RTL design-space exploration problem is defined as: Given an RTL design described in HDL [...]

An RTL design consists of a set of interconnected modules. Each module contains either a combinational or a sequential circuit. If it is a sequential circuit, all the latches and flip-flops are located on the output boundary of the module. (Note that we follow the guideline "register all output signals from the block" suggested by the Reuse Methodology Manual [19].) The functionality of each module is described at either the behavioral or the logic level.
There are two main concerns for RTL design-space exploration. The first is that each module can be synthesized into various gate-level designs with different AT characteristics by applying various design constraints and optimization techniques. We can run a series of synthesis tasks on each module in order to obtain accurate design characteristics; however, this is extremely time consuming. The second is how to determine the best design implementation for each module such that the total area cost is minimized while the timing constraint is satisfied.
The Proposed Method
The input to the RTL_DSE procedure is an RTL netlist. In the first step, it invokes procedure AT_Curve_Proj to project the AT characteristics of each module. In our approach, we first use a commercial synthesis tool to generate three design alternatives for each module: one with the fastest timing, one with medium timing, and one with the slowest timing. The designs with the fastest and slowest timing are treated as the lower- and upper-bound timing of the module.
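As an illustration of how three synthesized design points can stand in for a full AT-curve, the following minimal sketch interpolates piecewise-linearly between the fastest, medium, and slowest implementations. The interpolation scheme, the function name `project_at_curve`, and the numeric values are assumptions made for this example; the text above does not state how the projection between the three points is computed.

```python
from bisect import bisect_right

def project_at_curve(points):
    """Build area(t) from synthesized (delay, area) points.

    Assumption: piecewise-linear interpolation between neighbouring
    points with distinct delays; the paper's projection may differ.
    """
    pts = sorted(points)                 # order by delay, fastest first
    delays = [t for t, _ in pts]
    t_min, t_max = delays[0], delays[-1]

    def area_at(t):
        if not t_min <= t <= t_max:
            raise ValueError("delay outside the module's timing bounds")
        if t == t_max:
            return pts[-1][1]
        k = bisect_right(delays, t) - 1  # segment [k, k+1] that contains t
        (t0, a0), (t1, a1) = pts[k], pts[k + 1]
        return a0 + (a1 - a0) * (t - t0) / (t1 - t0)

    return area_at

# Hypothetical module: fastest, medium, slowest designs as (delay, area).
area = project_at_curve([(4.0, 900.0), (6.0, 500.0), (10.0, 300.0)])
print(area(5.0))  # halfway between the fastest and medium points
print(area(8.0))  # halfway between the medium and slowest points
```

Dropping the medium point from the input list gives the corresponding 2-point projection, which is a single straight line between the timing bounds.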
In the second step, we use a commercial layout tool to estimate the inter-module wire delays (WireDelay_Est()) by performing a module-based placement procedure. In our approach, we first estimate the inter-module wire length by performing module-based placement using the smallest-area (i.e., the slowest-timing) design for all modules. These inter-module wire lengths are treated as the lower-bound wire lengths ($L_{min}(e_{ij})$) between modules. Then, we estimate the inter-module wire length by performing module-based placement using the fastest-timing (i.e., the largest-area) design for all modules. These inter-module wire lengths are treated as the upper-bound wire lengths ($L_{max}(e_{ij})$) between modules.

In the third step, procedure AT_Bound_Est() determines the lower- and upper-bound timings {$T_{min}$, $T_{max}$} of the RTL design, which are computed by performing timing analysis using the fastest-delay and smallest-area designs for all modules, respectively.

The design-space exploration part consists of two while loops. We first set the upper-bound timing $T_{max}$ as the time constraint of the design. Initially, we select the smallest-area design for all modules (Init_TimeAssign()). If more improvement can be achieved, in either delay or area, the procedure first invokes the WireDelay_Est() procedure to project the inter-module interconnect delays. Intuitively, the inter-module wire lengths are proportional to the total area of the design. Hence, we project the inter-module interconnect delays as [...] where $d_{max}(e_{ij})$ and $d_{min}(e_{ij})$ are the upper- and lower-bound inter-module interconnect delays, respectively. $A_{max}$ and $A_{min}$ are the total areas using the fastest-timing and smallest-area designs for all modules, respectively.
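The projection formula itself did not survive in this copy of the text. One reading that is consistent with the stated intuition (wire length grows with total area) is a linear interpolation between the two bounds, sketched below. The linear form, the clamping, and the function name `project_wire_delay` are assumptions for illustration only and should not be taken as the authors' equation.

```python
def project_wire_delay(d_min, d_max, area, a_min, a_max):
    """Interpolate an inter-module wire delay from the current total area.

    Assumption: the delay moves linearly from d_min (at total area a_min)
    to d_max (at total area a_max). The original equation is not
    available, so this is only one plausible form.
    """
    if a_max == a_min:
        return d_min                    # degenerate case: a single design point
    frac = (area - a_min) / (a_max - a_min)
    frac = min(1.0, max(0.0, frac))     # stay within the two bounds
    return d_min + (d_max - d_min) * frac

# Hypothetical edge e_ij with delay bounds 1.0 and 3.0, total area halfway.
print(project_wire_delay(1.0, 3.0, area=1500.0, a_min=1000.0, a_max=2000.0))
```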
After determining the interconnect wire delays, we invoke the TimingAnalysis procedure to identify the most critical path ($G_k$) and then invoke the proposed Performance-Driven Module Selection (PDMS-DP) algorithm, which will be discussed in the next section, to select a new set of designs for all modules. The inner while loop is executed until no more improvement can be achieved and all signal paths in the design satisfy the timing constraint. After that, we tighten the timing constraint by a user-specified constant ($\Delta t$) and repeat the outer while loop to continue the design-space exploration process.
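The outer loop described above can be sketched on a toy design as follows. This is a simplified illustration under several assumptions: the design is a single path, the wire delay is a fixed number rather than being re-projected each iteration, the inner loop is collapsed into one selection call, and an exhaustive search (`select_min_area`, a hypothetical helper) stands in for the PDMS-DP algorithm.

```python
from itertools import product

def select_min_area(modules, wire_delay, t_const):
    """Toy stand-in for module selection: exhaustive search on one path."""
    best = None
    for choice in product(*modules):
        delay = sum(t for t, _ in choice) + wire_delay
        area = sum(a for _, a in choice)
        if delay <= t_const and (best is None or area < best[1]):
            best = (delay, area)
    return best

def explore(modules, wire_delay, dt):
    """Start at T_max and tighten the constraint by dt until T_min is passed."""
    t_min = sum(min(t for t, _ in m) for m in modules) + wire_delay
    t_const = sum(max(t for t, _ in m) for m in modules) + wire_delay  # T_max
    curve = []
    while t_const >= t_min:             # the all-fastest design is still feasible
        delay, area = select_min_area(modules, wire_delay, t_const)
        curve.append((t_const, delay, area))
        t_const -= dt                   # tighten the timing constraint
    return curve

# Hypothetical two-module path; each alternative is (delay, area).
modules = [[(2, 50), (4, 30)], [(3, 80), (5, 40), (7, 20)]]
curve = explore(modules, wire_delay=1, dt=2)
for t_const, delay, area in curve:
    print(t_const, delay, area)
```

Each printed row is one point of the design-level area-time tradeoff: the constraint that was applied, the delay achieved, and the minimum area found under that constraint.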
Performance-Driven Module Selection
Problem Definition
The performance-driven module selection problem is defined as: Given a signal path which traverses through a set of modules, a set of design alternatives for each module with various AT (Area-Time) characteristics and a timing constraint, determine the design implementation for each module such that the overall area cost is minimized while satisfying the given timing constraint.
[...] represent the AT-curve function of module [...]. Figure 3 illustrates an example in which a path traverses from I to O via three modules $m_1$, $m_2$ and $m_3$. Each module has an AT-curve that represents the possible design alternatives of the module. The objective of the performance-driven module selection problem is to find a solution for all the modules such that the total area cost ($a_1 + a_2 + a_3$) is minimized subject to satisfying the timing constraint ($t_1 + t_{12} + t_2 + t_{23} + t_3 \le T_{const}$).
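For a single path such as the three-module example, the selection problem can be solved by dynamic programming over the modules in path order. The sketch below is an assumption-laden illustration, not the PDMS-DP algorithm itself: it keeps, after each module, only the non-dominated (delay, area) prefixes that still meet the constraint. The function name `pdms_single_path` and all numeric values are hypothetical.

```python
def pdms_single_path(modules, wires, t_const):
    """Choose one implementation per module on a path to minimize total area.

    modules[i] is a list of (delay, area) alternatives for module i;
    wires[i] is the wire delay between module i and module i + 1.
    Returns (total_area, total_delay, chosen_indices), or None if no
    combination meets t_const.
    """
    states = [(0, 0, [])]               # (delay so far, area so far, picks)
    for i, impls in enumerate(modules):
        wire = wires[i - 1] if i > 0 else 0
        candidates = []
        for t_prev, a_prev, picks in states:
            for j, (t, a) in enumerate(impls):
                t_new = t_prev + wire + t
                if t_new <= t_const:    # delays only grow, so prune early
                    candidates.append((t_new, a_prev + a, picks + [j]))
        # Pareto pruning: scanning by increasing delay, keep a prefix only
        # if it is strictly smaller in area than every faster prefix.
        candidates.sort(key=lambda s: (s[0], s[1]))
        states, best_area = [], float("inf")
        for s in candidates:
            if s[1] < best_area:
                states.append(s)
                best_area = s[1]
    if not states:
        return None
    t, a, picks = min(states, key=lambda s: s[1])
    return a, t, picks

# Hypothetical AT alternatives for m1, m2, m3 and wire delays t12, t23.
m1 = [(2, 60), (4, 30)]
m2 = [(3, 90), (5, 50), (8, 20)]
m3 = [(1, 45), (3, 10)]
result = pdms_single_path([m1, m2, m3], wires=[1, 1], t_const=12)
print(result)
```

Pruning is safe here because the cost of the remaining modules depends on a prefix only through its accumulated delay, so a prefix that is both slower and no smaller than another can never lead to a better complete solution.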
Preliminaries
We use a connected graph to represent an RTL design [...]

Let $S_i$ be a set of 3-tuple instances $(j, a, t)$ that represents the updated area $a$ and delay $t$ from module 1 to module $i$ when selecting the $j$-th implementation for module $i$. The inputs to the algorithm include the RTL netlist ($G$), a subnetlist ($G_k$), a timing constraint ($T_{const}$), and the set of all possible implementations for all modules. The output is a new set of implementations of modules. In the first step, the algorithm computes the slack value of the given subnetlist. In the [...]

Figure 4: The comparisons between the design-space exploration using the actual and projected AT-curves: the SP design. Curves: Actual, 2-pt Estimation, 3-pt Estimation.

[...] 10-pt, 3-pt, and 2-pt denote the total run time for generating 10, 3, and 2 design alternatives for each module in the design. RTL_DSE denotes the run times for the RTL design-space exploration process. All experiments were run on a SUN Ultra60 with 1 GB of memory. The results show that the majority of the run time was consumed by generating the AT-curves of all modules. Using the 2-point and 3-point AT projection methods, we only needed to run the synthesis process 2 or 3 times instead of 10 times for each module. Hence, we can easily reduce the total run time by 70%-80%.

The question now is how accurately the design estimates obtained with the 2-point and 3-point AT projection methods match the actual designs. In order to demonstrate the accuracy of design estimations based on the 2-point and 3-point projected AT-curves of leaf modules, we performed the experiment as follows. First, we performed RTL design-space exploration using the 10-point actual AT characteristics of each module and then generated the final designs. The results generated in the first step are treated as the actual final designs. Second, we used the 2-point and 3-point methods to project the AT-curve for each module. Third, we applied the RTL_DSE algorithm to determine the timing constraint for each leaf module.
Fourth, we used the timing constraints obtained in the third step as the timing constraint for each module and invoked Synopsys's Design Compiler to synthesize each module into a gate-level netlist. Finally, we performed timing analysis on the resultant designs.

For example, using the actual AT-curve (i.e., 10 actual design points for each module), the results (Figure 4) show that the final timing is 28.66s and the design consists of 20,127 gates. Using the projected AT-curve, we first used the 2-point and 3-point methods to project the AT-curve for each module. Then we ran our proposed method to predict the timing requirement for each module. After that, we used the predicted timing requirement as the timing constraint and invoked Synopsys's Design Compiler to synthesize each module into a gate-level design. Finally, we performed timing analysis on the design. The results show that the resultant designs using the 2-point and 3-point AT projection methods required 27,057 and 23,672 gates, respectively, to achieve the same timing.

Figure 5: The comparisons between the design-space exploration using the actual and projected AT-curves: the Elliptic Filter design. Curves: Actual, 2-pt Estimation, 3-pt Estimation.

Figures 4, 5, 6, and 7 show the comparisons between the final designs generated using the actual and projected AT-curves of the SP, Elliptic Filter, Controller, and SDRAM Controller designs, respectively. The results show that in most cases the designs generated using the projected AT-curves consistently reflect the designs generated using the actual AT-curves. One exception is the Controller design. When the timing is 22.71 and the 2-point AT projection method is used, the resultant design is 31.8% larger than the design generated using the actual AT-curves. The reason is that this design contains too many inferior modules when they are synthesized with the projected timing constraints.

Table 3 shows the comparisons of the maximum ($E_{max}$) and average ($E_{ave}$) errors between the design-space exploration using the actual and projected AT-curves of leaf modules. The results show that the average maximum and average errors using the 3-point AT projection method are 11.3% and 6.8%, which are better than those (19.6% and 12.2%) using the 2-point AT projection method.
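To make the size of these deviations concrete, the gate counts quoted for the SP example can be turned into relative errors. The sketch below assumes the error is the relative gate-count difference at equal timing; the exact definition behind $E_{max}$ and $E_{ave}$ is not given in the text, and the function name `area_error` is hypothetical.

```python
def area_error(estimated_gates, actual_gates):
    """Relative area deviation in percent (assumed error definition)."""
    return abs(estimated_gates - actual_gates) / actual_gates * 100.0

# Gate counts quoted in the text for the SP design at the same timing.
actual = 20127
errors = {
    "2-pt": area_error(27057, actual),
    "3-pt": area_error(23672, actual),
}
for name, err in errors.items():
    print(f"{name}: {err:.1f}%")
```

Under this definition the 2-point design is about 34.4% larger than the actual one and the 3-point design about 17.6% larger, which matches the overall trend that the 3-point projection tracks the actual AT-curves more closely.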
Conclusions
In this paper, we have presented an RTL design-space exploration method for high-level applications. In our approach, we have integrated commercial synthesis and layout tools with our proposed algorithm for design-space exploration. The design flow can be executed automatically under the control of a Perl-based script.

Figure 6: The comparisons between the design-space exploration using the actual and projected AT-curves: the Controller design. Curves: Actual, 2-pt Estimation, 3-pt Estimation.
In the experiments, we have conducted two design-space exploration methods: (1) using the actual AT-curves and (2) using the projected AT-curves of leaf modules. We have learned that in order to generate actual AT characteristics for all modules, we need to synthesize each module with various timing constraints. However, this procedure is extremely time consuming. For medium-sized designs, it required several hours to fully explore the design space. If some error in the design estimates can be tolerated, the 3-point AT projection method is a good choice to speed up the design-space exploration process. We believe that for medium-sized designs the moderately long run time is still acceptable to most designers when the design-space exploration process is executed in a fully automatic way. However, for large designs, the run time may increase so drastically that it may not be acceptable to many designers. Hence, how to develop a fast and accurate RTL module-based AT projection method needs to be studied further. In addition, the effect of wiring delays on large designs also needs to be studied further.

Figure 7: The comparisons between the design-space exploration using the actual and projected AT-curves: the SDRAM Controller design.
