In this paper, we present an RTL design-space exploration method for high-level applications. We formulate RTL design-space exploration as a performance-driven module selection problem and devise a dynamic-programming algorithm to solve it. We present an exploration flow that integrates commercial synthesis and layout tools with our proposed method. Experimental results demonstrate that generating the AT-curves for all modules is the most time-consuming task in the design-space exploration process. Using the proposed 3-point AT projection approach, our method achieves, on average, an 80% speed-up in run time and 90% accuracy in design estimation.
Introduction
Over the past decades, academia and industry have invested much effort in research on high-level synthesis, RTL/logic synthesis, and physical design. By integrating various techniques, many design methods and software systems have been developed for chip design. With the advent of deep-submicron technology, the complexity of designs has increased considerably in the past few years. To develop complex chips effectively and efficiently and to shorten time-to-market, more and more integrated-circuit designers are moving their design entry to a higher level of abstraction and using an HDL-based synthesis approach to develop and manage large designs.
A typical chip design flow involves three levels of synthesis tasks: high-level, RTL, and physical-level synthesis. High-level synthesis transforms a behavioral description into an RTL design through scheduling, allocation, and binding [1, 2, 3]. RTL synthesis converts a structural design into a technology-specific gate-level design by applying a series of optimization and technology-mapping techniques. Physical synthesis performs floorplanning, placement, and routing to generate the final silicon layout.
Design-space exploration at higher levels is indispensable for making intelligent design decisions and tradeoffs at an early design stage. In high-level design, design tradeoffs can be obtained by exploring design architectures, schedules, and modules [4, 5, 6]. In RTL design, design tradeoffs can be obtained by exploring target libraries, optimization procedures, and physical effects. In this paper, we focus on design-space exploration at RTL. Figure 1 depicts a typical RTL-based design flow consisting of two phases: (1) RTL sign-off and (2) RTL-based synthesis. In the first phase, designers perform design-space exploration to evaluate all possible design alternatives and to determine whether the RTL specification can satisfy the design requirements. Once designers determine that the RTL specification meets the design requirements, they determine the design budgets, including timing, area, and power, for each module of the RTL design specification. They then sign off the RTL design, with its design budgets, to the synthesis and physical design team for final layout generation. These design budgets are used as the design constraints throughout the rest of the design process, including RTL/logic synthesis, floorplanning, placement, and routing. This process is also called RTL sign-off.
One important issue in the RTL sign-off process is how to achieve high-confidence design estimates in the design-space exploration process. Designers rely on these design estimates to make their RTL sign-off decision. If the design estimates are not reliable, the sign-off decision may lead to an inferior design. This motivates us to investigate a design-space exploration method and flow that can provide high-confidence design estimates.
In this paper, we present an RTL design-space exploration method for high-level applications. We formulate RTL design-space exploration as a performance-driven module selection problem and present a dynamic-programming algorithm to solve it. We also present a design flow that integrates our proposed algorithm with commercial RTL/logic synthesis and physical design tools for RTL design-space exploration. Finally, we present experimental results to demonstrate the effectiveness of the proposed method.
Related Work
McFarland used the BUD system [7, 8] to perform a number of experiments demonstrating that a simple layout model that ignores interconnects and other layout factors is inadequate to provide accurate estimates during the synthesis process. Parker et al. also performed a series of experiments to show the effects of physical design characteristics on the area-performance tradeoff curve during the synthesis process [9]. In [10, 11], abstracted layout area and timing models for high-level synthesis were presented. These models considered several layout factors, including layout architectures, placement, and routing. Experiments showed that the proposed models can accurately and efficiently reflect the effects of the datapath design tradeoffs. In [12], a layout predictive model was proposed to take into account the effects of wiring and floorplanning on the area and performance estimates of RTL designs.
LAST [13] and TELE [14] used a combination of analytical and constructive techniques to estimate the area and delay of a netlist of cells. The approach partitions the circuit repeatedly into a slicing tree in which the level of the slicing tree is specified by the user. The shape function of each leaf cell is then estimated using an analytical model. Because the construction level is controlled by the user, this approach lets the user trade off the accuracy of the prediction against the run time of the predictor. Jain et al. [4] proposed a mathematical model to predict the area-delay tradeoff curve for pipelined and nonpipelined datapaths from a data flow graph and a choice of module style. Küçükçakar and Parker [5] proposed estimation techniques to perform design-space exploration and evaluation to support system-level partitioning. In addition, quality measures and estimation techniques for high- and system-level synthesis have been addressed in [3, 6].
Recently, Srinivasan et al. [15] presented a method to estimate chip area and path delays from an RTL behavioral description. They observed that about 80% of the total design time is spent on technology-dependent area and delay optimization. Therefore, they proposed a method to perform area and delay estimation on technology-independent designs. They first extracted design parameters from different implementations and then applied best-fit polynomial area and delay models to the resulting technology-independent design to estimate its area and delay.
RTL Design Space Exploration
Problem Definition and Considerations
The RTL design-space exploration problem is defined as follows: given an RTL design described in HDLs (hardware description languages) and a specific technology-dependent cell library, determine all possible design implementations with various AT (Area-Time) characteristics.
An RTL design consists of a set of interconnected modules. Each module contains either a combinational or a sequential circuit. If it is a sequential circuit, all the latches and flip-flops are located on the output boundary of the module. (Note that we follow the design guideline "For each block of a hierarchical design, register all output signals from the block" suggested by the Reuse Methodology Manual [19].) The functionality of each module is described at either the behavioral or the logic level.
There are two main concerns in RTL design-space exploration. The first is that each module can be synthesized into various gate-level designs with different AT characteristics by applying various design constraints and optimization techniques. We could run a series of synthesis tasks on each module to obtain accurate design characteristics; however, this is extremely time-consuming. The second is how to determine the best design implementation for each module such that the total area cost is minimized while the timing constraint is satisfied.
The Proposed Method
The input to the RTL_DSE procedure is an RTL netlist. In the first step, it invokes procedure AT_Curve_Proj to project the AT characteristics of each module. In our approach, we first use a commercial synthesis tool to generate three design alternatives for each module: one with the fastest timing, one with medium timing, and one with the slowest timing. The designs with the fastest and slowest timing are treated as the lower- and upper-bound timing of the module.
In the second step, we use a commercial layout tool to estimate the inter-module wire delays (Wire_Delay_Est()) by performing a module-based placement procedure. We first estimate the inter-module wire lengths by performing module-based placement using the smallest-area (i.e., slowest-timing) design for all modules. These inter-module wire lengths are treated as the lower-bound wire lengths ($d_{\min}(e_{ij})$) between modules. Then, we estimate the inter-module wire lengths by performing module-based placement using the fastest-timing (i.e., largest-area) design for all modules. These inter-module wire lengths are treated as the upper-bound wire lengths ($d_{\max}(e_{ij})$) between modules.
In the third step, procedure AT_Bound_Est() determines the lower- and upper-bound timings $\{T_{\min}, T_{\max}\}$ of the RTL design, which are computed by performing timing analysis using the fastest-delay and smallest-area designs for all modules, respectively.
The design-space exploration part consists of two while loops. We first set the upper-bound timing $T_{\max}$ as the timing constraint of the design. Initially, we select the smallest-area design for all modules (Init_Time_Assign()). If more improvement can be achieved, in either delay or area, the procedure first invokes the Wire_Delay_Est() procedure to project the inter-module interconnect delays. Intuitively, the inter-module wire lengths are proportional to the total area of the design. Hence, we project the inter-module interconnect delays between their bounds according to the total area of the design, where $d_{\max}(e_{ij})$ and $d_{\min}(e_{ij})$ are the upper- and lower-bound inter-module interconnect delays, respectively, and $A_{\max}$ and $A_{\min}$ are the total areas using the fastest-timing and smallest-area designs for all modules, respectively.
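One simple way to realize such a projection is linear interpolation between the two placement-derived bounds, driven by the current total area. The sketch below is an assumption that matches the proportionality argument above; it is not necessarily the authors' exact expression, and the function and parameter names are hypothetical.

```python
def project_wire_delay(d_min: float, d_max: float,
                       area: float, a_min: float, a_max: float) -> float:
    """Interpolate one inter-module wire delay from the total design area.

    d_min, d_max: lower- and upper-bound delays of the edge e_ij.
    a_min, a_max: total areas of the smallest-area and fastest-timing designs.
    Assumption: delay grows linearly with total area between the two bounds.
    """
    if a_max == a_min:
        # Degenerate case: both bounding designs have the same area.
        return d_min
    ratio = (area - a_min) / (a_max - a_min)
    # Clamp so that a design outside the bounds never extrapolates.
    ratio = min(1.0, max(0.0, ratio))
    return d_min + ratio * (d_max - d_min)


# A design halfway between the two bounding areas gets the midpoint delay.
midpoint = project_wire_delay(1.0, 3.0, 150.0, 100.0, 200.0)  # 2.0
```

Clamping keeps the estimate inside the measured bounds, which seems the safer choice when the selected designs drift slightly outside the two placements that were actually run.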
After determining the interconnect wire delays, we invoke the Timing_Analysis procedure to identify the most critical path ($G_k$) and then invoke the proposed Performance-Driven Module Selection (PDMS_DP) algorithm, which is discussed in the next section, to select a new set of designs for all modules. The inner while loop is executed until no more improvement can be achieved and all signal paths in the design satisfy the timing constraint. After that, we tighten the timing constraint by a user-specified constant ($\Delta t$) and repeat the outer while loop to continue the design-space exploration process.
Each module has an AT-curve that represents the possible design alternatives of the module. The objective of the performance-driven module selection problem is to find a solution for all the modules such that the total area cost is minimized subject to the timing constraint. For a path with three modules, for example, the total area cost is $a_1 + a_2 + a_3$ and the timing constraint is $t_1 + t_{12} + t_2 + t_{23} + t_3 \le T_{const}$.
Preliminaries
We use a connected graph to represent an RTL design. Let $G = (V, E)$ represent an RTL netlist, where $V = \{v_i \mid i = 1..n\}$ denotes a set of modules and $E = \{e_{ij} \mid v_i, v_j \in V\}$ a set of interconnections. Each module has $m$ possible implementations with various area-delay characteristics. Let $v_{ij}$ denote module $v_i$ with the $j$th implementation and $\{a_{ij}, t_{ij}\}$ the {area, delay} cost of the $j$th implementation of module $v_i$, where $j = 1..m$. Let $AT(V)$ be the set of all possible implementations for all modules, $d(e_{ij})$ the interconnect delay between modules $i$ and $j$, and $T_{const}$ the given timing constraint. $G_k$ denotes a subgraph containing the set of modules that are on signal path $k$, and $Sla_k$ is the slack value of path $k$.
The Dynamic-Programming Algorithm
Given a path with $n$ modules, each with $m$ implementations (i.e., $m$ area-delay characteristics), let $A_n(SR_n) = \{a_1, a_2, \ldots, a_n\}$ be a minimal-area solution for implementation selection for module 1 to module $n$, where $SR_n$ is the remaining slack value up to module $n$ ($SR_n \ge 0$). Then, for each $i$, $1 \le i \le n$, $A_i(SR_i)$ and $A_{i+1}(SR_{i+1})$ must be minimal-area solutions for implementation selection for module 1 to module $i$ and for module $i+1$ to module $n$, respectively. The principle of optimality therefore applies, and the solution can be constructed one module at a time.
Let $S_i$ be a set of 3-tuple instances $(j, a, t)$ that represent the updated area $a$ and delay $t$ from module 1 to module $i$ when the $j$th implementation is selected for module $i$. The inputs to the algorithm are the RTL netlist ($G$), a subnetlist ($G_k$), a timing constraint ($T_{const}$), and the set of all possible implementations for all modules. The output is a new set of implementations for the modules. In the first step, the algorithm computes the slack value of the given subnetlist. In the second step, it computes the 3-tuple instances for modules 1 to $n-1$ ($S_1$ to $S_{n-1}$). In the third step, it computes $S_n$. Finally, it invokes procedure Trace_Back() to determine the implementation selection for all modules.
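The selection along a single path can be illustrated with a small dynamic program. The sketch below is not the authors' PDMS_DP procedure. It assumes a simple chain of modules, non-negative delays, and hypothetical names (`select_modules`, `impls`, `wire_delays`). For each prefix of the path it keeps only the non-dominated (area, delay) states, which plays the role of the sets $S_i$, and it carries the chosen indices along with each state instead of running a separate trace-back pass.

```python
from typing import List, Optional, Tuple

Impl = Tuple[float, float]  # (area, delay) of one implementation


def select_modules(impls: List[List[Impl]],
                   wire_delays: List[float],
                   t_const: float) -> Optional[Tuple[float, float, List[int]]]:
    """Pick one implementation per module on a path, minimizing total area.

    impls[i] lists the (area, delay) alternatives of module i.
    wire_delays[i] is the interconnect delay between module i and module i + 1.
    Returns (total_area, total_delay, choices), or None if no selection
    meets the timing constraint t_const.
    """
    # Each state is (area, delay, choices) accumulated from the first module.
    states: List[Tuple[float, float, List[int]]] = [(0.0, 0.0, [])]
    for i, alternatives in enumerate(impls):
        wire = wire_delays[i - 1] if i > 0 else 0.0
        candidates = []
        for area, delay, choices in states:
            for j, (a, t) in enumerate(alternatives):
                new_delay = delay + wire + t
                # Delays are non-negative, so a violated prefix stays violated.
                if new_delay <= t_const:
                    candidates.append((area + a, new_delay, choices + [j]))
        # Pareto pruning: scan by increasing area, keep strictly faster states.
        candidates.sort(key=lambda s: (s[0], s[1]))
        states = []
        best_delay = float("inf")
        for cand in candidates:
            if cand[1] < best_delay:
                states.append(cand)
                best_delay = cand[1]
        if not states:
            return None
    return states[0]  # smallest area among the feasible states


# Hypothetical three-module path; index 0 is the smallest-area alternative.
impls = [
    [(10, 5), (14, 3)],
    [(20, 8), (26, 6), (30, 4)],
    [(8, 4), (12, 2)],
]
result = select_modules(impls, wire_delays=[1, 1], t_const=16)
# result == (46.0, 15.0, [1, 0, 1]): speed up modules 1 and 3, keep module 2 small.
```

With the constraint relaxed to 19, the all-smallest selection already fits and the sketch returns area 38. Tightening the constraint step by step in this way traces out one AT-curve of the path.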
Experiments
We have implemented the proposed method in the C and Perl programming languages. We tested it on four benchmark designs, as shown in Table 1. The first design is a 64-bit simple processor (SP). The second is a 64-bit elliptic filter. The third is a large controller. The fourth is an SDRAM controller. All four designs are described as hierarchical RTL netlists in Verilog. Table 1 shows the characteristics of the benchmarks, in which #Mods (Seq./Comb.), #Inter-Nets, and #IOs denote the number of modules, inter-module nets, and input/output pins, respectively. In all experiments, we used a 0.6 µm cell library [20].
We conducted three sets of experiments: RTL design-space exploration using (1) a 2-point AT projection, (2) a 3-point AT projection, and (3) the actual AT-curve method. In the first step, we used Synopsys's Design Compiler [16] to synthesize each leaf module with two options: (1) minimizing area (set_max_area 0) and (2) minimizing delay (set_max_delay 0 -to all_outputs()), which resulted in two design alternatives, $(A_a, T_a)$ and $(A_t, T_t)$. We treated $T_a$ and $T_t$ as the upper- and lower-bound timing of the module. Then, we synthesized the leaf module with a timing constraint of $(T_a + T_t)/2$ to generate a third design alternative, $(A_{mid}, T_{mid})$. We then projected the AT-curves based on $\{(A_a, T_a), (A_t, T_t)\}$ (2-point AT projection) and $\{(A_a, T_a), (A_{mid}, T_{mid}), (A_t, T_t)\}$ (3-point AT projection). In our 2- and 3-point AT-curve projection, we assume that the delay and area of each module have a linear relation.
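The projection step can be sketched as interpolation over the synthesized points. The code below assumes straight-line segments between neighbouring points (so the 3-point curve is piecewise linear) and holds the end values outside the synthesized range. The function name and the sample numbers are hypothetical, not taken from the experiments.

```python
from typing import List, Tuple


def project_area(points: List[Tuple[float, float]], timing: float) -> float:
    """Estimate a module's area at `timing` from synthesized (area, timing) points.

    Works for both the 2-point and the 3-point projection: neighbouring
    points are joined by straight segments (assumed linear area-delay relation).
    """
    pts = sorted(points, key=lambda p: p[1])  # order by timing, fastest first
    if timing <= pts[0][1]:
        return pts[0][0]   # faster than the fastest design: hold its area
    if timing >= pts[-1][1]:
        return pts[-1][0]  # slower than the slowest design: hold its area
    for (a_lo, t_lo), (a_hi, t_hi) in zip(pts, pts[1:]):
        if t_lo <= timing <= t_hi:
            frac = (timing - t_lo) / (t_hi - t_lo)
            return a_lo + frac * (a_hi - a_lo)
    raise ValueError("timing outside the projected range")


# Hypothetical module: (A_a, T_a) = (100, 20), (A_t, T_t) = (300, 10),
# and a mid point (A_mid, T_mid) = (160, 15).
two_point = [(100.0, 20.0), (300.0, 10.0)]
three_point = two_point + [(160.0, 15.0)]

area_2pt = project_area(two_point, 12.5)    # 250.0
area_3pt = project_area(three_point, 12.5)  # 230.0
```

In this made-up example the mid point lies below the straight line through the two extremes, so the 2-point projection overestimates area everywhere between them. That is the kind of error the third synthesis run is meant to reduce.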
For the actual AT-curve method, we iteratively synthesized each module to generate the area-delay characteristics by tightening the timing constraint from the upper-bound timing to the lower-bound timing (set_max_delay time_temp -to all_outputs()). In total, we generated 10 design alternatives for each module.
In the second step, we used Cadence's Silicon Ensemble [17] to perform a block placement procedure and then estimated the inter-module wire lengths based on the Manhattan distance model. In the third step, we used Synopsys's Design Time [16] to perform timing analysis and report the lower- and upper-bound timing of the design. In the fourth step, we invoked the proposed RTL design-space exploration procedure to generate the AT-curve of the design. During the design-space exploration process, we used Synopsys's Design Time to perform timing analysis and report the most critical paths.
Table 2 shows that the majority of the run time was consumed in generating the AT-curves of all modules. Using the 2-point and 3-point AT projection methods, we only needed to run the synthesis process 2 or 3 times instead of 10 times for each module. Hence, we can reduce the total run time by 70%-80%.
The question is then how accurately the design estimates obtained with the 2-point and 3-point AT projection methods match the actual designs. To demonstrate the accuracy of design estimates based on the 2- and 3-point projected AT-curves of leaf modules, we performed the following experiment. First, we performed RTL design-space exploration using the 10-point actual AT characteristics of each module and then generated the final designs. The results generated in this first step are treated as the actual final designs. Second, we used the 2-point and 3-point methods to project the AT-curve for each module. Third, we applied the RTL_DSE algorithm to determine the timing constraint for each leaf module. Fourth, we used the timing constraints obtained in the third step as the timing constraint for each module and invoked Synopsys's Design Compiler to synthesize each module into a gate-level netlist. Finally, we performed timing analysis on the resultant designs.
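The Manhattan-distance wire-length estimate used in the second step of the flow can be sketched as follows. The text does not say which points of the placed blocks are measured, so the sketch assumes block centers. The block names, coordinates, and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def manhattan(p: Point, q: Point) -> float:
    """Manhattan (rectilinear) distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wire_lengths(centers: Dict[str, Point],
                 nets: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Estimate the length of each two-pin inter-module net.

    Assumption: each net runs between the centers of the two placed blocks.
    """
    return {(u, v): manhattan(centers[u], centers[v]) for u, v in nets}


# Hypothetical block placement (arbitrary length units).
centers = {"alu": (0.0, 0.0), "regfile": (3.0, 4.0), "ctrl": (3.0, 0.0)}
lengths = wire_lengths(centers, [("alu", "regfile"), ("regfile", "ctrl")])
# lengths == {("alu", "regfile"): 7.0, ("regfile", "ctrl"): 4.0}
```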
For example, using the actual AT-curve (i.e., 10 actual design points for each module), the results (Figure 4) show that the final timing is 28.66s and the design consists of 20,127 gates. Using the projected AT-curves, we first used the 2-point and 3-point methods to project the AT-curve for each module. Then we ran our proposed method to predict the timing requirement for each module. After that, we used the predicted timing requirement as the timing constraint and invoked Synopsys's Design Compiler to synthesize each module into a gate-level design. Finally, we performed timing analysis on the design. The results show that the resultant designs using the 2-point and 3-point AT projection methods required 27,057 and 23,672 gates, respectively, to achieve the same timing.
Figures 4, 5, 6, and 7 show the comparisons between the final designs generated using the actual and projected AT-curves of the SP, elliptic filter, controller, and SDRAM controller, respectively. The results show that in most cases the designs generated using the projected AT-curves consistently reflect the designs generated using the actual AT-curves. One exception is the controller design: when the timing is 22.71 and the 2-point AT projection method is used, the resultant design is 31.8% larger than the design generated using the actual AT-curve method. The reason is that this design contains too many inferior modules when they are synthesized with the projected timing constraints.
Table 3 shows the comparisons of the maximum ($E_{max}$) and average ($E_{ave}$) errors between the design-space exploration using the actual and projected AT-curves of leaf modules. The results show that the averages of the maximum and average errors using the 3-point AT projection method are 11.3% and 6.8%, which are better than those (19.6% and 12.2%) using the 2-point AT projection method.
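A maximum and an average error of this kind could be computed as sketched below. The sketch assumes that the error at each timing point is the relative difference in gate count between the projected and the actual design. The text does not spell out the definition used for Table 3, so this is only one plausible reading, and the sample values are made up.

```python
from typing import List, Tuple


def error_summary(actual: List[float], projected: List[float]) -> Tuple[float, float]:
    """Return (E_max, E_ave) over matching design points.

    Assumption: the error at each point is |projected - actual| / actual,
    with both lists giving gate counts at the same timing constraints.
    """
    errors = [abs(p - a) / a for a, p in zip(actual, projected)]
    return max(errors), sum(errors) / len(errors)


# Made-up gate counts at three timing points (not from Table 3).
e_max, e_ave = error_summary([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
# e_max is 0.1 (10%), e_ave is about 0.05 (5%).
```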
Conclusions
In this paper, we have presented an RTL design-space exploration method for high-level applications. In our approach, we have integrated commercial synthesis and layout tools with our proposed algorithm for design-space exploration. The design flow can be executed automatically under the control of a Perl-based script.
In the experiments, we conducted two design-space exploration methods: (1) using the actual AT-curves and (2) using the projected AT-curves of leaf modules. We learned that, to generate actual AT characteristics for all modules, we need to synthesize each module with various timing constraints. However, this procedure is extremely time-consuming. For medium-sized designs, it required several hours to fully explore the design space. If some error in design estimation can be tolerated, the 3-point AT projection method is a good choice for speeding up the design-space exploration process. We believe that for medium-sized designs the moderately long run time is still acceptable to most designers when the design-space exploration process is executed fully automatically. However, for large designs, the run time may increase so drastically that it may not be acceptable to many designers. Hence, how to develop a fast and accurate RTL module-based AT projection method needs to be studied further. In addition, the effect of wiring delays on large designs also needs further study.
