Accurate design descriptions during synthesis allow efficient use of resources. The appropriate use of distinct implementations of RTL operators helps generate optimal VLSI designs.
INTRODUCTION
There are many parameters of a final chip design, such as area, power, and performance, which must be addressed to make the chip useful and its manufacture profitable. At each step of the design process, these parameters must be accurately estimated and used to guide the progress of the design towards a high-quality chip. Behavioral synthesis typically suffers from inaccurate estimates because there is little information about the physical layout at the algorithmic level of description. Recently, attempts have been made to connect behavioral synthesis to lower levels of design (floorplanning [31], placement and routing [26]) to achieve better estimates during synthesis. Since design decisions made at the behavioral synthesis level have a great impact on the quality of the final design, it is crucial that good estimates be used to direct synthesis.
The goal of behavioral synthesis is to generate a datapath from an algorithmic description of the desired functionality of the chip. The first step in the process is usually the generation of an intermediate algorithmic representation [24], composed of a control flowgraph describing conditional branching and looping constructs in the behavioral description, and a dataflow graph describing the dataflow dependencies between operations in the algorithm. The task of deciding at which clock cycle each dataflow operation will be performed is called scheduling. The allocation task allots hardware modules to perform the operations in the dataflow graph, and the operator binding task assigns each dataflow operation to a particular allocated hardware module which will perform it. Additionally, an RTL description of the control unit must be generated to sequence the operations as described in the schedule. A layout can then be generated from the RTL description of the entire chip using standard physical design tools.

I.G. HARRIS and A. ORAILOGLU
In the design of a chip with any reasonable degree of complexity, it is likely that more than one implementation of an operator will be utilized. For instance, a slow ripple carry adder could be used as well as a carry-lookahead or carry-select adder. These three different types of adders will have different area and delay characteristics which should be considered during behavioral synthesis while the datapath is being created. The characteristics of the modules used during synthesis should be close to the characteristics of the modules which will be used in the physical design so that accurate estimates of delay and area consumption can be made. Delaying the resolution of abstract modules to actual components until after microarchitectural synthesis may result in inefficient area usage and reduced throughput.
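As a minimal sketch of how such a library might be represented during synthesis, the fragment below records one entry per module type and lists the types that meet a delay bound, cheapest first. The class name, field names, and all area and delay figures are hypothetical and chosen only for illustration; they are not taken from any library used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleType:
    name: str
    function: str    # operator implemented, e.g. "add"
    area: float      # hypothetical area units
    delay_ns: float  # hypothetical propagation delay


# Illustrative numbers only; real values come from the design library.
LIBRARY = [
    ModuleType("ripple_carry", "add", area=1.0, delay_ns=40.0),
    ModuleType("carry_select", "add", area=1.8, delay_ns=22.0),
    ModuleType("carry_lookahead", "add", area=2.5, delay_ns=15.0),
]


def candidates(library, function, max_delay_ns):
    """Module types of the given function that meet a delay bound, smallest area first."""
    fits = [m for m in library
            if m.function == function and m.delay_ns <= max_delay_ns]
    return sorted(fits, key=lambda m: m.area)
```

With these figures, a 25 ns bound leaves the carry-select and carry-lookahead adders as candidates, while a 50 ns bound admits all three and ranks the ripple-carry adder first because it is the smallest.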
In this paper, we propose integration of module selection into the scheduling and allocation tasks of behavioral synthesis. Traditional hardware allocation is performed under the assumption that only one module type of each functionality exists in a design library. Allocation with module selection allows us to obtain a better estimate of the area of the final design by using area information from a full library rather than the rough approximation afforded by only a single module type.

... [28], or area-constrained such as List Scheduling [16], have been explored [5]. To our knowledge, no approach besides the ILP formulations ([6, 22]) performs synthesis under both an area and a delay constraint. Although promising execution time results have been shown, the ILP problem is NP-complete and remains intractable for challenging synthesis problems.
Various forms of the module selection problem have been explored previously. The problem of module set selection through resolution to a restricted library containing a single module type for each operator functionality prior to microarchitectural synthesis has been explored in [14, 15, 17]. This approach selects a single module type for each functionality which will be used in the allocation by generating an area/delay design curve for each possible subset of module types. The selection of a module set and a corresponding clock period has been studied in [3]. The selection of a single module type for each functionality negatively impacts the scheduling of flowgraphs that contain paths of varying criticality; it is desirable to perform critical path nodes on fast modules and non-critical path nodes on slow modules in this case. Selection of a single module type makes this tradeoff impossible.
A module selection algorithm has been proposed by Ramachandran and Gajski [30] which performs component selection in conjunction with scheduling and operator binding using a distribution graph model [28] to estimate the effect of each compound decision on the area and performance of the final design. Ramachandran and Gajski's work expands this model by computing the distribution of each module type in the module library. Ishikawa and De Micheli [13] propose a module selection algorithm which uses heuristics to select module types while meeting a latency constraint. A module selection algorithm for pipelined datapaths is proposed in [23] which uses a detailed module delay model, requiring increased CPU time.
The algorithm proposed in [8] performs allocation with module selection before scheduling, by using a hill climbing technique to explore the search space of different allocations. Since allocation is performed before scheduling, allocation decisions cannot make use of scheduling information to achieve improved results.
Scheduling of pipelined datapaths has been studied in several research projects such as [27, 18, 11, 7, 12]. Even though both area and performance are frequently critical due to the stringent throughput requirements of DSP applications, module selection has not been commonly incorporated.
SYSTEM OVERVIEW
The basic components of the algorithm are heuristic synthesis, time-and-area constrained synthesis, and area estimation. The heuristic synthesis component uses heuristic measures to choose scheduling, allocation, and module selection decisions to be included in the design. The time-and-area constrained synthesis component examines the design state after each heuristic decision and prunes away options that can be seen to lead to area or delay constraint violations. The area estimation component is used by the time-and-area constrained synthesis component to determine which design decisions will lead to infeasible designs. When an allocation decision is made, the feasible module set of a flexible module is pruned.
While scheduling decisions determine the clock cycle at which a node will be executed, module selection decisions determine the flexible module which will perform the operation. Each node has a set of feasible clock cycles to which it may be scheduled. When a node is scheduled, this set is reduced to a single clock cycle. Each flowgraph node also has a feasible module set and can only be bound to a flexible module whose feasible module set shares at least one module type with the feasible module set of the node. This algorithm performs these two types of decisions in an intertwined fashion to allow all three tasks to benefit from partial design information during synthesis. The degree of intertwining is controlled by a user-defined parameter, the intertwining threshold.
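The feasible sets described above can be sketched with ordinary set operations. The fragment below is an illustration under assumed names (`Node`, `FlexibleModule`, `can_bind`, and `schedule` are hypothetical and do not come from the system itself): binding is permitted only when the two feasible module sets intersect, and scheduling collapses the set of feasible clock cycles to one element.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    feasible_cycles: set  # clock cycles the node may still be scheduled to
    feasible_types: set   # module types able to perform the node


@dataclass
class FlexibleModule:
    name: str
    feasible_types: set   # module types this allocated module may still become


def can_bind(node, module):
    """A node may be bound only if the two feasible module sets overlap."""
    return not node.feasible_types.isdisjoint(module.feasible_types)


def schedule(node, cycle):
    """Scheduling reduces the set of feasible clock cycles to a single cycle."""
    if cycle not in node.feasible_cycles:
        raise ValueError(f"cycle {cycle} is not feasible for node {node.name}")
    node.feasible_cycles = {cycle}
```

For instance, a node whose feasible types are {Fast, Med} can be bound to a flexible module with feasible types {Med, Slow}, but not to one that has already been pruned to {Slow}.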
The time-and-area constrained synthesis component is used to prune design options which can be shown to result in constraint violations. This is achieved by considering area consumption in performance determination while control step assignment possibilities are considered in area determination.
The area estimation component generates an estimate of the area by predicting an allocation which is minimally sufficient to perform the flowgraph nodes given the current state of scheduling. The area estimate is used by time-and-area constrained synthesis to determine which design options lead only to infeasible designs, and the predicted allocation produced is compared to the current allocation to determine if new modules need to be allocated.
Hardware utilization can be improved by chaining operations, that is, allowing two or more operations to be performed serially within one clock cycle. This alleviates underutilization by letting hardware use time within a clock cycle that would otherwise be wasted. The earliest and latest times at which a node can be scheduled are each kept as a clock cycle and a time displacement within the clock cycle to enable appropriate handling of chaining.

This paper will first describe how time-and-area constrained synthesis ensures that both constraints are met. Subsequently the area estimation algorithm will be presented, followed by a description of the heuristic scheduling, module selection, and allocation algorithm. Results demonstrating the effectiveness of the basic algorithm will be presented. Then synthesis of pipelined systems will be discussed, and results of pipelined synthesis with module selection will be presented.

The method of estimating the area is an application of the pigeonhole principle [2] to nodes confined to ranges of clock cycles. We will use the dataflow graph in figure 3 to demonstrate the use of the pigeonhole principle in predicting a minimum allocation. In figure 3, four nodes must be scheduled within three clock cycles; therefore, by the pigeonhole principle, at least two addition modules must be allocated. The pigeonhole principle can be generalized analogously to consider nodes which are confined to ranges of clock cycles, as well as module types.
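The pigeonhole argument can be illustrated with a short sketch. It is a simplification rather than the estimator used by the system: it assumes a single module type, that each operation occupies one module for exactly one clock cycle, and that chaining is ignored. The function name `min_modules` and the representation of each node as an inclusive (earliest, latest) pair of clock cycles are assumptions made for the example.

```python
from math import ceil


def min_modules(ranges):
    """Lower bound on the number of modules needed for nodes of one operator type.

    ranges holds one inclusive (earliest, latest) clock-cycle window per node.
    For every window [a, b], the nodes confined to it must share b - a + 1
    cycles, so at least ceil(confined / (b - a + 1)) modules are required.
    """
    starts = {lo for lo, _ in ranges}
    ends = {hi for _, hi in ranges}
    best = 0
    for a in starts:
        for b in ends:
            if b < a:
                continue
            confined = sum(1 for lo, hi in ranges if a <= lo and hi <= b)
            best = max(best, ceil(confined / (b - a + 1)))
    return best
```

Four additions that must all fall within clock cycles 1 to 3, written `min_modules([(1, 3)] * 4)`, give a bound of two adders, which matches the count stated for figure 3. Only windows that begin at some node's earliest cycle and end at some node's latest cycle need to be examined, since shrinking any other window to such endpoints keeps the same confined nodes.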
TIME-AND-AREA CONSTRAINED SYNTHESIS
Our approach is illustrated in the dataflow graph of figure 4, wherein the three shaded nodes (+4, +5, +6) are already scheduled to clock cycles while the other three are free to be scheduled over all three clock cycles (assuming chaining). The unscheduled nodes are annotated with their feasible module types.
Clearly, nodes +1, +2, and +3 must be scheduled within clock cycles 1, 2, and 3, and the total module availability over that range of clock cycles and over the feasible module range ({Fast, Med} modules) is three. The availability in the range is figured by counting the number of clock cycles at which each ...

We propose a heuristic-based approach to choose a design option in a computationally efficient manner.
Heuristic decisions are of two types, a combined scheduling/module selection decision or alternately an allocation decision. Each scheduling/module selection decision commits a node in the dataflow graph to be performed at a clock cycle, and to be performed by a particular flexible module. Each allocation decision refines the real allocation by pruning a feasible module type from a module.
The heuristic subsystem alternates between a scheduling/module selection phase and an allocation phase. We have observed that the order in which scheduling, module selection, and allocation are performed impacts the optimality of the design. Completion of one phase may limit the solution spaces of the subsequent phases in such a way that no feasible solutions which satisfy both area and delay constraints remain.
This algorithm intertwines the tasks of scheduling, allocation, and module selection, so that each task can be guided by partial information from the others. 
Experimental results show that the degree to which scheduling and allocation decisions are intertwined can have a significant effect on the quality of the design.
Scheduling/Module Selection Decisions
The system schedules each node to a clock cycle, and binds each node to an allocated module. A node is committed to a clock cycle and bound to a flexible module simultaneously. First the node which is most ready to be committed is determined, and then the best clock cycle and module for node commitment is chosen. The node is selected based on the following criteria.
chain of nodes rather than a single node. If the path scheduling freedom for a node is low then it is on a path whose completion time is large compared to the maximum time in which the path must complete to meet the delay constraint. The nodes on such critical paths should be scheduled early because they have less freedom.
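Path scheduling freedom can be pictured as the slack along the longest path through a node. The sketch below follows that reading and should not be taken as the exact measure used by the system: it assumes that freedom equals the delay constraint minus the length of the longest path containing the node, and the names `path_freedom` and `most_critical` are hypothetical.

```python
from functools import lru_cache


def path_freedom(delays, succs, deadline):
    """Slack of the longest path through each node of a dataflow DAG.

    delays maps node -> delay, succs maps node -> list of successor nodes,
    and deadline is the maximum time allowed for any path to complete.
    """
    preds = {n: [] for n in delays}
    for n, successors in succs.items():
        for s in successors:
            preds[s].append(n)

    @lru_cache(maxsize=None)
    def head(n):
        # Longest path ending at n, including n's own delay.
        return delays[n] + max((head(p) for p in preds[n]), default=0)

    @lru_cache(maxsize=None)
    def tail(n):
        # Longest path starting at n, including n's own delay.
        return delays[n] + max((tail(s) for s in succs.get(n, [])), default=0)

    # head and tail both count n's delay, so subtract it once.
    return {n: deadline - (head(n) + tail(n) - delays[n]) for n in delays}


def most_critical(freedom):
    """Node with the least freedom, i.e. the one to commit first."""
    return min(freedom, key=freedom.get)
```

With delays `{"a": 2, "b": 3, "c": 1, "d": 1}`, edges a to b and c to d, and a deadline of 6, nodes a and b have a freedom of 1 while c and d have a freedom of 4, so a node on the a-b path is selected first.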
The flexible module to which a node will be bound and clock cycle at which a node will be performed are selected based on the criteria listed below. 
Intertwining Threshold
The user provides an input parameter, the intertwining threshold, which determines how much scheduling information is needed before an allocation decision can be made. This parameter is used to control the degree of intertwining of the scheduling/module selection decisions and the allocation decisions. When deciding whether or not to prune the feasible module types of a flexible module, a weighted average of the scheduling freedoms of the paths containing the bound nodes is compared to the threshold, and pruning is performed if the weighted average is greater than the threshold. Low values of the threshold cause allocation decisions to be eager, which provides early direction to the scheduling, while high values cause scheduling decisions to be eager, giving early direction to allocation.
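The threshold test can be written in a few lines. The sketch below assumes that path scheduling freedoms are normalized to the range 0 to 1 and that the weights are supplied by the caller; the text does not specify the weighting, so both the weights and the function name `should_prune` are hypothetical.

```python
def should_prune(path_freedoms, weights, threshold):
    """Decide whether to prune a flexible module's feasible module types.

    path_freedoms holds the scheduling freedoms of the paths containing the
    nodes bound to the module, and weights holds one weight per path.
    Pruning is allowed when the weighted average freedom exceeds the threshold.
    """
    total = sum(weights)
    if total == 0:
        return False  # no bound nodes yet, so nothing to base a decision on
    average = sum(f * w for f, w in zip(path_freedoms, weights)) / total
    return average > threshold
```

With freedoms 0.2 and 0.8 and equal weights, the average is 0.5, so a threshold of 0.4 permits pruning while a threshold of 0.6 defers the allocation decision until more scheduling information is available.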
EXPERIMENTAL NON-PIPELINED SYNTHESIS RESULTS
We have conducted a set of experiments to test the ability of the heuristics to navigate, and of the time-and-area constrained synthesis to prune, the search space. The first example in figure 5 demonstrates that the effects of the time and area constraints are successfully enforced. In this example, time-and-area constrained synthesis pruned all decisions that could be deduced infeasible from the initial constraints. Since almost all decisions were automatically pruned as a result of constraint enforcement, all scheduling, module selection, and allocation decisions were completed except for the limited freedom of nodes +X2 and +X14. In this example, the constraint enforcement part of the system automatically assigned all addition operations to medium speed adders, and all multiplication operations to slow multipliers.
In another set of experiments, we studied the ability of the heuristics to guide the search through the design space under tight constraints. We scheduled the differential equation example [28] , the AR-filter [27] and the FIR-filter [27] flowgraphs with constraints and results shown in figures 6, 7, and 8, respectively. Under the given constraints, the solutions identified by the algorithm are the only feasible solutions. The solutions use a rich set of modules and would not have been feasible under the single module type assumption.
To demonstrate that module selection produces designs which utilize area and delay resources more efficiently than designs which use a single module type for each operator, we compared the results of our system to results generated by the HAL [29] algorithm on the AR-filter dataflow graph. The resulting area-time curves are shown in figure 9. A clock cycle duration of 250 ns was used for this experiment.
The HAL algorithm considers only one module of each functionality, so we supplied it with each pair of adder and multiplier modules in the library shown in figure 10. Our system, which uses the full library, achieves a better area/delay curve than HAL in almost every case.
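Supplying a single-module-type scheduler with every adder/multiplier pair amounts to enumerating the Cartesian product of the module types available for each functionality. A minimal sketch follows; the library contents are hypothetical placeholders rather than the contents of figure 10, and the function name `single_type_libraries` is an assumption.

```python
from itertools import product


def single_type_libraries(library):
    """Yield every restricted library with exactly one module type per function.

    library maps a function name (e.g. "add") to a list of module type names.
    """
    functions = sorted(library)
    for combo in product(*(library[f] for f in functions)):
        yield dict(zip(functions, combo))


# Hypothetical library; the actual module types are those of figure 10.
example_library = {"add": ["fast", "med", "slow"], "mul": ["fast", "slow"]}
restricted = list(single_type_libraries(example_library))
```

Each restricted library would be given to the single-type scheduler in turn, and the resulting area/delay points compared against those obtained with the full library. The example library above yields six restricted libraries, one per adder/multiplier pair.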
In order to investigate the effect of changing the degree of intertwining of scheduling and allocation decisions, we performed scheduling on the FIR-filter example with different degrees of intertwining by changing the value of the intertwining threshold. The results are shown in figure 11. The area of each result is marked on the graph, and is annotated with the allocation of modules corresponding to that result. In these results, the threshold ranges from 0 to 1 ...

... dependencies first presented in [20]. The results and constraints of scheduling are shown in figure 16. The prescribed clock cycle limit was 14, but the resulting schedule utilized 10 clock cycles in order to meet the given latency constraint of 10 clock cycles. The area used by the design is the minimum area possible with the given constraints.
DISCUSSION SECTION
We have proposed an algorithm which generates a scheduling, allocation, and operator binding of a dataflow graph G using modules from a library of modules which is provided by the user. Synthesis is performed within a chip area constraint and timing constraints which include clock duration and the maximum number of clock cycles.

The approach used by this system to satisfy area and delay constraints is flexible and may be extended to perform tradeoffs between other conflicting constraints. For instance, a module library could additionally capture power information. Approximation of power constraint satisfaction can be easily achieved if appropriate modeling of cumulative aspects of power is incorporated.
CONCLUSIONS
In this paper we have presented an algorithm which integrates module selection into high-level synthesis of pipelined and non-pipelined designs. Furthermore, we have illustrated a heuristic algorithm which intertwines module selection and scheduling decisions. The experimental results show that modules are selected appropriately and minimum-area designs are achieved. This system successfully performs time-and-area constrained scheduling, even under tight constraints when very few solutions are possible. This success is due both to the heuristics which guide the search through the design space toward fea-
