A Unified Lower Bound Estimation Technique for High-Level Synthesis
Seong Yong Ohm, Fadi J. Kurdahi, Member, IEEE, and Nikil D. Dutt, Senior Member, IEEE
Abstract-The importance of effective lower bound estimation (LBE) techniques is well established in high-level synthesis (HLS) since it allows more efficient exploration of the design space while providing other HLS tools with the capability of predicting the effect of specific tools on the design space. Much of the previous work has focused on LBE techniques that use very simple cost models which primarily focus on the functional unit resources. With the push toward submicron technologies, simple models that use functional unit resources alone are not accurate enough to allow effective design space exploration since the effects of storage and interconnect can indeed dominate the cost function. In this paper, we present an integrated approach aimed at predicting lower bounds on hardware resources needed to implement a behavioral description within a given amount of time. Our area cost model accounts for storage (register) and interconnect resources (buses) in addition to functional resources. Our timing model uses a finer granularity that permits the modeling of functional unit, register, and interconnect delays. Our approach is integrated because we consider the dependencies between the different types of resources as well as the ordering in which the resources are allocated. We tested our technique for functional unit, storage, and interconnect requirements on several high-level synthesis benchmarks, and observed near-optimal results. We believe that our comprehensive LBE approach can lead to better quality HLS solutions in less time, and we demonstrate this approach in our paper.
I. INTRODUCTION

ESTIMATION plays a central role in guiding the design tasks to optimal or near-optimal solutions. While accurate estimation is somewhat important for physical and logic design tasks, it is even more crucial when the design process is started at a higher level of abstraction. Decisions made at this level have a pronounced impact on the final design. However, the impact of these decisions cannot be found until later in the design process. Therefore, in order for such high-level design tasks [mainly high-level synthesis (HLS)] to produce reliable results, such tasks must rely on realistic and accurate models of hardware components. Without such realistic models, the synthesis process is likely to produce designs that do not satisfy cost and/or timing constraints, resulting in unnecessary iterations through the design cycle and increasing the design turn-around time.
Much of the earlier design prediction work assumed the existence of netlist-based design descriptions as inputs, and hence produced netlist-based estimators [1]. However, these techniques can only be used after the design data path is synthesized to provide back-end feedback. If, on the other hand, the designer starts with no feedback at all, or with incorrect feedback, then there is no guarantee that the design decisions initially made would indeed be the correct ones which would produce the desired outcome. Thus, it is very important to give the designer front-end feedback as initial guidance in making design decisions. Specifically, we need to have the capability of bounding the design space prior to starting the HLS tasks.
In order to achieve this goal, we propose an integrated approach to predict lower bound estimates on resources, given a data flow graph (DFG) description of the design and a performance goal. Our approach is more comprehensive than the previous work in the sense that our cost model accounts for registers and buses as well as functional units (FU's), and that we use an expanded timing model which can accommodate FU delays, register delay, and interconnect delays altogether.
Whereas much of the previous work in the area has focused on techniques that use very simple cost models which primarily focus on the FU resources, our cost model accounts for storage (register) and interconnect resources (buses) in addition to functional resources. Our studies of various libraries of RT level components indicate that register costs can easily surpass those of "large" functional units such as adders. Table I shows such an example from the VTI 0.8 µm cell library. According to this table, the area of a register may indeed be larger than that of an adder, implying that without including register cost, we cannot accurately predict the actual area cost. In addition to register cost, studies [2] have indicated that interconnect costs can be significant. Indeed, the move toward submicron technologies will result in interconnect costs being the dominant contributor toward design performance. Thus, we need to incorporate interconnect as well as registers in estimation for better prediction.
In addition, our timing model uses a finer granularity that permits the modeling of FU, register, and interconnect delays. Table I also indicates that the register delay cannot be ignored, so we need to consider register delay as well. Our approach accommodates all of these factors in estimation, along with transfer delays among these hardware resources such as bus-to-register delay, register-to-bus delay, FU-to-bus delay, bus-to-FU delay, FU-to-FU (chaining) delay, control delay, and so on. In our model, these delays are expressed in real time units such as nanoseconds (ns), and are used along with the total performance constraint, which is also expressed in real time. This provides a more accurate analysis of the starting and completion times of each operation in the input data flow graph.
More importantly, our approach is integrated because we consider the dependencies between the different types of resources as well as the ordering in which the resources are allocated. This results in added flexibility since designers can constrain one type of resource and predict the remaining resources. Furthermore, our technique can predict the outcome of different HLS strategies having different allocation priorities. This allows the designer to further explore the design space, analyze resource allocation tradeoffs, and perform a true "what if" analysis before committing to a particular strategy and a set of resources.
We have developed efficient algorithms and heuristics to support this model. Our initial experiments on some HLS benchmarks [3] indicate that this model is quite accurate. This estimation scheme naturally lends itself to encapsulation within system level synthesis frameworks by providing early and accurate estimates of design quality when large behavioral descriptions are partitioned onto several chips, without the need of running HLS tools to obtain full design netlists.
The rest of this paper is organized as follows. Section II describes previous and related work. Section III describes our architectural models, and defines the problem at hand, and Section IV gives an overview of the estimation algorithm. Sections V-VII describe the lower bound estimation techniques for functional unit, storage (register), and interconnect (bus) costs, respectively. Section VIII explains how to integrate these separate lower bound estimation techniques, and how to explore the design space for better estimates on the total area cost of hardware resources. Section IX presents experimental results on several HLS benchmarks. Section X concludes with a summary.
II. PREVIOUS WORK
Some of the recent work on estimating lower bounds on area cost and total control steps (or csteps) is described in [4]-[14]. All of these works (with the exception of [5], [12]-[14]) are mainly concerned with FU's in their area cost models. The work in [4] proposes a mathematical model for predicting the area-delay curve. An extension of this work is described in [5], and addresses lower bounds on time and area cost including interconnect cost, but not register cost. Reference [6] proposes an ILP formulation for lower bound estimation of performance given resource constraints. Reference [7] addresses lower bounds on time and FU cost for a functional pipelined data flow graph, but considers neither register cost nor interconnect cost. Reference [9] also addresses lower bounds on time and FU cost. It uses these two estimation algorithms to predict the system level area-delay curve. However, it does not consider register and interconnect cost in estimation. Reference [10] presents a formal approach which seems to estimate FU cost better than [5] and [6] in some benchmarks. It considers the interdependency of the bounds of different FU types, but considers neither registers nor interconnects in estimation. Reference [11] finds the lower bound on FU cost, and utilizes it in finding an optimal scheduling result effectively, but registers and interconnects are not considered. Reference [12] uses an ILP formulation in calculating lower bounds on the number of FU's, registers, and buses separately. However, it does not take into account the dependencies among the number of resources of each type in estimating the lower bounds, and furthermore, the solution can be computationally expensive. Reference [13] presents an integrated area-delay prediction model which includes FU, register, and interconnect costs for use in system level partitioning. However, it does not consider interdependencies between hardware resources.
Finally, [14] considers a generalized memory hierarchy scheme for a hardware/software codesign model, and predicts the sizes of the various memory components to achieve a given performance goal.
III. ARCHITECTURAL MODEL AND PROBLEM DEFINITION
In high-level synthesis, an RT-level system that consists of functional units, storage, and interconnect is synthesized from the behavioral description. We assume that each RTL operation can be executed using one type of RTL unit (e.g., an add operation can be executed by one type of adder). In order to estimate accurately the number of resources required to implement the given behavior, we need to define a target architecture. In our approach, we consider two styles of target architectures as shown in Fig. 1: multiplexer-based and bus-based architectures. Bus-based architectures are usually better suited for designs with memories and register files, and are preferred due to the fact that interconnect is shared. In this paper, we confine our scope to bus-based architectural models [Fig. 1(b)]. Estimating lower bounds on multiplexer-based architectures, on the other hand, is much more difficult due to the fact that one must consider the impact of several hard optimization steps aimed at reducing the number and size of multiplexers (e.g., switching the inputs of commutative operations). Such architectures will be considered in future work. We also assume that registers are used for storage resources. Registers can also be grouped into register files. In the bus-based architecture, we assume that each data transfer between an FU and a register occurs only through buses, and that each data value produced by a functional unit is stored in a register through buses so that it can be used in a later clock cycle. However, the data value need not be stored in a register if its source and destination operations are chained.
Since actual interconnect delays as well as the total area cost are dependent on the layout synthesis techniques such as placement and routing, we cannot easily estimate these figures. Therefore, it is currently assumed that users provide some initial estimates of these interconnect delays. Future linkage to physical level estimation tools will allow these delays to be more accurately estimated. Such linkage will also allow us to incorporate interconnects more accurately in the total area cost.
In our approach, the behavioral description, expressed in the form of a data flow graph, is given as input, and the total performance and clock period expressed in real time are given as constraints. In addition, FU delay, register delay, and interconnect delay are given by users or selected from a library. Given this information, we estimate lower bounds on the number of FU's of each type, on the number of registers, and on the number of buses (bidirectional or unidirectional).
IV. OVERVIEW OF OUR APPROACH
A. Basic Idea
Our estimation methods are based on the following principle: if $n$ objects are distributed over $m$ slots, at least $\lceil n/m \rceil$ objects are assigned to some slot within those $m$ slots. This can be stated slightly differently if we are talking about a time slot interval $I$ whose length is $|I|$. If $n$ objects are guaranteed to be assigned within interval $I$, at least $\lceil n/|I| \rceil$ components are required to perform these objects. That is, this value can be used as the lower bound on the number of components to perform these objects. In our context, an object may represent an operation performed by an FU, a data value to be stored in a register, or a data transfer between hardware resources, while each time slot represents a control step (or clock cycle). To estimate tighter lower bounds, we compute these values over all of the possible intervals, and then choose the maximum one as the lower bound.
While this principle has been used for FU estimation in prior work, in our work, we have extended this approach for register and bus estimations as well.
B. The Area Cost Estimation Algorithm
In this paper, $ASAP_i$ ($ALAP_i$) denotes the earliest (latest) cstep in which operation $O_i$ can be started without violating either the timing constraint or the precedence relations between operations. We also use the last cstep in which $O_i$ is completed when it is scheduled in a given cstep. In determining these values for each operation, we take into account the predefined transfer delays, including the delays of registers and interconnects, along with the delay of the FU itself, thus providing a more sophisticated timing model. In this paper, the cstep interval spanned by these values is called the time frame of operation $O_i$. We estimate the FU cost, register cost, and bus cost using these time frames. Our scheme, however, is flexible enough to allow modification of the estimation order among FU's, registers, and buses, allowing better design space exploration. For the purpose of explanation only, we assume that each estimation step is independent. The integration of these estimation steps and the effects of the ordering will be described later in Section VIII. Fig. 2 shows the overall structure of the area cost estimation algorithm LBE, when FU cost is estimated first, register cost next, and bus cost last.
We illustrate our approach with a walk-through example. Fig. 3 shows an input DFG of the differential equation example [3] which solves the second-order differential equation $y'' + 3xy' + 3y = 0$, and Fig. 4 shows the initial time frames of the operations when the total performance (maximum delay) is 80 ns and the clock period is 20 ns. In this example, we assume that the total interconnect delay between an FU and a register is given as 4.5 ns, and that additions, comparisons, and subtractions are executed by ALU's with 10 ns delay and multiplications by multipliers with 15 ns delay.
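As an illustration of how time frames can be derived from these constraints, the following minimal sketch computes ASAP and ALAP start csteps for a small DFG. It is a simplification under stated assumptions, not the paper's timing model: chaining is ignored, each operation is assumed to occupy ceil((FU delay + interconnect delay) / clock period) csteps, and the operation names and edges are hypothetical rather than taken from Fig. 3.

```python
from math import ceil


def time_frames(latency, edges, total_csteps):
    """Return {op: (ASAP, ALAP)} start csteps, numbered from 1.

    latency: {op: number of csteps the operation occupies}
    edges:   (producer, consumer) pairs of the data flow graph
    """
    preds = {op: [] for op in latency}
    succs = {op: [] for op in latency}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    asap, alap = {}, {}

    def earliest(op):
        # An operation starts after all of its predecessors have completed.
        if op not in asap:
            asap[op] = max((earliest(p) + latency[p] for p in preds[op]), default=1)
        return asap[op]

    def latest(op):
        # An operation must complete before its earliest-needed successor starts.
        if op not in alap:
            finish_by = min((latest(s) for s in succs[op]), default=total_csteps + 1)
            alap[op] = finish_by - latency[op]
        return alap[op]

    return {op: (earliest(op), latest(op)) for op in latency}


# Constraints from the walk-through: 80 ns total delay, 20 ns clock period.
TOTAL_CSTEPS = 80 // 20
# Assumption: FU delay plus the 4.5 ns interconnect delay, rounded up to csteps.
MUL = ceil((15 + 4.5) / 20)
ALU = ceil((10 + 4.5) / 20)

# Hypothetical DFG fragment (names and edges are illustrative only).
latency = {"m1": MUL, "m2": MUL, "m3": MUL, "m4": MUL, "m5": MUL, "s1": ALU, "s2": ALU}
edges = [("m1", "m3"), ("m2", "m3"), ("m3", "s1"), ("s1", "s2"), ("m4", "m5"), ("m5", "s2")]
frames = time_frames(latency, edges, TOTAL_CSTEPS)
```

Under these assumptions, operations on the critical path get a single-cstep time frame, while the off-path multiplications `m4` and `m5` keep one cstep of slack.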
V. FU COST ESTIMATION
A. Basic Technique
If $n$ operations of type $t$ are scheduled in the interval $I$, then at least $\lceil n/|I| \rceil$ FU's of type $t$ are required. Now, given a particular operation $O_i$, clearly $O_i$ is guaranteed to be scheduled in $I$ if its time frame is fully included in $I$. For each interval $I$, we find the number of operations of type $t$ guaranteed to be scheduled during that interval, and estimate the lower bound on the number of FU's of type $t$. After that, we enumerate all of the cstep intervals in [1, total cstep number] to get a tighter lower bound. The maximum such number over all enumerated intervals yields an estimated lower bound on the FU cost of type $t$. Fig. 5 illustrates how the lower bound on the number of FU's of each type is estimated. For example, four of the multiplications in Fig. 4 are guaranteed to be executed in a common cstep interval since their time frames are fully included in this interval. So four divided by the length of that interval, rounded up, is estimated as a candidate lower bound on the number of multipliers for this interval. To get a tighter lower bound, these candidates are estimated over all of the cstep intervals, and the maximum one is selected. In this case, two is finally chosen as the lower bound on the number of multipliers. In a similar way, the lower bound on ALU count is estimated as two.
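The basic technique can be sketched in a few lines. The function below is a minimal illustration for single-cycle operations of one type; the time frames in the example are hypothetical values chosen for illustration and are not read from Fig. 4.

```python
from math import ceil


def fu_lower_bound(frames, total_csteps):
    """Lower bound on the FU count for one operation type.

    frames: {op: (ASAP, ALAP)} time frames of single-cycle operations.
    """
    best = 0
    for start in range(1, total_csteps + 1):
        for end in range(start, total_csteps + 1):
            # Operations whose time frame lies entirely inside [start, end]
            # are guaranteed to be scheduled in that interval.
            inside = sum(1 for lo, hi in frames.values() if start <= lo and hi <= end)
            best = max(best, ceil(inside / (end - start + 1)))
    return best


# Hypothetical time frames for six multiplications in a 4-cstep schedule.
mult_frames = {
    "m1": (1, 1), "m2": (1, 1), "m3": (2, 2),
    "m4": (1, 2), "m5": (2, 3), "m6": (1, 3),
}
mult_bound = fu_lower_bound(mult_frames, total_csteps=4)  # evaluates to 2
```

In this example the interval [1, 2] contains four time frames, giving ceil(4/2) = 2, and no other interval yields a larger candidate.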
B. Generalization
The above basic idea is generalized to support multicycling, chaining, and functional pipelining of operations. Fig. 6 shows an extended version of the FU cost estimation algorithm. In this algorithm, we consider the set of operations of type $t$ and, for each operation $O_i$ in this set, the intersection of interval $I$ with the earliest (latest) cstep interval during which $O_i$ may be performed by an FU of type $t$.
$O_i$ is guaranteed to occupy an FU in the interval $I$ for at least as many csteps as the smaller of these two intersections contains. Therefore, the sum of these minimum occupancies over all operations of type $t$ represents the total number of FU slots required in the interval $I$, and this sum divided by $|I|$, rounded up, yields a lower bound on the number of FU's of type $t$ estimated for interval $I$. We enumerate all of the cstep intervals to get a tighter lower bound, and then select the maximum one as the lower bound on the number of FU's of type $t$.
For each nonpipelined operation $O_i$ with a delay of less than or equal to one cstep, the earliest and latest execution intervals reduce to the single csteps $ASAP_i$ and $ALAP_i$, respectively. Therefore, we can apply the same algorithm for chained operations as well as multicycling operations without any modification.
As for the pipelined operations, we may have to count the number of slots required for each stage in order to estimate the number of pipelined FU's, assuming that each $k$-stage pipelined FU is divided into $k$ separate operations. In this algorithm, however, we only have to consider the first stage in estimating the lower bound on the number of the pipelined FU's. This is because the number of slots required for a later stage in an interval is less than or equal to the number of slots required for the first stage in another interval of the same length. Therefore, for each pipelined operation $O_i$, we take its earliest and latest execution intervals to be those of its first stage, and then apply the same algorithm. An estimated lower bound on the total FU cost is derived by applying the above procedure for all of the types.
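The generalized slot count can be sketched as follows. This is an illustrative reading of the procedure rather than the algorithm of Fig. 6: each operation is described by a hypothetical tuple (ASAP, ALAP, latency in csteps), and its guaranteed occupancy of an interval is taken as the smaller of the overlaps of its earliest and latest execution intervals with that interval.

```python
from math import ceil


def overlap(a_start, a_end, b_start, b_end):
    """Number of csteps shared by the closed intervals [a_start, a_end] and [b_start, b_end]."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def fu_lower_bound_multicycle(ops, total_csteps):
    """ops: list of (ASAP, ALAP, latency) tuples for operations of one type."""
    best = 0
    for start in range(1, total_csteps + 1):
        for end in range(start, total_csteps + 1):
            slots = 0
            for asap, alap, latency in ops:
                early = overlap(asap, asap + latency - 1, start, end)
                late = overlap(alap, alap + latency - 1, start, end)
                # The operation occupies an FU for at least min(early, late) csteps here.
                slots += min(early, late)
            best = max(best, ceil(slots / (end - start + 1)))
    return best


# Three hypothetical two-cycle multiplications in a 4-cstep schedule.
two_cycle_ops = [(1, 1, 2), (1, 2, 2), (1, 3, 2)]
two_cycle_bound = fu_lower_bound_multicycle(two_cycle_ops, total_csteps=4)  # evaluates to 2
```

For a pipelined operation, the same function could be called with a latency of one cstep, which corresponds to counting only the first stage as described above.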
C. Refinement of the FU Estimates
Sometimes our estimated lower bounds are loose since our search may include some nonfeasible situations. However, we can use those bounds as an initial solution which is further refined to obtain tighter lower bounds. Fig. 7 shows how to tighten the lower bound. We use the lower bound obtained so far as the initial solution, and iteratively apply a relaxation-based technique to improve the solution into a tighter lower bound.
Basically, we assume that as many FU's are used as the initial lower bound, and then adjust the time frames of operations using this FU constraint. We temporarily schedule each operation $O_i$ into its ASAP cstep, and then calculate the lower bound again. When estimating the lower bound again, we may need to consider all of the possible intervals in [1, total cstep number] to get a tighter bound. In this procedure, however, we consider only one cstep interval instead of considering all of the possible intervals to improve the speed of the reestimation. We select this interval because only the time frames of the predecessors of $O_i$ may be affected by scheduling $O_i$ into $ASAP_i$. Once the new lower bound on the number of FU's of type $t$ is reestimated for this interval, we compare it with the FU constraint. If the new one is larger than that constraint, it implies that $O_i$ could not be scheduled into $ASAP_i$ under the FU constraint, and thus we need not include $ASAP_i$ in $O_i$'s time frame. Therefore, in that case, we can increment $ASAP_i$ (and the corresponding completion cstep) by one. In a similar way, we adjust the ALAP cstep of each operation. For example, suppose the initial time frame of a multiplication is [1, 2] as shown in Fig. 4. If we assume that the number of multipliers available is equal to the lower bound on the number of multipliers (2), this multiplication cannot be scheduled into cstep 1 since at least two other multiplications are guaranteed to be scheduled into cstep 1. Therefore, its time frame shrinks to [2, 2]. In a similar way, we can adjust the time frames of the other operations. Fig. 8 shows the time frames of the operations modified by this update procedure. In this example, the time frames of some operations are modified compared with those in Fig. 4.
Once the time frames are adjusted, we check whether both the FU constraint and the performance constraint already set are satisfied by checking whether all of the time frames are still valid (see footnote 2). If not, then we need to relax the FU constraint. So we increment the lower bound by one and repeat the above procedure. Otherwise, we estimate a new lower bound, again using the adjusted time frames. If we can find a tighter lower bound, it contradicts the assumption that as many FU's are used as the initial lower bound. So, we increment the initial lower bound by one, and repeat the above procedure with this increased lower bound as the new FU constraint.
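The time frame adjustment step can be sketched as follows. This is a simplified illustration, not the procedure of Fig. 7: it handles only single-cycle operations of one type, only tightens the ASAP side, checks only the single cstep into which an operation is tentatively scheduled, and does not propagate changes to predecessors or successors. The time frames are hypothetical.

```python
def tighten_asap(frames, fu_limit):
    """Shrink time frames under the assumption that only fu_limit FU's exist.

    frames: {op: (ASAP, ALAP)} for single-cycle operations of one type.
    Returns a new dict; a frame with ASAP > ALAP has become invalid.
    """
    frames = dict(frames)
    changed = True
    while changed:
        changed = False
        for name in list(frames):
            lo, hi = frames[name]
            if lo > hi:
                continue  # already invalid, nothing left to tighten
            # Other operations that are forced into cstep lo.
            forced = sum(
                1 for other, (a, b) in frames.items()
                if other != name and a == b == lo
            )
            if forced + 1 > fu_limit:
                # The operation cannot start in cstep lo with fu_limit FU's.
                frames[name] = (lo + 1, hi)
                changed = True
    return frames


def all_valid(frames):
    """A time frame is valid only if ASAP <= ALAP."""
    return all(lo <= hi for lo, hi in frames.values())


mult_frames = {
    "m1": (1, 1), "m2": (1, 1), "m3": (2, 2),
    "m4": (1, 2), "m5": (2, 3), "m6": (1, 3),
}
tightened = tighten_asap(mult_frames, fu_limit=2)
```

With two multipliers assumed, `m4` can no longer start in cstep 1 because `m1` and `m2` are fixed there, so its frame shrinks from (1, 2) to (2, 2), mirroring the example in the text. If `all_valid` returned `False`, the FU constraint would have to be relaxed by one and the procedure repeated.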
D. Time Complexity of the FU Cost Estimation Algorithm
The time complexity in estimating the initial FU cost is and that in refining the estimation is roughly , where and represent the number of operations and the number of edges in the given DFG, respectively, and is the total number of csteps. Thus, the total time complexity of this algorithm is .
VI. REGISTER COST ESTIMATION
A. Basic Technique
The main difficulty in estimating the register cost arises from the fact that no prior scheduling is assumed. This means that the lifetime of each data value (variable) is not known a priori since it is only known once scheduling is performed. Therefore, the estimation approach should consider all of the possible lifetimes of all of the variables. Fig. 9 shows our register cost estimation algorithm. In this algorithm, a variable denotes a data value passed from one operation to another. The producing operation is called the source of the variable, and the consuming operation the destination of the variable. The weight of a variable for interval $I$ is the portion of the variable's lifetime guaranteed to be included in $I$, taken over all possible schedules. Since the sum of the weights of all variables is the total number of register slots required for interval $I$, this sum divided by $|I|$, rounded up, represents the minimum number of registers required for that interval, that is, a lower bound on the register count. We enumerate all of the cstep intervals in [1, total cstep number] to get a tighter lower bound, and then select the maximum one as the lower bound on the register count.
2 We say that the time frame of operation $O_i$ is valid only if $ASAP_i \le ALAP_i$.
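A minimal sketch of the basic register bound is given below. It rests on assumptions that are not stated in the text: all operations are single-cycle, a value produced in cstep s and consumed in cstep d occupies a register during csteps s+1 through d, and the guaranteed part of a lifetime is approximated by the span that is live in every schedule, from the source's ALAP plus one to the destination's ASAP. The variable data are hypothetical.

```python
from math import ceil


def register_lower_bound(variables, total_csteps):
    """variables: {name: (ALAP of source, ASAP of destination)}."""
    best = 0
    for start in range(1, total_csteps + 1):
        for end in range(start, total_csteps + 1):
            slots = 0
            for src_alap, dst_asap in variables.values():
                # Span during which the value is live in every possible schedule.
                live_from, live_to = src_alap + 1, dst_asap
                # Weight: part of that span falling inside [start, end].
                slots += max(0, min(end, live_to) - max(start, live_from) + 1)
            best = max(best, ceil(slots / (end - start + 1)))
    return best


# Hypothetical variables in a 4-cstep schedule.
variables = {"a": (1, 3), "b": (1, 2), "c": (2, 3), "d": (3, 2)}
register_bound = register_lower_bound(variables, total_csteps=4)  # evaluates to 2
```

Variable `d` shows why the basic weights are often small: its source may finish after its destination's earliest start, so no part of its lifetime is guaranteed and it contributes nothing to any interval.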
B. Improvements
This basic technique, however, suffers from a serious drawback: in most cases, the weight of a variable is too small, and thus this basic procedure would yield a trivial lower bound on the register cost. In order to alleviate this problem, we need to find the largest possible weights for the variables. In this paper, we apply two improvement techniques, called fan-out reduction and variable merging.
1) Fan-Out Reduction: When several fan-out variables share the same source, a fan-out whose lifetime is always included in that of another fan-out need not be considered when estimating the register count. For example, if one destination is data dependent on another destination of the same source, then we only have to consider the fan-out to the dependent destination in our basic procedure. This technique will help reduce the total problem size, and also will simplify the variable merging problem explained later. Fig. 10 illustrates how the fan-out reduction technique is applied to a simple example. As an example, two of the fan-outs in this figure have the same source, but different destinations. Since one destination is always scheduled no later than the other regardless of scheduling results when the total number of csteps is 4, the lifetime of one fan-out always includes that of the other. So we do not consider the covered fan-out for estimation purposes. In a similar way, we do not consider any other fan-out which is covered by another fan-out.
2) Variable Merging: Since our goal is only to count the number of registers, we do not need to assume a particular register binding. In this case, the lifetimes of some variables can be merged for the purpose of estimation only. For example, if an operation is not pipelined, one of its input variables and its output variable can be merged as a new variable with a larger weight, since they cannot be active at the same time and their merged lifetime covers their previous lifetimes. This modification helps increase the variable weights used in Fig. 9, and also reduces the number of variables or problem size. In this technique, shared variables are not considered for merging. However, by the fan-out reduction procedure described above, we can reduce the number of shared variables.
An example of variable merging is shown in Fig. 10. Note that neither of the two variables is individually guaranteed to be included in interval [3, 3] since the operation between them can be scheduled in cstep 2 or 3. However, we can easily recognize that one of these two variables is guaranteed to be active in interval [3, 3] regardless of the scheduling result of that operation. This implies that at least one register is required for one of these two data values during interval [3, 3]. Therefore, we merge the two variables into a new variable. Note that this merging is done in order to improve the estimation quality, and does not necessarily imply an assignment of values to registers. In a similar way, other pairs of variables in the figure can be merged.
C. Refinement of the Register Estimate
As in the FU estimation case, the initial lower bound on register count can be refined further to obtain a tighter lower bound. In this case, the maximum number of values which can be active during the same cstep is restricted to the initial lower bound on register count. Fig. 11 shows the algorithm which refines the register estimate. Note that this algorithm is very similar to that in Fig. 7 which refines the lower bounds on FU counts. In this algorithm, however, as many registers are assumed to be used as the initial lower bound on register count.
D. Register Files
If registers are grouped into a register file, then the estimation can still be used to estimate the register file size. However, there may be some restrictions on the maximum number of simultaneous accesses to the register file. Say the available register files can have at most $p$ ports. For a given performance constraint, our algorithms can estimate the maximum number of simultaneous accesses to memory over all control steps, $A_{max}$. Then we can estimate that at least $Est_{RegisterFiles} = \lceil A_{max}/p \rceil$ register files are needed to accommodate the performance requirements. Since no assignment of values to storage has been accomplished, we must assume that storage is distributed equally among the register files. Thus, the size of each register file is estimated as the register lower bound divided by $Est_{RegisterFiles}$.
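The register file estimate amounts to two divisions, sketched below. The function and parameter names are hypothetical, and rounding both quotients up is an assumption made here so that the results are whole numbers of files and registers.

```python
from math import ceil


def register_file_estimate(max_accesses, ports_per_file, register_bound):
    """Estimate the number of register files and the size of each file.

    max_accesses:   maximum number of simultaneous register accesses in any cstep
    ports_per_file: maximum number of ports a register file can have
    register_bound: lower bound on the total register count
    """
    files = ceil(max_accesses / ports_per_file)
    # Assumption: storage is distributed equally among the register files.
    registers_per_file = ceil(register_bound / files)
    return files, registers_per_file


# Hypothetical figures: 5 simultaneous accesses, 3-port files, 7 registers.
files, size = register_file_estimate(max_accesses=5, ports_per_file=3, register_bound=7)
```

With these illustrative numbers, two register files of four registers each are estimated.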
E. Time Complexity of the Register Cost Estimation Algorithm
The time complexity is in calculating the weights of variables and in the improvement steps, in estimating the initial register cost, and in refining the register cost, where represents the number of operations, the number of values to be stored in registers, and the total number of csteps. Thus the total time complexity of this algorithm is .
VII. BUS ESTIMATION
A. Basic Technique
In this paper, we assume that each data transfer between an FU and a register occurs only through buses. Therefore, the number of buses required is determined by the maximum number of concurrent data transfers via buses. We find the maximum number of concurrent data transfers from registers to FU's and then that from FU's to registers. Fig. 12 shows our bus cost estimation algorithm. In this algorithm, we first define the set of data values which should be stored in registers. A variable whose source and destination can be chained into the same cstep need not be stored in a register; such a variable is not included in this set. For each cstep interval $I$, the algorithm forms the set of data values to be transferred from registers to FU's during $I$, and the set of data values generated by FU's and transferred to be stored in registers during $I$. Therefore, the size of the first set divided by $|I|$, rounded up, represents a lower bound on the number of data transfers from registers to FU's that must occur concurrently in some cstep of $I$. Similarly, the size of the second set divided by $|I|$, rounded up, represents the corresponding lower bound for data transfers from FU's to registers during interval $I$. We enumerate all of the cstep intervals in [1, total cstep number] to get a tighter lower bound, and then select the maximum value over all such intervals as the lower bound on the corresponding bus count.
The set of register-to-FU transfers in the algorithm, however, may include two or more different fan-out variables which represent the same data value. If such data values are used during the same cstep, we may need only one data transfer for those values, reducing the number of concurrent bus accesses. In order to detect such fan-out values, we analyze the time frames of the destinations of the fan-out variables, and determine whether they always use the data value at the same cstep.
The total number of buses can be estimated using these two lower bounds according to the bus style in the target architecture: separate unidirectional buses (input buses and output buses) or bidirectional buses (I/O buses). We add the two lower bounds for the unidirectional bus style, while choosing the maximum of them for the bidirectional bus style.
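The combination of the two transfer bounds can be sketched as follows. The sketch assumes that each transfer is described by the cstep range in which it may occur and that each transfer takes one cstep; the transfer frames and the style names are hypothetical, and the detection of shared fan-out values is omitted.

```python
from math import ceil


def transfer_lower_bound(frames, total_csteps):
    """Lower bound on concurrent transfers; frames is a list of (first, last) csteps."""
    best = 0
    for start in range(1, total_csteps + 1):
        for end in range(start, total_csteps + 1):
            inside = sum(1 for lo, hi in frames if start <= lo and hi <= end)
            best = max(best, ceil(inside / (end - start + 1)))
    return best


def bus_lower_bound(read_frames, write_frames, total_csteps, style):
    """Combine register-to-FU (read) and FU-to-register (write) transfer bounds."""
    reads = transfer_lower_bound(read_frames, total_csteps)
    writes = transfer_lower_bound(write_frames, total_csteps)
    if style == "unidirectional":
        return reads + writes      # separate input and output buses
    if style == "bidirectional":
        return max(reads, writes)  # shared I/O buses
    raise ValueError(f"unknown bus style: {style}")


# Hypothetical transfer frames in a 4-cstep schedule.
read_frames = [(1, 1), (1, 1), (1, 1), (2, 2), (2, 3)]
write_frames = [(1, 1), (1, 1), (2, 2)]
uni = bus_lower_bound(read_frames, write_frames, 4, "unidirectional")
bi = bus_lower_bound(read_frames, write_frames, 4, "bidirectional")
```

Here three reads are forced into cstep 1 and two writes are forced into cstep 1, so the sketch yields five unidirectional buses or three bidirectional buses.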
B. Refinement of the Bus Estimates
Just like the refinement of other resource estimation, the initial lower bounds on bus counts can be refined further to obtain tighter lower bounds. In this procedure, however, as many input/output buses are assumed to be used as the initial lower bound. Fig. 13 shows the algorithm for tightening the bus estimates. This algorithm is very similar to those in Figs. 7 and 11 which are used for refining the lower bounds on FU counts and on register count, respectively.
C. Time Complexity of the Bus Cost Estimation Algorithm
The time complexity is in obtaining and for all of the intervals and in refining the bus cost, where represents the number of operations, the number of values to be stored in registers, and the total number of csteps. Thus, the total time complexity of this algorithm is .
VIII. INTEGRATED RESOURCE ESTIMATION
In the previous sections, we assumed that the estimation algorithms for each class of resources (i.e., FU's, registers, or buses) are applied independently given a data flow graph and timing constraints. However, since resource requirements are interdependent, such a strategy could lead to overall estimates representing nonfeasible solutions. Thus, we need to consider the dependencies between the different types of resources to get more realistic estimates.
As an example, consider the simple data flow graph shown in Fig. 14(a), which consists of two additions, one multiplication, and five values to be stored in registers. For this example, there exist only two possible schedules, shown in Fig. 14(b) and (c), respectively, if the total number of csteps is two. The schedule in Fig. 14(b) requires two adders, one multiplier, and two registers, whereas that in Fig. 14(c) requires one adder, one multiplier, and three registers. This means that the lower bounds on the numbers of adders and multipliers are one each, and that on register count is two, since we consider all of the possible schedules in our estimation. However, there is no possible schedule which can be implemented by one adder, one multiplier, and two registers. Therefore, in this case, it is more realistic to estimate one adder, one multiplier, and three registers as lower bounds if the area cost of the register unit is less than that of the adder, or to estimate two adders, one multiplier, and two registers if the area cost of the register unit is larger than that of the adder.
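The Fig. 14 argument can be replayed numerically. The sketch below uses the resource counts quoted in the text for the two schedules. The area values and all function names are hypothetical, chosen only to illustrate how the preferred estimate flips with the relative cost of a register and an adder.

```python
from typing import Dict

Counts = Dict[str, int]

# Resource requirements of the two possible schedules, as stated in the text.
SCHEDULES: Dict[str, Counts] = {
    "Fig. 14(b)": {"adder": 2, "multiplier": 1, "register": 2},
    "Fig. 14(c)": {"adder": 1, "multiplier": 1, "register": 3},
}

def independent_bounds(schedules: Dict[str, Counts]) -> Counts:
    """Per-resource minimum over all schedules (each type estimated alone)."""
    types = next(iter(schedules.values())).keys()
    return {t: min(s[t] for s in schedules.values()) for t in types}

def is_feasible(bounds: Counts, schedules: Dict[str, Counts]) -> bool:
    """True if some schedule fits within the given resource counts."""
    return any(all(s[t] <= bounds[t] for t in bounds) for s in schedules.values())

def cheapest_schedule(schedules: Dict[str, Counts], area: Dict[str, float]) -> str:
    """Name of the schedule with the smallest total area (areas are assumed)."""
    return min(schedules, key=lambda n: sum(area[t] * c for t, c in schedules[n].items()))

bounds = independent_bounds(SCHEDULES)
# bounds is one adder, one multiplier, two registers: not realizable.
infeasible = not is_feasible(bounds, SCHEDULES)

# Hypothetical area units: register cheaper than adder, then more expensive.
cheap_reg = cheapest_schedule(SCHEDULES, {"adder": 10, "multiplier": 50, "register": 4})
costly_reg = cheapest_schedule(SCHEDULES, {"adder": 10, "multiplier": 50, "register": 12})
```

Under these assumed areas, `cheap_reg` selects the schedule of Fig. 14(c) and `costly_reg` selects that of Fig. 14(b), matching the two cases described above.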
In order to obtain more realistic lower bounds, we estimate a conditional lower bound on a type of resource subject to other types of resources rather than an absolute lower bound. We assume that resources of a particular type are allocated as many times as the estimated lower bound at the estimation step for that type. This constraint may restrict the time frames of the operations and the lifetimes of variables, and thus affect the estimation of other resources.
Consider the example in Fig. 14 again, assuming that the FU cost estimation is followed by register cost estimation. When we estimate the register cost for that example, we assume that one adder and one multiplier are used since their lower bounds estimated in the FU cost estimation step are one each. This constraint forces addition operation to be scheduled in cstep 2, and also fixes the lifetimes of input/output values . From the fixed lifetimes, our algorithm estimates three registers as the lower bound on storage.
IX. EXPERIMENTAL RESULTS
A. Implementation
We implemented our integrated estimation approach in the C language on a SUN Sparcstation. The structure of the system is shown in Fig. 15. The behavioral description of the target design is specified in the form of a data flow graph (DFG). A constraint on the total execution time is also specified by the user. A library of components provides the area and delay data for each component, including registers and estimates of bus delay.
Since our integrated approach incorporates the dependencies between the different types of resources, the ordering of estimation is very important. Such an ordering may be determined by the area cost of each resource type or by users. By default, the system assumes that the resource type with the largest area cost is estimated first, then the second largest, and so on. Our scheme provides flexibility to modify the estimation order, so designers can constrain one type of resource and predict the remaining resources subject to that constraint. Furthermore, our technique can predict the outcome of different HLS strategies having different allocation priorities. Additionally, the designer may choose to constrain each type of resource individually, and run the system to estimate the remaining resources subject to those constraints. This allows the user to further explore the design space, analyze resource allocation tradeoffs, and perform true "what if" analysis prior to running the synthesis tasks. Once resource estimates are obtained, the user may use them as constraints when running the synthesis tools themselves, thus ensuring predictable results and reducing the overall number of iterations in running the synthesis tools. This flexibility is unique to our estimation approach.
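The default ordering and the conditional estimation it drives can be sketched as follows. This is a minimal illustration, not our C implementation: the function names, the estimator interface, and the toy estimators (which merely replay the adder/register tradeoff of Fig. 14) are all assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Optional

# An estimator maps the counts fixed so far to a lower bound for its own type.
Estimator = Callable[[Dict[str, int]], int]

def default_estimation_order(area_cost: Dict[str, float]) -> List[str]:
    """Resource types sorted from largest to smallest area cost."""
    return sorted(area_cost, key=lambda rtype: area_cost[rtype], reverse=True)

def integrated_estimate(
    order: List[str],
    estimators: Dict[str, Estimator],
    user_constraints: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Estimate each type in turn, conditioned on the types already fixed."""
    fixed = dict(user_constraints or {})
    for rtype in order:
        if rtype not in fixed:  # user-constrained types are kept as given
            fixed[rtype] = estimators[rtype](dict(fixed))
    return fixed

# Toy estimators for the Fig. 14 example (hypothetical stand-ins).
def toy_adder_estimator(fixed: Dict[str, int]) -> int:
    return 2 if fixed.get("register", 3) <= 2 else 1

def toy_register_estimator(fixed: Dict[str, int]) -> int:
    return 3 if fixed.get("adder", 2) <= 1 else 2

TOY = {"adder": toy_adder_estimator, "register": toy_register_estimator}
adders_first = integrated_estimate(["adder", "register"], TOY)
registers_first = integrated_estimate(["register", "adder"], TOY)
```

With adders estimated first the sketch returns one adder and three registers, and with registers first it returns two registers and two adders, so both results correspond to a realizable schedule. Passing `user_constraints={"register": 2}` mimics the "what if" usage in which the designer fixes one resource type beforehand.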
B. Experimental Setup
In order to validate our lower bound estimation system, we applied it to some well-known high-level synthesis benchmarks. The first example is the second-order differential equation solver [3], which consists of six multiplication operations, two additions, two subtractions, and one comparison. In this example, we assumed that multiplications are performed by multipliers, while additions, subtractions, and comparison are performed by ALU's. The second example is the fifth-order elliptic wave filter (EWF) [3], which consists of eight multiplications and 26 additions, and the third example is the AR filter [21], which consists of 16 multiplications and 12 additions. In the second and third examples, multiplications are assumed to be performed by multipliers and additions by adders. The fourth example is the discrete cosine transformer (DCT), consisting of 16 multiplications, 25 additions, and seven subtractions. The final example is the Jacobian transformer [23]. It is a large example consisting of 396 multiplications, 252 additions, 18 subtractions, and 72 memory accesses for table look-up. In the fourth and final examples, we assumed that multiplications are performed by multipliers while additions and subtractions are performed by ALU's.
Our delay and timing assumptions were as follows: the clock period is 20 ns, the maximum transfer delay between an FU and a register is 4.5 ns, and the chaining delay (transfer delay between two FU's when they are chained) is 0.2 ns. Some design information is specified in the tables. If not explicitly specified, the library in Table I is assumed to be used. We also assume that the area cost and delay of a pipelined multiplier are the same as those of a nonpipelined multiplier, respectively. The CPU time for each experiment ranges from 0.1 s to 1 min on a SUN 4 workstation.
C. Experiment 1-Effects of Refinement Steps
A first set of experiments was performed to determine the effect of the refinement steps. Table III shows estimation results with and without our refinement steps. As an example, when the performance constraint is set to 120 ns, one ALU and three multipliers are estimated without our refinement step for the differential equation circuit, but one more ALU is estimated with our FU refinement step. Moreover, one more register is estimated with the register refinement step.
Notation b denotes that the lower bound on the number of input buses is while that of the output buses is .
D. Experiment 2-Effects of Ordering on Estimation Results
A second set of experiments was performed to gauge the effect of changing the order of estimation on different examples. Since there are three classes of resources (FU, register, and bus), there can be six orderings of the estimation sequence. For each benchmark, we first set up an initial constraint on execution time. Next, we ran the system six times each with a different estimation order. This is repeated for different constraints on execution time. Tables IV-VIII summarize our estimation results on these benchmarks, and the effects of changing the ordering of estimation. In these tables, a character string consisting of "F," "R," and "B" implies the ordering of the estimations. For example, "FRB" means that FU's are estimated first, then registers with FU constraints, and finally buses with FU and register constraints. In these experiments, we assumed that all input and output data values are considered to be stored in registers.
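The six orderings can be enumerated mechanically. The short sketch below uses Python's standard `itertools.permutations`; the helper name `estimation_orderings` is an assumption introduced for illustration.

```python
from itertools import permutations
from typing import List

def estimation_orderings(classes: str = "FRB") -> List[str]:
    """All orderings of the resource classes, e.g. "FRB" = FU, register, bus."""
    return ["".join(p) for p in permutations(classes)]

orderings = estimation_orderings()
# Three resource classes give 3! = 6 estimation sequences.
```

Each string can then be fed to the estimator as one run, so that a benchmark under a fixed execution time constraint is evaluated once per ordering.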
As shown in these tables, the ordering of the estimations does affect the estimation results in many cases. Table IV, for example, shows that two adders, three multipliers, six registers, and four bidirectional buses are estimated if FU resources are estimated first, while one adder, four multipliers, seven registers, and three buses are estimated if buses are estimated first, when the maximum delay is 120 ns. This result implies that it takes a higher bus cost to implement the given behavior with less FU cost, or conversely, it may take a higher FU/register cost to minimize the bus cost. In contrast, in Table V, we obtained the same results regardless of the estimation order in most cases (except when the total performance is 420 ns), implying that this example may have fewer potential tradeoffs among FU's, registers, and buses. Using our approach, users can explore the design space by attempting different orderings and observing the results.
E. Experiment 3-Assessing the Quality of the Estimates
To demonstrate the quality of our estimation, we compared our results with actual designs obtained through conventional scheduling and allocation processes. We compared our results with actual designs generated by HAL [17], InSyn [18], ALPS [19], OASIC [12], ILP [20], and SALSA [22]. Tables IX-XIV show the comparison results. Note that some of these systems did not optimize any resources other than functional units, in many cases resulting in designs which may be suboptimal with respect to registers. In such cases, we attempted to improve the FU-optimal designs by generating some so-called "hand-optimized designs." These "hand-optimized designs" are obtained by generating several schedules that are optimal in the number of FU's using our scheduling system [11], and then choosing the one with the minimum number of registers. We generated such designs only when we concluded that designs reported in the literature optimized only FU's or we could not find any comparable designs published in the literature.
Since previous synthesis systems used a fixed ordering of allocation (i.e., FU's first, then registers, and buses finally), we applied our estimation in the same order to enable comparison. We note, however, that some synthesis systems do not report results on all three classes of resources. For example, ALPS has no experimental results on registers, while ILP has no results on buses. However, they are described here just for comparison with our estimation results on the remaining resources.
Overall, these experimental results clearly show that our estimates are quite accurate. Indeed, the FU estimates are exact in most cases. Furthermore, the estimated lower bounds on register and bus count are also quite accurate, as shown in Tables IX-XIV. In most cases, the lower bound on register (bus) cost is one or two units below the actual register (bus) count.
As an example, in Tables X and XI, the numbers of FU's of each type and registers in our estimation are the same as those of actual designs generated by OASIC and InSyn. This implies that, for this example, OASIC and InSyn generated designs which are optimal in terms of FU's and registers, respectively, and that our estimation is very accurate as well, especially given that no prior scheduling is assumed in our estimation. On the other hand, other systems, such as SALSA, do not optimize the number of registers, and hence the reported designs have more registers than what was predicted by the estimation.
In order to investigate the importance of considering registers as well as FU's in estimating design cost, we show in Figs. 16-21 some of the comparison results in graphical form. In these figures, one axis shows the total performance or maximum delay, and the other axis compares the area cost occupied by FU's and registers for the designs in Tables IX-XIV and the corresponding estimation results. In addition, "FU-only models" show the area estimates obtained by accounting for FU's only. These charts clearly show that estimates from the FU-only model are highly inaccurate, and thus verify the importance of register estimation.
X. CONCLUSION
In this paper, we motivated the need for a more comprehensive lower bound estimation algorithm that takes into account not only functional unit costs, but also the costs of storage and interconnects. We also motivated the need to account for a better characterization of time (in terms of real nanoseconds per clock), as opposed to traditional methods that only considered the number of clock cycles. Our estimates of hardware resource requirements are quite accurate, and they validate our approach for these examples. Our approach is integrated and flexible, and accounts for the dependencies in the ordering of the FU, register, and bus estimation. Thus, our system allows easy exploration of the design space, and evaluation of alternatives prior to committing to an architecture.
Currently, our system does not consider multiplexer and wiring area and delay in its estimation. In addition, it does not allow loops and branches in the input behavioral description. These topics and others will be addressed in future work.
