Abstract: This paper presents a new scheme for scheduling and control synthesis in high-level circuit design. The scheduling algorithm tries to maximize the performance of a design under resource constraints by maximizing the utilization of resources and minimizing clock slack. It exploits the technique of bit-level chaining (BLC) to target high-speed design. It also exploits noninteger multicycling and chaining, which allows multiple cycle execution of a set of chained operations and even sharing of chained functional units to obtain further performance at the cost of a small increase in the complexity of the control unit. Experimental results on several datapath-intensive designs show significant improvement in throughput over the conventional scheduling algorithms.
interval could adversely affect the performance of the synthesized design because such a choice could increase slack times wasted in control steps. To solve this problem, several techniques have been proposed for optimal selection of the control-step interval. In [5]-[7], the control-step interval is determined for a given control data flow graph (CDFG) prior to scheduling by exploring all possible values from lower bound to upper bound with a given quantization step. In [8]-[10], the optimal interval is determined after scheduling by adjusting the clock period for a scheduled design. The former approach can take a long time due to a large search space and can miss the optimum point since it may lie between two adjacent search points. The latter approach is unlikely to achieve the optimal interval due to a narrow search space limited by the prior scheduling. Furthermore, since such approaches are not tightly integrated into scheduling, they cannot consider the effect of multicycling and chaining, which may be exploited during scheduling. In fact, it is not optimal to determine the control-step interval either before or after scheduling.
Most scheduling techniques use a simple timing model that represents timing information, such as operator execution time, expressed in clock cycles [1], [3], [11]-[14]. Although such a model simplifies the scheduling task, the timing resolution is too low to be used for high-performance design. Many improved timing models that specify the timing information in absolute time have been proposed. Among these models, the worst-case delay model is commonly used to represent the timing behavior of operations [2], [15], [16]. With this model, operator chaining can be used to minimize the slack time in each control step. However, the model usually leads to overestimated delays for most operator chains since it ignores the fact that the worst-case delay of a whole operator chain is usually smaller than the sum of the worst-case delays of the operators in the chain. To overcome this problem, the concept of a complex functional unit, which represents a set of chained operations, has been introduced in the form of a cell library [7], [17]. The worst-case delay for each chained operation is calculated with detailed delay analysis. To benefit from these complex functional units, the scheduler tries to map them onto the operations in a CDFG. However, such techniques require many kinds of complex functional units to cover all possible chains, and as the set of complex functional units becomes larger, the computational complexity increases due to the larger search space. Moreover, the resulting designs tend to be large because such resources can rarely be shared by the operations [18].
B. Focus of the Paper
We can calculate the delay in chained operations accurately by using simple timing models and delay analyses. In this paper, we present such a model for delay calculation and show how to incorporate it into the proposed scheduling algorithm for datapath-intensive design. Basically, our approach searches for minimum latency under resource constraints through maximal utilization of resources by minimizing the waste of clock slack and applying the technique of bit-level chaining (BLC) [7], [17], [18].
0278-0070/01$10.00 © 2001 IEEE
We determine the control-step interval during scheduling. The scheduler does not assume that the control-step interval is fixed. It does not even assume that the intervals are identical (i.e., the intervals of control steps in the same design may differ from each other). Scheduling determines the intervals of each control step in addition to assigning operators to control steps. The clock period is determined after scheduling and the control-step intervals are quantized to become multiples of a clock period (or cycle time) for implementation with synchronous circuits. In this approach, each control step takes one or more clock cycles, unlike the conventional scheme where each control step takes just one clock cycle. The division and quantization process is carried out to reduce the waste of clock slack and results in high-speed clocking. However, this scheme tends to increase the number of states in the finite-state machine (FSM) that implements the control unit of the design. We also investigate this problem so that the control unit may nevertheless be synthesized efficiently.
C. Paper Organization
This paper reports work aimed at performance optimization for datapath-intensive designs through a novel scheduling and control synthesis approach. The motivation and basic idea are explained in Section II. Section III defines the new approach to HLS, which is different from conventional schemes in that scheduling is done with flexible control-step intervals and clock selection is done after scheduling. Section IV shows how BLC can be used for shortening latency. In Section V, we explain the details of our scheduling algorithm with an example. We present an optimization technique for control synthesis in Section VI. The target architecture of the high-level design is investigated from the viewpoint of high-performance design in Section VII. Section VIII presents experimental results and compares them with other approaches. Section IX concludes this paper.
II. MOTIVATION AND BASIC IDEAS
Scheduling in HLS plays a crucial role in determining the performance of synthesized hardware. As a performance measure, we can use the data introduction interval (sample period) when pipelining [12]-[14] is used. For a design without pipelining, we can use latency as the performance measure. Although our approach can be extended to a pipelined design, we focus on nonpipelined design in this paper and use latency as the performance measure. Given a CDFG, the minimum latency without resource constraints can be computed by summing up all delays of the operations on the critical path. In practical synchronous circuit implementation, however, the latency is determined by the clock period and the number of clock cycles, which can be much greater than the total delay of the critical path. Even the delay of the critical path may be greater than the minimum that we can achieve when we consider BLC because BLC enables a chained operation to start execution before the completion of the preceding operations.
A. Intercontrol-Step Imbalance
In general, the delays of functional units differ according to the operation type, bit size, implementation style, and so on. With a primitive scheduling technique, where one operation takes one control step, we must set the clock period to the largest operation delay, which causes a large slack to be wasted in control steps that contain only faster operations. To reduce the waste due to intercontrol-step imbalance, multicycling and chaining techniques are used during scheduling [7], [18] or after scheduling [9], [10]. These techniques effectively improve the overall performance. However, the improvement is limited since these techniques only allow multicycling over an integer number of cycles and the chaining of an integer number of operations. A technique for system clock optimization has been proposed to allow noninteger multicycling and chaining [8]. This technique is guaranteed to optimize the clock period under a given schedule. However, because the scheduling is done independently, it allows only a limited amount of improvement.
B. Intracontrol-Step Imbalance
Some of the operations in a control step may take longer than other operations in the same control step. Such imbalance keeps some functional resources idle until all the operations in the control step are completed. Consider the design shown in Fig. 1 , which performs two multiplications and three additions. Assuming that the adder delay is about two thirds of the multiplier delay , the operations are scheduled into three control steps using one multiplier and one adder, as shown in Fig. 1(b) , which gives the minimum number of control steps that can be achieved by conventional scheduling.
However, this schedule may not be the best in the sense that resources are not fully utilized. Since the time interval of each control step will be determined by the delay of the multiplier, the adder will have idle time at all control steps. Moreover, the multiplier is idle at the third control step. The design can be scheduled as shown in Fig. 1(c) , which takes four control steps (their intervals may not be the same) but utilizes the adder and multiplier without idle time. Through a system clock optimization process, the design can be scheduled in six clock cycles with a clock period of without time being wasted by clock slack. The execution time of the design can be shortened from to .
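The arithmetic behind this example can be checked with a short sketch. The delay values below are hypothetical (the multiplier delay is normalized to three time units, so the adder takes two), the control-step boundaries are assumed to fall at every release of a functional unit, and the clock period of one third of the multiplier delay is our own inference from the six clock cycles quoted above; none of these numbers is taken from Fig. 1 itself.

```python
from fractions import Fraction
from math import ceil

# Assumed, normalized delays (hypothetical values, not taken from Fig. 1).
D_MUL = Fraction(3)               # multiplier delay
D_ADD = Fraction(2, 3) * D_MUL    # adder delay: two thirds of the multiplier delay

# Conventional schedule: three control steps, each stretched to the delay
# of the slowest unit (the multiplier).
conventional_time = 3 * D_MUL

# Schedule without idle time: each unit runs back to back, so a resource is
# released after every multiplication (2 of them) and every addition (3).
releases = {k * D_MUL for k in (1, 2)} | {k * D_ADD for k in (1, 2, 3)}
boundaries = [Fraction(0)] + sorted(releases)
intervals = [b - a for a, b in zip(boundaries, boundaries[1:])]

# Assumed clock period: one third of the multiplier delay.
clock = D_MUL / 3
cycles = sum(ceil(i / clock) for i in intervals)
new_time = cycles * clock

print(len(intervals), "control steps,", cycles, "clock cycles")
print("execution time:", conventional_time, "->", new_time)
```

Under these assumptions the sketch yields four control steps of unequal length and six clock cycles with no clock slack, and the execution time drops from three multiplier delays to two.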
C. Bit-Level Chaining
Maximizing resource utilization can be considered as one of the scheduling goals. However, complete utilization may not be achieved due to data dependency. Data dependency prevents concurrent execution of operations and increases the idle time during which resources perform no valid computation, thereby slowing down the hardware even with abundant functional resources. For a high-throughput design, pipelining techniques can be used at the cost of increased hardware. BLC is another way of increasing throughput, which also exploits concurrent execution of operators but without increasing the hardware cost of the datapath. For example, the total delay of chained ripple-carry adders is less than the sum of their worst-case delays since the chained adders start execution before the completion of the preceding adders and the carry chains in those adders are concurrently executed. Even though chained operators have data dependency, they can be executed concurrently at the bit level.
III. PROBLEM FORMULATION
Research on HLS has mainly focused on synchronous designs, which form the mainstream of contemporary design practice. Scheduling is based on control steps, time is measured in clock cycles, and cycle time is determined by the worst-case delay of the operations executed in a control step [11] . That is, conventional schedulers assign operations having delays (specified in number of clock cycles) to control steps, each of which has an interval of one clock cycle where the clock cycle time is determined prior to scheduling. Such scheduling cannot avoid the unnecessary idle time of resources caused by the imbalances mentioned in the previous section. We need to consider multicycling and BLC simultaneously and determine the control-step interval and the clock cycle time during or after scheduling.
For a conventional synchronous design, a control step corresponds to a clock cycle. That is, the interval of a control step takes the time of one clock cycle and the number of control steps becomes the number of clock cycles. All control lines from the control unit to the datapath retain their values within a control step. Extending this concept, we consider a control step to cover an interval where no control lines change their values. However, the intervals of control steps need not be equal to the cycle time. As long as all the constraints, including resource constraints, are met, we can prolong the control steps. A control step can be arbitrarily long and the intervals of all control steps may not be identical. For realization as a synchronous system, we determine the clock period and quantize the control-step intervals with the clock period as the quantization step. Note that each control step may take more than one clock cycle.
A. Feasible Schedule
The time we maintain during scheduling is the absolute time rather than the number of clock cycles. Scheduling an operation in a CDFG requires the determination of an interval bounded by two absolute time points: start time and end time. Since an interval represents the execution time of a scheduled operation, it is obvious that two operations scheduled in overlapping intervals cannot share the same functional unit. Therefore, the maximum number of overlapping intervals must not exceed the resource constraint. With a given CDFG G(V, E), where V (set of nodes) and E (set of edges) represent the operations and the data dependencies respectively, a feasible schedule is defined as follows.
Definition 1 (Feasible Schedule): Given a graph , where each node has an assigned resource type and its own delay , a feasible schedule is an assignment of each node to a time point (the start time of ), satisfying the data dependencies represented by and satisfying the resource constraints represented by where if otherwise and is the number of allocated resources of type . Regarding the resource constraints, it is sufficient to check them at the start time of each node.
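As an illustration of Definition 1, the following minimal sketch checks a candidate schedule. It assumes the simplest reading of the definition: a successor may not start before its predecessor has finished (the relaxation offered by bit-level chaining is not modelled), and the resource constraint is checked only at the start time of each node, as the definition permits. The function name `is_feasible` and all argument names are hypothetical.

```python
def is_feasible(start, delay, rtype, edges, allocated):
    """Check a schedule against data dependencies and resource constraints.

    start, delay, rtype: dicts keyed by node name.
    edges: iterable of (predecessor, successor) pairs.
    allocated: dict mapping resource type to the number of allocated units.
    """
    # Data dependencies: no chaining overlap is modelled in this sketch.
    for u, v in edges:
        if start[v] < start[u] + delay[u]:
            return False
    # Resource constraints: it suffices to count, at each start time, the
    # operations of the same type whose execution interval covers that time.
    for v, t in start.items():
        active = sum(
            1 for w in start
            if rtype[w] == rtype[v] and start[w] <= t < start[w] + delay[w]
        )
        if active > allocated[rtype[v]]:
            return False
    return True


# Two additions sharing one adder (hypothetical delays of 2 time units).
delay = {"a1": 2, "a2": 2}
rtype = {"a1": "add", "a2": "add"}
print(is_feasible({"a1": 0, "a2": 2}, delay, rtype, [], {"add": 1}))
print(is_feasible({"a1": 0, "a2": 1}, delay, rtype, [], {"add": 1}))
```

The first schedule runs the additions back to back and is accepted; the second overlaps them on a single adder and is rejected.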
B. Optimal Scheduling
Scheduling can be performed without a predefined control-step interval. The start time of each operation is not mapped to a control step but to an arbitrary time while satisfying data dependencies and resource constraints. The boundaries of control steps are determined arbitrarily provided that the result satisfies the resource constraints: the number of resources activated in a control step cannot exceed the number of allocated resources. Given a CDFG and resource constraints, the scheduling objective is to find a feasible schedule with minimum latency. Assume that there are resource types and the number of resources of type is bounded from above by . Ignoring the clock selection and quantization effect, the scheduling problem can be defined as follows:
Definition 2 (Performance-Driven Scheduling [PDS] ): Given a graph and a set of resource upper bounds , the problem is to find a feasible schedule such that the latency given by is minimized, where and represent the total number of control steps and the interval of th control step, respectively, while satisfying the resource constraint given by where if or otherwise and is the start time of th control step ( ).
C. Clock Selection
Setting the clock period and quantizing the control-step intervals introduces additional clock slacks, increasing the latency. Given the control-step intervals, we determine the clock period in such a way that the total clock slack time is minimized. A minimum can be achieved simply by making the clock period infinitely short. However, shortening the clock period makes the control steps take many clock cycles, resulting in a large control logic and high power consumption. In any case, implementation may not be possible at such high speeds; we assume that the lower bound of the clock period is given. With this lower bound, we need to find an optimal clock period that minimizes the latency increase. Ties are broken by taking the longer clock period. The clock selection problem is defined as follows.
Definition 3 (Clock Selection): Given a set of control-step intervals , if we set the clock period to , then the total number of clock cycles is given by where the ceiling operation reflects the quantization effect. Due to this effect, the actual latency would increase beyond the ideal latency and is given by The clock selection problem is to determine a value of that minimizes subject to where represents the minimum clock period.
The definition above is based on the assumption that any clock period is acceptable provided that it is longer than the lower bound. If the clock period must fall into some specific range because of interaction with other systems, we need to constrain the clock period further. However, the basic concept still applies and the proposed method would work with slight modifications.
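The clock selection problem can be illustrated with a small exhaustive sketch. It is not the optimization procedure of this paper; it only relies on the observation that the quantized latency (clock period times the sum of the ceilings) grows linearly with the clock period between the points where some ceiling changes, so it is enough to try the lower bound itself and every period of the form interval/k that is not below the bound. The function name `select_clock` and the example numbers are hypothetical.

```python
from fractions import Fraction
from math import ceil


def select_clock(intervals, t_min):
    """Return (clock period, clock cycles, quantized latency).

    intervals: control-step intervals; t_min: lower bound of the clock period.
    Exact rational arithmetic avoids rounding errors in the ceiling operation.
    """
    intervals = [Fraction(i) for i in intervals]
    t_min = Fraction(t_min)

    # Candidate periods: the lower bound and every interval/k >= t_min.
    candidates = {t_min}
    for interval in intervals:
        k = 1
        while interval / k >= t_min:
            candidates.add(interval / k)
            k += 1

    def latency(period):
        return period * sum(ceil(i / period) for i in intervals)

    # Minimize latency; ties are broken by taking the longer clock period.
    best = min(candidates, key=lambda p: (latency(p), -p))
    cycles = sum(ceil(i / best) for i in intervals)
    return best, cycles, latency(best)


# Hypothetical control-step intervals with a clock-period lower bound of 3/4.
period, cycles, latency = select_clock([2, 1, 1, 2], Fraction(3, 4))
print(period, cycles, latency)
```

For these assumed intervals the ideal latency is 6, and a clock period of 1 reaches it with six clock cycles, whereas running at the lower bound of 3/4 would need ten cycles and a latency of 7.5.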
IV. BIT-LEVEL CHAINING
In general, the delay of a register-transfer-level (RTL) unit represents the critical path delay (i.e., the longest one among all pin-to-pin delays). However, some units exhibit very different pin-to-pin delays depending on the location of the pins in the input or output data words. Examples of such units are adders and multipliers. For a ripple carry adder, the value of the most significant bit (MSB) of the sum appears at the output later than the least significant bit (LSB) because the former may be affected by the computation results of the latter; this is called the rippling effect [19] .
Due to the existence of the rippling effect, outputs may be on time even when the inputs on the MSB side are delayed provided that the inputs on the LSB side are applied on time. Therefore, if two units exhibiting the rippling effect are directly connected in cascade, the execution of the pair may take less time than the sum of the worst-case delays of the units. To exploit such potential, we need to calculate delays at the bit level. Then, through BLC, we can find out the minimum execution time of chained units for a given CDFG.
A. Operator Delay Model
An example of BLC is shown in Fig. 2 . There are three operators " ," " ," and " ," which represent multipliers, subtracters, and adders, respectively. The number on the right of each operator symbol denotes the bit position. Each bit of the chained operators starts execution independently and instantly after the input value is ready. As a consequence, the timing boundary at the input and output of each operator looks like a staircase, which makes the execution of operators overlap as depicted by the gray area in Fig. 2 .
We can exploit execution time overlaps to shorten the total execution time. To allow BLC during the hardware synthesis process, we need a delay model that is more accurate than the simple worst-case delay model. The delay of a unit with rippling effects can be represented more accurately by the ripple delay and the bit delay of each bit [20] . Let us define the terminologies and notations used in our delay model.
• The bit delay of the th bit ( ) represents the computation time of the bit itself, which is the critical path delay from the input pin to the output pin corresponding to the bit position.
• The ripple delay of the th bit ( ) represents the delay caused by the rippling effect between two adjacent bits: the th bit and the ( )th bit.
• The input arrival time of the th bit ( ) represents the latest time when the th bit must start its computation if it is not to delay the output.
• The output departure time of the th bit ( ) represents the time when the th bit value appears on the output pin.
• The start time of an operator ( ) represents the time when the operator is scheduled to start its computation. That is (1)
• The relative input arrival time of the th bit ( ) represents the start time of the th bit relative to the start time of the operator. That is (2) where at least one of the times is zero.
• The relative output departure time of the th bit ( ) represents the time (relative to ) when the valid bit value appears on the output pin corresponding to the bit position.
• The worst-case delay of an operator ( ) represents the longest pin-to-pin delay. Throughout this paper, we assume that the bit-level scheduling of each operator is done correctly. That is (3)
• When the above notation is used for a specific operator , we add a superscript to the symbols.
The timing boundary of an operator can be calculated from the delay model. That is, the relative input arrival time and relative output departure time for each bit can be calculated using the ripple delay and bit delay. For example, the th bit relative output departure time of an operator is the sum of the ( )th bit relative output departure time and the ( )th bit ripple delay. Recursively, the time can be calculated for all bits by (4) provided that for all , which is true in most cases. The value of (the relative input arrival time of the th bit) required to achieve the relative output departure time can be calculated simply by (5) We can apply the same concept to functional units with different characteristics. For example, a comparator may exhibit rippling in the opposite direction and the staircase pattern may be reversed.
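The staircase-shaped timing boundary described above can be sketched as follows for an operator that ripples from LSB to MSB. The sketch assumes that the output of bit 0 departs one bit delay after the operator starts, that each further output departs one ripple delay after its lower neighbor, and that the relative input arrival time is the relative output departure time minus the bit delay. The function name, the indexing, and the delay values are our assumptions, since the corresponding equations are not reproduced here.

```python
def timing_boundary(bit_delay, ripple_delay):
    """Return (relative input arrival times, relative output departure
    times, worst-case delay) for an LSB-to-MSB rippling operator.

    bit_delay[i]: computation time of bit i.
    ripple_delay[i]: rippling delay between bit i and bit i + 1.
    """
    n = len(bit_delay)
    # Assumed base case: bit 0 departs one bit delay after the start.
    departure = [bit_delay[0]]
    for i in range(1, n):
        departure.append(departure[i - 1] + ripple_delay[i - 1])
    # Latest input arrival that does not delay the corresponding output.
    arrival = [departure[i] - bit_delay[i] for i in range(n)]
    # Worst case: the largest departure time among all bits.
    worst_case = max(departure)
    return arrival, departure, worst_case


# Hypothetical 4-bit ripple-carry adder: bit delay 2, ripple delay 1.
arrival, departure, worst_case = timing_boundary([2, 2, 2, 2], [1, 1, 1])
print(arrival, departure, worst_case)
```

With these assumed delays the input and output boundaries both rise by one ripple delay per bit position, and the input boundary starts at zero for the LSB.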
B. Delay Calculation
Most arithmetic operators in the CDFG exhibit rippling effects in the direction from LSB to MSB. Such operators take the relative output departure time of the MSB as the worst-case delay. However, rippling effects may not exist for bitwise operators or may exist in the reverse direction for comparison operators. In some cases, the relative input arrival times may not even increase or decrease monotonically along the bit positions. Therefore, we should take the largest time among the relative departure time of all bits as the worst-case delay of an operator.
The overlap interval of two chained operators should be obtained by examining all connections. For example, the overlap interval of chained operators and can be calculated as the following:
The total delay obtained by summing the worst-case delays of the chained operators can be shortened by as much as the length of the overlap interval. In this case, we note that the net delay of operator decreases from to (7)
Now, let us define the critical path length of a node in a CDFG to be the longest path delay from node to the end node. It can be calculated by accumulating the net delays of the operators on the path or can be calculated recursively by (8)
The total execution time of a CDFG can be obtained from (8) for all nodes following the data dependencies from the end node to the start node. The time obtained in this manner is the minimum delay that we can achieve and it can be much less than the sum of the worst-case delays of operators on the critical path.
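The overlap interval, the net delay, and the recursive critical path length can be sketched together. The sketch assumes that bit i of the producer feeds bit i of the consumer, that the consumer may start as early as possible provided no bit arrives after its required input arrival time, and that the critical path length of a node is its worst-case delay when it has no successor and otherwise the maximum over successors of its net delay plus the successor's critical path length. These are our reading of the text, and all names and values are hypothetical.

```python
from functools import lru_cache


def overlap_interval(departure_u, arrival_v):
    """Overlap of producer u with consumer v, examining all connections."""
    # Smallest start-time separation that keeps every connected bit on time.
    separation = max(out - arr for out, arr in zip(departure_u, arrival_v))
    return max(departure_u) - separation


def critical_path_lengths(succs, departure, arrival):
    """Critical path length of every node, computed from the end nodes."""
    @lru_cache(maxsize=None)
    def cpl(v):
        worst_case = max(departure[v])
        if not succs.get(v):
            return worst_case
        # Net delay toward successor s is the worst-case delay minus overlap.
        return max(
            worst_case - overlap_interval(departure[v], arrival[s]) + cpl(s)
            for s in succs[v]
        )
    return {v: cpl(v) for v in departure}


# Three chained 4-bit adders (bit delay 2, ripple delay 1; assumed values).
DEPARTURE = (2, 3, 4, 5)   # relative output departure times
ARRIVAL = (0, 1, 2, 3)     # relative input arrival times
succs = {"a": ["b"], "b": ["c"]}
departure = {n: DEPARTURE for n in "abc"}
arrival = {n: ARRIVAL for n in "abc"}
print(critical_path_lengths(succs, departure, arrival))
```

In this example each adder inserted in front of the chain adds only one bit delay (2) to the critical path length, so the three chained adders take 9 time units instead of the 15 obtained by summing their worst-case delays.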
C. Critical Path Identification
Let us take as an example the differential equation solver [3] , whose CDFG is shown in Fig. 3 . We will assume that ns, ns, and ns. The minimum execution time can be found by calculating the critical path length for all nodes. To simplify the explanation, we will also assume that the ripple delays for all operations in the CDFG are the same. Inserting an adder or a subtracter into a chain increases the total delay by only one bit delay because most of the execution time is overlapped. However, we assume that overlapped execution is not possible on the input side of a multiplier. 1 By calculation, we can easily identify a critical path from a node to the end node. The nodes in Fig. 3 are sorted and numbered in descending order of the critical path length. Two nodes having the same critical path length can be numbered arbitrarily. These numbers appear on the right of the operator symbols of the nodes in Fig. 3 . Using these numbers, it is easy to identify the total execution time and the critical path. That is, we can follow the critical path from the start node to the end node by repeatedly selecting the child node having the lowest number. For the example in Fig. 3 , the minimum execution time is the critical path length of node " 1" and the nodes on the critical path are " 1," " 4," " 7," and " 8." The result is similar to the result of as-late-as-possible (ALAP) scheduling performed without a fixed control-step interval. In the extreme, if sufficient resources are available, the design can be executed in one control step through full chaining. It is clear that the design cannot run faster than this without modification of the CDFG using techniques such as pipelining or behavioral transformation [21] , [22] .
V. PERFORMANCE-DRIVEN HIGH-LEVEL SYNTHESIS
Performance-driven HLS requires the solution of the PDS problem and also control synthesis. We solve the PDS problem assuming variable operation delay and variable control-step interval and targeting minimal latency exploiting the BLC. This section explains the proposed solution. Control synthesis will be explained in the next section.
A. Resource-Type Selection
The rippling effect varies according to the type of operation. Even for operations of the same type, the effect varies depending on the implementation. In other words, the overlap interval between chained operations changes according to resource-type selection, which selects a resource type for each node in a CDFG. Therefore, resource-type selection should be carried out prior to the delay calculation presented in the previous section.
1 The reason behind this assumption is that in a typical multiplier such as an array multiplier, critical path lengths from higher input bits to the output MSB are comparable to those from lower input bits [19]. Therefore, we must apply all the input bits simultaneously so as not to slow down the operation. This is true for one input of the multiplier. However, for the other input, we can still exploit overlapped execution.
An RTL unit can be realized in various structures with different costs and qualities. Some units' delays increase in proportion to the increase of the word length due to the rippling effect; so for a high-speed design we cannot avoid using units having small ripple delay, which are typically very expensive in area and power. However, as shown in Fig. 2 , if we apply BLC to such a design, we can effectively reduce the delay caused by the rippling effect down to one bit and, hence, avoid excessive use of expensive units. However, since each low-cost RTL unit executes slowly due to the large ripple delay, making the scheduler struggle with the lack of resources results in a much slower design. Therefore, low-cost RTL units should be used sparingly.
B. Performance-Driven Scheduling
After selection of resource types, we perform scheduling based on the delay information of the selected resources. Scheduling determines the time to start execution of each operation in a CDFG. Our scheduling algorithm, which we call continuous-time list scheduling (CTLS), is based on the list scheduling (LS) but is quite different from the conventional LS in several respects.
1) Scheduling Algorithm: LS is a well-known heuristic algorithm and has been widely used in HLS [3] , [4] , [23] . In the LS algorithm, beginning at the start node, we schedule ready operations whose predecessors are all scheduled. If the number of ready operations of one type exceeds the number of available functional units capable of performing that type of operation, then one or more operations with lower priority must be deferred. There are various ways of determining the priority; for example, we can use force [3] , mobility [4] , or path length [23] as the criterion.
The CTLS algorithm takes the critical path length as the priority criterion. However, critical path length is not computed in terms of a number of clock cycles but by accumulating the actual delays of the operations on the critical path. Note that we are deferring the decision on clock period until after scheduling. The CTLS algorithm is summarized as the following.
(1) Calculate the critical path lengths for all nodes starting from the end node.
(6) Find out the earliest time point at which any resource is released (i.e., completes the scheduled operation and becomes available for a new operation).
The results of the above algorithm are the time points at which the operations are scheduled and the time points at which control-step boundaries are placed. The intervals between adjacent control-step boundaries can be very different, which incurs a large waste of clock slack. We can avoid this problem by assigning multiple clock cycles to a control step. A control step with a longer interval may be assigned more clock cycles. The clock cycle assignment problem is solved through the system clock optimization process described in the next section.
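To give a feel for scheduling in continuous time, the following sketch implements a strongly simplified list scheduler. It is not the CTLS algorithm itself: the intermediate steps of the algorithm are not reproduced here, and bit-level chaining is left out, so an operation becomes ready only when all of its predecessors have finished. What the sketch keeps are the two ingredients named in the text: the priority is the critical path length in absolute time, and time advances to the earliest point at which a resource is released, where a control-step boundary is placed. The function name, the graph, and the delays are hypothetical.

```python
from collections import defaultdict


def ctls_sketch(delay, rtype, succs, capacity):
    """Simplified continuous-time list scheduling without chaining.

    Returns (start times, control-step boundaries)."""
    preds = defaultdict(set)
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)

    # Priority: critical path length accumulated from actual delays.
    cpl = {}

    def path_length(v):
        if v not in cpl:
            tail = max((path_length(s) for s in succs.get(v, ())), default=0)
            cpl[v] = delay[v] + tail
        return cpl[v]

    for v in delay:
        path_length(v)

    start, end = {}, {}
    busy = defaultdict(list)   # resource type -> end times of running ops
    now = 0
    while True:
        # Release resources whose operations have completed by `now`.
        for k in busy:
            busy[k] = [t for t in busy[k] if t > now]
        # Ready: unscheduled operations whose predecessors have all finished.
        ready = [v for v in delay if v not in start
                 and all(p in end and end[p] <= now for p in preds[v])]
        for v in sorted(ready, key=lambda v: (-cpl[v], v)):
            k = rtype[v]
            if len(busy[k]) < capacity[k]:
                start[v], end[v] = now, now + delay[v]
                busy[k].append(end[v])
        if len(start) == len(delay):
            break
        pending = [t for ts in busy.values() for t in ts]
        if not pending:
            raise ValueError("no operation can be scheduled")
        # Advance to the earliest time point at which a resource is released.
        now = min(pending)
    # A control-step boundary is placed at every resource release.
    boundaries = sorted({0, *end.values()})
    return start, boundaries


# Hypothetical graph: one multiplier (delay 3) and one adder (delay 2).
delay = {"m1": 3, "m2": 3, "a1": 2, "a2": 2, "a3": 2}
rtype = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add", "a3": "add"}
succs = {"m1": ["m2", "a3"], "a1": ["a2"], "a2": ["a3"]}
start, boundaries = ctls_sketch(delay, rtype, succs, {"mul": 1, "add": 1})
print(start, boundaries)
```

In this small example both units are kept busy from time 0 to time 6, and the control-step boundaries fall at unequal distances, which is what the subsequent clock selection step has to quantize.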
2) Scheduling Example: Fig. 4 shows a scheduling result for our illustrative example with the constraint of two multipliers, one adder, and one subtracter. If resource constraints are given, the total execution time may appear to increase over the critical path delay. In our example, the execution time increases considerably from to . The result is obtained by applying the CTLS algorithm with the critical path lengths as the priority.
Let us explain the details of the scheduling process with this example. The iterations of the loop in the CTLS algorithm are captured in the table shown in Fig. 5 . Each node in the CDFG is scheduled as soon as the node becomes ready for scheduling and the corresponding resource is available. Initially, there are five ready operations: " 1," " 2," " 3," " ," and "+10." Control-step 0 fully utilizes the two multipliers for the operations " 1" and " 2" and the adder for "+10," but the subtracter has idle time in this control step because no node of this operation type is ready. Control-step 1 begins instantly at the end time of the add operation and the adder becomes available. But the adder takes idle time because there is no ready operation of the corresponding type. Control-step 2 begins at the end time of operations " 1" and " 2," releasing the two multipliers and instantly activating them again for ready operations " 3" and " 4." The subtracter is also activated for operation " 7." Note that the subtract operation is chained at the bit level and can be scheduled at the same control step as operation " 4." Control-step 3 ends at the end time of the subtract operation and the subtracter becomes available. In control-step 4, the adder and the subtracter are used for the chaining between " 5" and " 8" and between " 6" and "+9."
C. Remarks on Scheduling Algorithms
The CTLS algorithm has several notable points that are different from the conventional LS algorithm.
1) Priority: The CTLS algorithm takes the critical path length as the priority of an operation. Since the critical path length is obtained by adding up the operation delays in the continuous-time domain while considering BLC, it is a more appropriate priority for our purpose than the sum of simple quantized operation delays, which is usually employed by the conventional algorithm.
2) Ready List: Our algorithm keeps the ready list (i.e., the pool of operations ready to be scheduled) maximal. That is, the algorithm updates the ready list every time an operation within a control step is scheduled, while the conventional algorithm updates the ready list only when a new control step starts. Our approach avoids deferring critical operations and allows maximal utilization of chaining.
The example illustrated in Fig. 6 shows how our algorithm maintains the ready list and selects operations for scheduling and chaining. Given part of a CDFG, as shown in Fig. 6(a) , we obtain Fig. 6(b) with the conventional LS algorithm when one multiplier and one adder are available. We schedule operation " 3" in control-step and operation " 2" in control-step even though the priority of " 2" is higher than that of " 3." However, with the proposed algorithm, we obtain Fig. 6 (c) because operation " 2" becomes ready immediately after the scheduling of operation " 1" while we are still in control-step . Scheduling operation " 2" in the same control step as operation " 1" implies that they are chained.
3) Control-Step Boundary: The proposed algorithm places the boundary of a control step at the earliest time point where the activated resources are released (i.e., newly available). This minimizes the idle time of the resources by reactivating the released resources as soon as possible without waiting for the release of any other activated resources. Note that resource sharing between operations is possible only when the operations are executed at different control steps. Therefore, for a released resource to be used for another operation, we must start a new control step.
4) Noninteger Multicycling and Chaining:
Our scheme allows noninteger multicycling and chaining. By noninteger multicycling, we mean assigning a noninteger number of control steps to an operation. By noninteger chaining, we mean chaining a noninteger number of operations within a control step. These concepts together can be interpreted as multicycling of chained operations. In the example of Fig. 4, operation " 7" is chained with operation " 4" and placed across control-steps 2 and 3. Operation " 7" takes the whole interval of control-step 3 but takes only a part of control-step 2; this may be regarded as noninteger multicycling. In control-step 2, operation " 4" and a part of operation " 7" are chained, which may be regarded as noninteger chaining.
5) Sharing of Chained Resources:
The proposed scheduling enables even the sharing of individual resources in a chain. In Fig. 4 , for example, multiplier " a" is shared by operations " 1," " 3," and " 5." Note that operation " 5" is chained with operation " 8." In a conventional design, chained resources are not shared because such sharing may cause complicated routing and additional interconnect delay. However, in some cases, sharing drastically reduces the required number of functional resources and thereby reduces the area, which results in reduced routing complexity and interconnect delay. Note also that the multiplier " b," which is scheduled for the operation " 4" in control-step 2, is chained with the subtracter " a" for the operation " 7" but the same multiplier is also scheduled and activated for the operation " 6" in control-step 3 before the end of the chained operation " 7." The new input applied to the multiplier of " 6" may disturb the result of " 7." Such problems are resolved by connecting the chained resources by means of a latch. In our example, we put a latch between the multiplier and the subtracter. The latch holds the last value of the multiplier's output (result of " 4") until the end of the chained operation (" 7"). The control unit enables the latch during control-step 2 and disables it at the beginning of control-step 3. The latch is enabled again in control-step 4 for operation " 8."
D. Maximal Resource Utilization
The proposed scheduling addresses the objective of minimizing execution time under resource constraints; this goal can be achieved by improving resource utilization. The utilization factor $U_r$ of a resource $r$ is given by the fraction of the total operation time of $r$ over the total design execution time $T$. That is,
$$U_r = \frac{d_r \, n_r}{T} \tag{9}$$
where $d_r$ represents the delay of $r$ and $n_r$ is the activation count of $r$ during $T$.
Example 1: Consider the design shown in Fig. 3 . Let us assume a schedule with latency of 400 ns and that the resources available are a multiplier, a subtracter, and an adder having delays of 60, 26, and 24 ns, respectively. In the CDFG, there are ten operations in total, including six multiplications, two subtractions, and two additions. By applying (9), we can obtain the utilization factors of the resources, which are 0.90, 0.13, and 0.12, respectively.
From the example above, we see that the multiplier is utilized much more than the other resources. It is evident that the scheduler struggles with the lack of multipliers. We would expect the latency to decrease if more multipliers were available. The schedule with two multipliers is shown in Fig. 4 , which has a latency of 188 ns. In this case, the utilization factors become 0.96, 0.28, and 0.26, respectively. We see that the scheduler still struggles with the lack of multipliers and expect that allocating more multipliers would further decrease the latency.
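The utilization factors quoted in Example 1 and in the two-multiplier schedule can be reproduced with a short Python sketch. It assumes that the utilization of a resource type with several instances is averaged over the instances; the function and variable names are illustrative only.

```python
def utilization(delay_ns, activations, latency_ns, instances=1):
    """Busy time of a resource type divided by its total available time.

    Averaging over `instances` is an assumption made for this sketch.
    """
    return delay_ns * activations / (instances * latency_ns)


delays = {"multiplier": 60, "subtracter": 26, "adder": 24}    # ns
op_counts = {"multiplier": 6, "subtracter": 2, "adder": 2}

# One unit of each type, latency 400 ns (Example 1).
one_mult = {r: round(utilization(delays[r], op_counts[r], 400), 2)
            for r in delays}

# Two multipliers, latency 188 ns (schedule of Fig. 4).
two_mult = {r: round(utilization(delays[r], op_counts[r], 188,
                                 instances=2 if r == "multiplier" else 1), 2)
            for r in delays}
```

Here `one_mult` evaluates to 0.90, 0.13, and 0.12 and `two_mult` to 0.96, 0.28, and 0.26 for the multiplier, subtracter, and adder, matching the figures in the text.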
The problem of scheduling with latency constraints can be solved by iteratively performing resource-constrained scheduling. That is, we can find a solution with minimal resource cost satisfying the latency constraints by solving a resource-constrained scheduling problem iteratively with resource allocation varying from minimal cost to maximal cost until the latency constraints are satisfied. The resource utilization factor can be used to determine which type of resource to add to the set of allocated resources. Applying this heuristic to the above example, we add one more multiplier because that is the type of resource with the highest utilization factor.
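The iterative allocation heuristic can be sketched as follows. This is only an illustration under stated assumptions: the scheduler is replaced by a hypothetical `latency_of` callback that returns the two latencies reported for our example (400 ns with one multiplier, 188 ns with two), and the `max_units` cap and the 200-ns constraint are invented for the demonstration.

```python
def allocate(latency_of, delays, op_counts, allocation, max_latency, max_units=8):
    """Add units of the most utilized resource type until the latency fits."""
    alloc = dict(allocation)
    while True:
        latency = latency_of(alloc)          # resource-constrained scheduling
        if latency <= max_latency:
            return alloc, latency
        # Utilization of each type, averaged over its allocated instances.
        util = {r: delays[r] * op_counts[r] / (alloc[r] * latency) for r in alloc}
        busiest = max(util, key=util.get)
        if alloc[busiest] >= max_units:
            raise ValueError("latency constraint not met within the unit limit")
        alloc[busiest] += 1


delays = {"multiplier": 60, "subtracter": 26, "adder": 24}    # ns
op_counts = {"multiplier": 6, "subtracter": 2, "adder": 2}

# Stand-in for the scheduler: latencies reported in the text, by multiplier count.
known_latency = {1: 400, 2: 188}


def latency_of(alloc):
    return known_latency[alloc["multiplier"]]


final_alloc, final_latency = allocate(
    latency_of, delays, op_counts,
    {"multiplier": 1, "subtracter": 1, "adder": 1},
    max_latency=200,
)
```

Starting from one unit of each type, the loop adds a second multiplier and stops at a latency of 188 ns.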
VI. OPTIMAL CONTROL SYNTHESIS
The goal of control synthesis is to synthesize a control unit with minimum waste of clock slack and a minimum number of control steps. We minimize the wasted clock slack in order to minimize the latency; this is achieved by optimizing the clock period. The number of control steps is also minimized to keep the control logic simple.
A. Minimization of the Number of Control Steps
CTLS generates a new control step at the completion time of an operation if no further chaining is possible. Therefore, in the worst case, the number of control steps can be increased up to the number of operations in a CDFG. Since a control unit changes the control-signal values at the transition to the next control step, the complexity of the control unit depends on the number of control steps.
In some cases, we can merge two adjacent control steps, thereby simplifying the control unit. Consider the example in Fig. 4 . At control-step 0, the control unit activates units " a," " b," and "+a" for operations " 1," " 2," and "+10," respectively. We use the notation " a_1," " b_2," "+a_10" to represent the set of control signals required at this control step. At control-step 1, the control unit is still activating units " a" and " b" for operations " 1" and " 2." The necessary control signals are represented by " a_1," " b_2" , which is a subset of the set of control-step 0. In this case, we can retain the set of control signals of control-step 0 in control-step 1 without affecting functionality. Therefore, we merge the two control steps into one and reduce the number of control steps from five to four. With this simple process, only adjacent control steps can be merged. However, it is very effective because CTLS tends to generate many control steps that can be merged in this way.
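The merging rule can be sketched in Python as below. The sketch assumes that the subset test on control-signal sets is the only merging condition. The signal names are placeholders, the 24/36-ns split of the first 60 ns is inferred from the 24-ns adder delay of Example 1, and the third step is hypothetical.

```python
def merge_control_steps(steps):
    """Merge each step into its predecessor when its signals are a subset.

    steps: list of (interval_ns, set_of_control_signals) in schedule order.
    The merged step keeps the signal set of the earlier step.
    """
    merged = []
    for interval, signals in steps:
        if merged and set(signals) <= merged[-1][1]:
            prev_interval, prev_signals = merged[-1]
            merged[-1] = (prev_interval + interval, prev_signals)
        else:
            merged.append((interval, set(signals)))
    return merged


steps = [
    (24, {"mul_a_1", "mul_b_2", "add_a_10"}),   # control-step 0
    (36, {"mul_a_1", "mul_b_2"}),               # control-step 1, subset of step 0
    (60, {"mul_a_3", "mul_b_4"}),               # hypothetical later step
]
merged = merge_control_steps(steps)
```

The first two steps collapse into a single 60-ns step that retains the control signals of control-step 0, while the third step is left untouched.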
B. Minimization of Clock Slacks
The latency of a design is determined by the product of the clock period and the cycle count. However, for a design scheduled without a fixed clock period, as in our case, the control steps have nonuniform intervals. In this situation, if the clock period is set to the largest interval, a lot of slack time will be wasted. For the example shown in Fig. 4, the four control steps have delays of 60, 60, 8, and 60 ns, respectively. With a 60-ns clock period and four clock cycles, the total latency is 240 ns. This implementation wastes a significant amount of time because of the large slack in the third control step. In contrast, if the clock period is set to 10 ns, the latency is reduced to 190 ns with 19 clock cycles, wasting only 2 ns of slack.
For our example, the change of latency and cycle count (taking into account the change of clock period) is shown in Fig. 7 . The latency changes abruptly at points where the cycle count changes and decreases monotonically as the clock period decreases within the flat ranges of the cycle count. The optima coincide with the points where the cycle count changes. Therefore, it is sufficient to check these points to find the optimum clock period for the minimum latency.
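The dependence of latency on the clock period can be made concrete with a small Python sketch. It assumes that every control step is rounded up to a whole number of clock cycles; the function name is illustrative.

```python
import math


def latency(intervals_ns, clock_ns):
    """Return (cycle count, latency, total slack) for a given clock period."""
    cycles = sum(math.ceil(interval / clock_ns) for interval in intervals_ns)
    total = cycles * clock_ns
    return cycles, total, total - sum(intervals_ns)


intervals = [60, 60, 8, 60]          # control-step intervals of Fig. 4, in ns
at_60ns = latency(intervals, 60)     # 4 cycles, 240 ns, 52 ns of slack
at_10ns = latency(intervals, 10)     # 19 cycles, 190 ns, 2 ns of slack
```

Between two points where the cycle count changes, the cycle count is constant, so the latency falls linearly with the clock period, which is why only those points need to be examined.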
C. Equalization of Control-Step Intervals
Before we start the optimization process, we must try to equalize the control-step intervals. The boundary between control-steps 3 and 4 in the example of Fig. 4 can be moved down until it hits the start point of operation " 8," leaving the resource usage and the latency intact. We move down the boundary such that the intervals of control-steps 3 and 4 are equalized. The resulting intervals of the four control steps are 60, 60, 34, and 34 ns, respectively. By moving this boundary, we can reduce the waste of clock slack time in control-step 3. However, in some cases, this process makes the results worse. Making the interval of a control step a multiple of that of another control step may help in obtaining the optimum solution. In our current implementation, we equalize the intervals as much as possible since this strategy helps to accelerate convergence of the process. 
D. Search Process of Optimal Clock Period
As shown in Fig. 7 , latency converges to the minimum value as the clock period decreases. We start the optimization process from the maximum control-step interval (60 ns in our example). We compute the clock slack for each control step by subtracting the control-step interval from the clock period. If the sum of the clock slacks computed for the clock period is zero, then the process is complete. Otherwise, we decrease the clock period to the next optimal point, compute the clock slacks, and check the sum of clock slacks again. This process continues until the zero slack point is found or the preset lower bound of the clock period is reached. In the latter case, we select the clock period that gives the shortest slack among the optimal points obtained so far.
The search process for our example is shown in Fig. 8. The first column represents the change of cycle count, the second column represents the intervals of control steps divided by integers, and the third and fourth columns represent the total latency and the sum of clock slacks, respectively. The initial cycle count before optimization is the number of control steps obtained from earlier scheduling and the initial clock period is the largest interval, which is 60 ns as shown in boldface in Fig. 8. The subscript is the integer that divides the control-step interval. We decrease the clock period gradually by dividing the intervals with increasingly large integers and checking the sum of clock slacks. In the second iteration, we divide the interval of 60 ns by two to obtain 30 ns. However, since that is smaller than 34 ns, we take 34 ns as the new clock period. Now each of the first and the second control steps takes two clock cycles and the total number of cycles becomes six.
As we continue the division process, the clock frequency increases and the total latency tends to decrease. However, the clock speed cannot be set arbitrarily high because of the clock skew problem and the latency of the control unit. In our example, we set the upper bound of the clock speed to 100 MHz and the division process stops at the clock period of 11.4 ns. We select 12 ns as the clock period because it gives the lowest waste of slack time.
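The search can be approximated by the following Python sketch. It is a simplification, not the table-driven procedure of Fig. 8: it enumerates every candidate period of the form interval/k that is no smaller than the lower bound and keeps the one with the smallest latency, without the early exit at zero slack. Exact rational arithmetic is used to avoid rounding errors at the candidate points; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil


def best_clock_period(intervals_ns, min_period_ns):
    """Return (period, latency, cycles) minimizing latency over interval/k."""
    candidates = set()
    for interval in intervals_ns:
        k = 1
        while Fraction(interval, k) >= min_period_ns:
            candidates.add(Fraction(interval, k))
            k += 1

    best = None
    for period in sorted(candidates, reverse=True):
        cycles = sum(ceil(Fraction(i) / period) for i in intervals_ns)
        total = cycles * period
        if best is None or total < best[1]:
            best = (period, total, cycles)
    return best


# Equalized intervals of our example; 100 MHz corresponds to a 10-ns lower bound.
period, total_latency, cycles = best_clock_period([60, 60, 34, 34], 10)
slack = total_latency - sum([60, 60, 34, 34])
```

For these intervals the sketch selects a 12-ns period with 16 cycles (divisors 5, 5, 3, and 3), a latency of 192 ns, and 4 ns of total slack.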
During the clock optimization process, the number of clock cycles increases as the divisor increases. The total number of clock cycles is exactly the same as the sum of the divisors in the selected row of the table. The increase in the number of clock cycles produces a corresponding increase in the number of states of the FSM implemented in the control unit. Note, however, that the control-signal values are not changed by the optimization process. That is, the states originating from a control step have control signals with the same values. This implies a minimal increase in the complexity of the control unit.
Power consumption also increases due to the increased number of clock cycles. However, because the control signals change their values in the same way as before optimization, power consumption in the datapath remains the same, except for the clock lines. Moreover, we can trade off the improved performance against power reduction through voltage scaling, which is the most effective technique for power reduction [24].
E. Computational Complexity
The clock period optimization problem can be simply formulated as finding the greatest common divisor (GCD) of a given set of values. However, the values may not be integers, or the GCD may fall below the given lower bound of the clock period, so we take a different approach. In our optimization algorithm, the computational complexity is proportional to the size of the search table. The number of columns of the table increases as the number of control steps increases, and the number of rows increases as the intervals taken by control steps increase and the lower bound of the clock period decreases. If $T_{\max}$ denotes the maximum interval of the control steps and $T_{\min}$ denotes the lower bound of the clock period, the size of the table, and hence the computational complexity of our algorithm, grows with the number of control steps and with $T_{\max}/T_{\min}$.
VII. TARGET ARCHITECTURE
Since the area, performance, and power consumption of the synthesized design vary depending on its architecture, we need to consider the target architecture for a precise evaluation of synthesis algorithms. In this section, we examine the proposed approach from the architectural point of view.
A. Configuration of Registers
There are several ways of configuring registers. One way is to centralize them as global registers, which are treated as independent modules like a functional unit and connected to other modules through buses [3] , [5] , [25] . Another way is to use distributed local registers, which are placed near the functional units and connected to them through abutting or local wires [26] , [27] . The most important difference between the two configurations is in data transfer. For one operation, the global register model requires two bus transfers, whereas the local register model takes only one. In general, buses exhibit a large propagation delay and consume a lot of power and, consequently, the local-register model is preferred for a high-performance design although it is less area efficient.
The architecture with local registers that we adopt is shown in Fig. 9 . Each functional unit has local registers and is connected to the registers with local wires rather than buses. A data transfer path is configured by controlling the tristate buffers and register files. The bus interconnection is used only for the data transfer from a functional unit output to a register input. To support operator chaining, we need to build links from the outputs of functional units to the inputs of other units without passing through registers. Furthermore, for the sharing of a chained operator, we need a level-sensitive latch in the link.
B. Considerations of Delay
The performance of a design is affected by the delay of registers and interconnects (including buffers and multiplexors), which must be considered in improving a design, as well as the delay of functional units. However, since these delay values are not known prior to scheduling, we cannot add them into the calculation of the critical path lengths. One possible approach is to consider delays after the layout is obtained. We can extract the delay values from the layout, annotate them to the scheduled CDFG, and perform the clock selection again. This process may change the clock period. However, the datapath does not change and only the control unit is subject to slight change.
As the feature size decreases toward deep submicrometer, the interconnect tends to have the property of a transmission line. Therefore, the interconnect delay is affected by the wire length more dominantly than by the loading capacitance [28] [29] [30]. Assuming that interconnect delay is proportional to wire length, we can roughly estimate the amount that the interconnect delay will increase with an increase in total area. For a design of area $A$, we assume that the average bus wire length is proportional to $\sqrt{A}$ and, therefore, the average interconnect delay for a bus is also proportional to $\sqrt{A}$. We further assume that the average interconnect delay per operation is proportional to the average number of buses per operation. Therefore, the average interconnect delay per operation may be estimated roughly as
$$D_{\mathrm{int}} = k \, N_b \sqrt{A} \tag{10}$$
where $k$ is a constant and $N_b$ is the average number of buses per operation. The value of $k$ can be different depending on the architecture or layout design style.
VIII. EXPERIMENTAL RESULTS
To evaluate the proposed scheme, we have experimented with designs selected from the examples provided with the HYPER system [16], [24] and HLS benchmarks [31], [32]. For RTL components, we have used the 2.0-μm dpp library resident in LagerIV [33]. The designs include several Avenhaus IIR filters of different structures (wdf, cascade, and parallel), an eleventh-order FIR filter (fir11), a seventh-order IIR filter (iir7), a fifth-order elliptic filter (elliptic), a decimate-by-four filter (decby4), a filter for noise canceling (noisecancel), other digital filters (volterra and lattice), a differential equation solver (diffeq), and two transformation units (wavelet and dct).
Table I shows the details of CDFGs for the designs, including the total number of nodes and edges in the graph, the number of nodes on the critical path from the start node, and the critical path length calculated with BLC. Note that the critical path length is the lower bound of the execution time (i.e., the design cannot run faster than this delay).
Table II shows the results of the scheduling and clock period optimization. The results of the proposed approach are compared with those from conventional LS [3], [23]. We have applied our proposed approach both as CTLS with BLC and as CTLS alone. Both give significant latency reduction (or throughput improvement) in all examples compared to the conventional approach. Fig. 10 shows the relative contributions of CTLS and CTLS with BLC to the throughput improvement for each of the benchmarks. CTLS seems to make the dominant impact on overall improvement but BLC also contributes significantly. Note that the results in Fig. 10 have been obtained with restricted resources. If we increase the resources, the impact of BLC increases more rapidly. Fig. 11 shows the throughput change of a design (elliptic), with throughput normalized by the maximum throughput that can be obtained. We can see that the impact of BLC increases with resource availability. The reason behind this is the increased scope for BLC with increased resources.
A. Cost of Registers and Buses
Table II also compares the number of registers and buses. CTLS tries to activate the maximum number of functional units in a control step for higher resource utilization. This increases the number of concurrent data transfers and consequently increases the number of buses. Since the number of buses required for a design is determined by the maximum number of data transfers in a control step, the number of buses is roughly proportional to the maximum number of units activated in a control step. Throughout the experiments, we have observed that while LS tends to have a very different number of activated units depending on the control step, CTLS tends to have a relatively uniform distribution over the control steps. This is because CTLS packs operations more densely in each control step. Such uniform distribution helps minimize the increase of the number of buses in a real design.
CTLS also tends to increase the number of registers due to operator chaining, which increases the number of live variables in a control step. Again, we have observed that CTLS tends to have a relatively uniform distribution of the number of live variables across control steps, which helps minimize the increase in the number of registers.
B. Cost of the Control Unit
In contrast to LS, where the clock-cycle count equals the number of control steps, the proposed approach tends to have a cycle count much greater than the number of control steps. Moreover, even the number of control steps may be greater than that of LS. The increase in the number of control steps and clock cycles increases the complexity of the control unit. In the worst case, the number of control steps can be as large as the number of nodes in a given CDFG. In our experiments, however, we obtained a 7% decrease in the number of control steps, a 62% increase in the number of clock cycles (states) on average, and a 200% increase in the number of clock cycles in the worst case, which corresponds to two more flip-flops in an FSM implementation [34] [35] [36]. The decrease in the average number of control steps is mainly due to extensive chaining and control-step merging.
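The relation between the state count and the number of flip-flops can be checked with a short Python sketch. It assumes a binary-encoded FSM, which the text does not specify, and the starting state counts are hypothetical.

```python
def flip_flops(states):
    """Flip-flops needed to binary-encode the given number of states."""
    return max(1, (states - 1).bit_length())


def extra_flip_flops(states_before, increase_percent):
    """Additional flip-flops after the state count grows by a percentage."""
    states_after = states_before * (100 + increase_percent) // 100
    return flip_flops(states_after) - flip_flops(states_before)


worst_case_from_4 = extra_flip_flops(4, 200)   # 4 -> 12 states
worst_case_from_5 = extra_flip_flops(5, 200)   # 5 -> 15 states
```

Tripling the state count adds at most two flip-flops under this encoding (two when going from 4 to 12 states, one when going from 5 to 15), so the growth of the state register is modest.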
C. Comparison with the Template-Mapping Technique
The proposed scheme is similar to the template-mapping technique [7] in that both techniques improve the throughput by using BLC and system clock optimization. They give comparable performance improvement because the main source of improvement in both approaches is the extensive application of BLC and the exploration of the optimal clock period.
However, the two approaches differ in that CTLS with BLC explores the optimum clock period after scheduling, whereas template mapping carries out scheduling iteratively for all clock periods from lower to upper bound with a given granularity. To obtain good results, template mapping also requires a large number of templates, which causes a rapid increase in its computational complexity. Another difference is that CTLS with BLC tries to minimize the sum of control-step intervals while template mapping aims to reduce the critical path delay of a given CDFG by replacing a set of original nodes with a more complex node with built-in BLC. This difference causes significantly different scheduling results from the viewpoint of design size. The template-mapping technique pays a large area penalty because it uses specialized hardware units that can be shared by few operations in a CDFG. Template mapping requires up to 225% more functional resources and up to 782% more interconnect resources in the worst case and, on average, 66% more functional resources and 64% more interconnect resources [7]. In contrast, the proposed approach does not increase the requirement for functional resources but increases the number of registers and buses by an average of only 7.5% and 21%, respectively.
It is reported that the template-mapping technique decreases the number of nodes in a CDFG by 23% on average due to chaining, implying that the number of buses decreases by that amount. However, since the technique generates large-area designs, the wire length may increase. Applying (10) and considering only the active area increase of 66%, we obtain almost no change in the interconnect delay per operation.
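The estimate above can be reproduced numerically. The sketch below assumes that (10) scales the interconnect delay per operation with the number of buses per operation and with the square root of the area; the function name is illustrative.

```python
from math import sqrt


def relative_interconnect_delay(area_ratio, bus_ratio):
    """Ratio of new to old interconnect delay per operation.

    Assumes delay per operation is proportional to (buses) * sqrt(area).
    """
    return bus_ratio * sqrt(area_ratio)


# 66% more area and 23% fewer buses, as reported for template mapping.
change = relative_interconnect_delay(1.66, 0.77)
```

The ratio comes out at about 0.99, so the longer wires roughly cancel the benefit of having fewer buses.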
D. Comparison with Context-Sensitive Scheduling
Context-Sensitive Scheduling (ContSched) [18] also exploits BLC to improve performance. It is similar to our approach in that scheduling is carried out with resource constraints. However, the important difference is that ContSched requires a fixed clock period to be specified by the user. There is no exploration of the system clock period after scheduling. We have experimented with the use of ContSched in the fifth-order elliptic filter example already presented, under the same BLC condition. Table III contains the experimental results for the three cases of resource constraints, showing that CTLS and CTLS with BLC are superior to ContSched.
Table III also shows the utilization of adders and multipliers and the ratio of the actual to maximum throughput. ContSched ignores the fact that the scheduling results are sensitive to the given clock period and, therefore, wastes time and leaves resources idle. 
E. Layout Implementation
To see the deep submicrometer effect, we have implemented a design example from Table I (diffeq) into a layout with a 32-b word-length datapath using the Hyundai Microelectronics library [37], which consists of 0.35-μm 3.3-V standard cells. First, the VHSIC hardware description language (VHDL) design is compiled and converted to a CDFG using a toolkit [38]. An annotated CDFG is then obtained through the HLS process, including our method of scheduling and control synthesis. Finally, after RTL netlist generation, a layout is generated by a placement/routing tool [39] using triple-metal layers. We have obtained a layout of 0.64-mm² silicon area, excluding power ring and pad frame. For postlayout design analysis, delay information is extracted from the layout and back-annotated into the RTL design. We have measured an execution time of 81.6 ns and power dissipation of 55.4 mW using an analysis tool [40]. We have also built a layout for the RTL design obtained by conventional LS. The comparison of design quality is shown in Table IV. We see that throughput increases by 41.9% at the expense of a 32.8% power increase. However, note the small increase in the required energy and area and also that the improvement in throughput is almost the same as that in Table II.
IX. CONCLUSION
In this paper, we propose a new scheme of HLS that aims at high-performance datapath-intensive design. The key concepts in the proposed scheme are noninteger multicycling, chaining, postschedule clock selection, and their synergy when combined with BLC. The proposed scheduling algorithm carries out resource-constrained scheduling in the continuous-time domain with the objective of minimizing latency and then the waste time of clock slack is minimized through the optimization of the system clock period.
The proposed approach can be extended to solve the latency-constrained area-minimization problem. It can also incorporate postlayout optimization by performing clock selection after layout generation and delay extraction. Our approach exploits even the sharing of chained functional units, thus reducing the area drastically without significant delay increase. Experimental results show significant improvement in throughput and area compared to previous approaches.
In future work, we will be investigating the application of the proposed scheme to the synthesis of hardware with complex control structure and pipelining. To support conditional branches and loops, we can extend our approach by adopting the method used in force-directed scheduling (FDS) [3] . In this case, we use the concept of force as the priority instead of critical path delay. The time frame of a node is computed by two accumulated delays: one from the start node and the other from the end node. The time-frame length can be a noninteger multiple of the cycle time. To support functional pipelining, we can adopt the scheme used in [3] with some modification.
