Abstract
Introduction
Various trade-offs during behavioral synthesis have been shown to have a major influence on design parameters such as area, power, and delay. For example, work in synthesis for low power has shown that decisions at the behavior level can result in an order-of-magnitude power reduction [1, 2]. Behavioral synthesis is the process of translating a circuit's behavioral description, often presented in a high-level description language (VHDL, Hardware C), into an appropriate register-transfer level (RTL) design. The different synthesis tasks include scheduling, allocation, assignment, clock selection, and module selection. This paper focuses on the process of selecting a clock period for scheduling control-flow intensive (CFI) behavioral descriptions. The selection has to be made in such a way as to optimize the performance of the schedule obtained. Our technique accepts two inputs: a CFI behavior, which is a functional specification with significant control flow in the form of nested loops (possibly with unknown iteration bounds) and conditionals, and resource allocation information (the numbers and types of functional units, memories, and registers available for the design; note that this is different from assignment or binding, which specifies the mapping of variables, arrays, or operations to existing resources). It returns a small set of clock periods that can be used to derive a performance-optimized schedule.
Many previous works have addressed clock selection during high-level synthesis. The delay of the slowest component on the critical path is used in [3] to estimate the clock period from a data-flow graph. The shortcoming of this method is that functional units are not utilized fully. In [4], a clock estimation method based on slack minimization attempts to optimize performance by improving the utilization of functional units. In [5], a similar approach is used to find an optimal clock period; this method can only handle simple equi-probable conditional branches. Another optimal clock period selection technique is presented in [6], where a set of candidate clock periods is derived such that the optimal clock period is guaranteed to exist within the set. Resource- and performance-constrained clock estimation methods are proposed in [7, 8]. In [7], a 3D search engine is presented to trade off schedule length, area, and clock period. In [8], the methodology tackles pipelined designs and considers controller delays when searching for a set of candidate clock periods. In [9], the optimal clock period is derived from a synthesized RTL architecture. This method requires that some information be known beforehand about the number and type of resources and resource sharing in the datapath.
Most of the above techniques are similar in one aspect: they are geared towards data-dominated behaviors. Data-dominated behaviors are composed of one thread of execution and are constrained by a sampling period. CFI behavioral descriptions are characterized by multiple threads of execution and possibly unknown loop bounds. This adds non-determinism to the process of selecting a clock period. In [10], clock period optimization for CFI designs is addressed. However, the method is limited to behaviors with sequentialized loops and conditionals.
The algorithm presented in this paper has the ability to tackle CFI behaviors with multiple nested loops and conditionals without compromising any parallelism in the behavior. The algorithm examines the effect of the given resource information, but unlike previous methods, it also takes into consideration the effect of branch probabilities when estimating a clock period. The algorithm can function in two modes: (1) it can accept, as an input, the number of resources available, without any information about their exact type (e.g., it may be known how many adders are to be used in the design, but not whether they are ripple-carry or carry-lookahead), and return a set of candidate clock periods, or (2) it can accept information about the number and type of resources, and return a set of candidate clock periods that is typically smaller than that of mode (1).
The rest of the paper is organized as follows. Section 2 illustrates key algorithmic ideas through examples. Section 3 describes our algorithm. Section 4 reports experimental results and Section 5 concludes.
Motivation
Scheduling is the process of fixing the cycle-by-cycle behavior of the circuit by assigning operations in the behavioral description to states in a finite-state machine. The goal is to choose a clock period such that the execution time, which is the product of the clock period and the number of cycles in the schedule, is minimized. Unlike data-dominated behaviors, which are characterized by a fixed sampling period that has to be met, threads of execution in CFI behaviors vary in length and the length often depends on the input stimuli. For this reason, the expected number of cycles (ENC) metric is used to quantify the number of cycles needed for a CFI behavior to complete execution on average [11] .
In this section, we motivate our key ideas through examples. Example 1 illustrates the effect that the choice of clock period has on the performance of a behavior. Example 2 shows how making simplifying assumptions about branch probabilities (for example, assuming that all branches are equi-probable) can adversely affect the performance as well. Example 3 shows that constraints on functional unit allocation can also modify the criticality of paths in the behavior, and hence, change the optimal clock period.
First, consider using the maximum operator delay approach [3], where the delay of the slowest operation on the critical path is assigned as the clock period. Examining Test1 reveals that loop L3 depends on loop L1, and hence cannot be executed in parallel with loop L1, and similarly, loop L4 depends on both loops L1 and L2. If the conditional operation in loop L2 evaluates to true with high probability, then the critical path of the behavior is the sequence of loops L1, L3, L4. The maximum operator delay for this path is 14ns.
[...] cycles to complete execution. This loop has two paths: the if branch and the else branch. The if path is one instruction long, while the else path is five instructions long. Since L2 is 750 iterations long, we can estimate the expected number of cycles of loop L2 by using the formula:
where $p$ is the probability of taking the if branch. Each loop has some setup cycles associated with it. Figure 3 is another high-level view of a schedule; this one is derived using a 12ns clock. The number of cycles of execution for loops L1 and L3 remains the same. However, the number of cycles of execution for loops L2 and L4 has increased. This is due to [...] In the case of loop L4, this nearly doubles the number of cycles because the pipeline cannot be filled with useful work every cycle as before. However, this increase in the number of cycles does not increase the overall execution time of the schedule. On the contrary, there is an improvement in execution time from 29,540ns to 26,520ns. By examining the critical path of the behavior, it is clear that the bulk of the work is dominated by memory load and addition operations, both of which have a delay closer to 12ns than to 14ns. In other words, the 12ns clock period appears to be utilizing the critical operations more efficiently. For the remaining examples, we will make the simplifying assumption that we have one of two clock periods, 14ns and 12ns, to choose from.
The algorithm presented in Section 3 derives a high-level view of the schedule, as shown above, and examines the trade-offs for different clock periods.
Example 2: Consider again the behavior of Figure 1. The functional unit allocation constraints are the same as those used in Example 1. Again, the goal is to maximize the schedule performance. However, this time, the probability $p$ that the conditional operation in loop L2 evaluates to true is assumed to be 0.3, lower than in Example 1.
It is clear that the critical path within loop L2 has shifted from the if branch of the conditional to the else branch. The series of dependencies within the else branch prohibits any pipelining of operations and results in a path length of five operations. Since the delay of the multiply (*) operation exceeds 24ns, using a 12ns clock period will require three cycles for each such operation to complete. In other words, the else path has a length of nine cycles. Using this information, the number of cycles required for loop L2 to complete is 4,955. The total execution time for loop L2 increases from 12,624ns to 59,460ns. Not only did the critical path of L2 change, but the critical path of the schedule also changed to the sequence of loops L2, L4. As shown in Figure 4, the total execution time increased to 61,896ns. If we revert to the 14ns clock period, as shown in Figure 5, each * operation requires only two cycles to complete execution, and hence, the length of the else branch becomes seven cycles. Now, loop L2 completes execution in 3,905 cycles or 54,670ns. In addition, the time for loop L4 to execute reduces to 1,442ns, since its limiting operation can fit into one clock cycle. Choosing a clock period of 14ns reduces the execution time of the behavior by about 9.3%.
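The cycle counts quoted for loop L2 in this example can be checked with a small calculation. The sketch below assumes a weighted-average model of the loop body, in which each iteration costs the if-path length with probability $p$ and the else-path length otherwise. It also assumes an if path of one cycle at both clock periods and five setup cycles; these two values are assumptions chosen because they reproduce the reported totals, and the function name `expected_cycles` is hypothetical.

```python
from fractions import Fraction


def expected_cycles(iterations, p_if, if_cycles, else_cycles, setup_cycles):
    """Expected number of cycles of a loop whose body is a single if/else."""
    per_iteration = p_if * if_cycles + (1 - p_if) * else_cycles
    return iterations * per_iteration + setup_cycles


P_IF = Fraction(3, 10)  # probability of taking the if branch in Example 2
IF_CYCLES = 1           # assumed: the if path fits in one cycle at both clocks
SETUP_CYCLES = 5        # assumed: chosen so that the totals match the text

results = {}
# (clock period in ns, length of the else path in cycles at that clock)
for clock_ns, else_cycles in ((12, 9), (14, 7)):
    enc = expected_cycles(750, P_IF, IF_CYCLES, else_cycles, SETUP_CYCLES)
    results[clock_ns] = (int(enc), int(enc * clock_ns))
    print(f"{clock_ns}ns clock: {int(enc)} cycles, {int(enc * clock_ns)}ns")
```

Under these assumptions the 12ns clock gives 4,955 cycles (59,460ns) and the 14ns clock gives 3,905 cycles (54,670ns), which shows why the shorter clock period loses here: the extra cycle per multiply outweighs the shorter cycle time.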
In the example above, we were able to evaluate the effect that branch probabilities have on each of the loops independently. This is because the resource constraint provided enough hardware to allow each loop to execute at its maximum rate. It is often the case, however, that these loops will contend for resources, and the algorithm needs to take the interaction of both branch probabilities and resource contention into account, as shown next.
Example 3: In this example, we consider the problem of choosing a clock period given the same functional unit allocation as before, but with only one adder of type add1 instead of four. The branch probabilities are the same as those used in Example 2. Four adders in the previous examples allowed the pipelining of operations in loops L1 and L3, which resulted in the initiation of an addition operation every cycle. One adder limits this initiation and results in a three-fold increase in the number of cycles required to complete each of these loops. Therefore, loops L1 and L3 will require approximately 3,000 cycles each to complete. In addition, there is contention between loops L1, L3 and loop L2 for the use of the adder. Loop L2 is expected to require the adder for 225 iterations. Since loop L2 is in parallel with loop L1 for about 3,000 cycles, approximately 173 of the additions will contend with loop L1, and the remaining 52 will contend with loop L3. If we assume that the resource conflicts are resolved equally among the loops, then the numbers of cycles for loops L1, L2, and L3 are 3,091, 4,018, and 3,029, respectively. Figure 6 illustrates the completion time for all the loops under these conditions.
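The contention figures of this example can be approximated with a short calculation. The sketch below is one possible reading of "resolved equally", namely that each of two contending loops is delayed by one cycle for half of the conflicts; this reading, the function name `with_contention`, and the use of exactly 3,000 base cycles for L1 and L3 are assumptions. The expected number of additions issued by L2 follows from its 750 iterations and an if-branch probability of 0.3, of which 173 overlap loop L1.

```python
def with_contention(base_cycles, conflicts):
    """Cycle count of a loop that absorbs half of its adder conflicts.

    Assumption: each conflict costs one cycle and is charged to the two
    contending loops in equal shares.
    """
    return base_cycles + conflicts / 2


ITERATIONS, P_IF_PERCENT = 750, 30
l2_additions = ITERATIONS * P_IF_PERCENT // 100  # expected additions in L2
conflicts_l1 = 173                               # additions overlapping L1
conflicts_l3 = l2_additions - conflicts_l1       # the rest overlap L3

estimate = {
    "L1": with_contention(3000, conflicts_l1),   # about 3,000 cycles alone
    "L2": with_contention(3905, l2_additions),   # 3,905 cycles at 14ns alone
    "L3": with_contention(3000, conflicts_l3),   # about 3,000 cycles alone
}
reported = {"L1": 3091, "L2": 4018, "L3": 3029}

for loop in ("L1", "L2", "L3"):
    print(loop, "estimated", estimate[loop], "reported", reported[loop])
```

Under this reading the estimate for L2 lands within one cycle of the reported value. The small gaps for L1 and L3 are consistent with their base counts being only approximately 3,000 cycles.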
Clearly, under these resource conditions the critical path has once again shifted to the sequence of loops L1, L3, L4. As we have seen in Example 1, using a 14ns clock under-utilizes the operations on the critical path (predominantly the memory loads and additions). Once again, if we choose a clock period of 12ns, the operations are utilized more efficiently, and as can be seen in Figure 7, we can further improve the completion time for the schedule by 11,246ns.
The above examples have illustrated a few important points. Similar to previous research, we have seen that the critical path is a major determinant of what the clock period should be. We have also seen that branch probabilities and resource constraints can change the critical path and hence, what the target clock period should be in order to optimize performance. We cannot assume that the critical path will remain the same when any one or more of these factors is changed, and therefore, a detailed yet fast critical path analysis is required when evaluating candidate clock periods.
The Clock Selection Algorithm
In this section, we present details of our clock selection algorithm. Figure 8 gives a high-level block diagram of the algorithm. It is composed of two blocks. Clock candidate generation decides which clock periods are good candidates for selection, and critical path analysis evaluates the clock periods and their effect on the behavior. These two blocks interact in an iterative fashion to produce a solution. The input to the algorithm is a behavioral description of the circuit, often presented as a control-data flow graph (CDFG), a resource library, and resource constraints. The resource library is a list of pre-characterized resources available to the designer and includes data such as resource area, delay, and power. The resource constraint may be provided as: (1) the number of resources available, without information about their exact types, or (2) the number and types of resources available.
Section 3.1 describes the clock candidate generation algorithm and Section 3.2 explains our critical path analysis technique, which is based on our performance estimation engine [12]. For the sake of completeness, we present some of the important concepts of this estimation engine.
Clock Candidate Generation
In this section, we describe the clock candidate generation procedure. The algorithm makes use of two observations: (1) the clock period should be optimized for the critical path in the behavior, and (2) the most frequent resources on the critical path must be utilized as much as possible. However, these two observations do not rule out the possibility that choosing one clock period over another may actually change the critical path. For this reason, it is necessary for the algorithm to provide any such information to the designer by returning a set of clock periods.
The resource information is now used to identify the delay of the operation chosen. If no information is given by the user about the type of resource (e.g., whether the adder is a carry-lookahead adder or a ripple-carry adder), then all available resources in the design library for that operation are considered. The next step is to evaluate candidate clock periods. The set of candidate clock periods is augmented with the delay, $d$, of the current operation and quotients of the delay of the form $d/2$, $d/3$, ..., plus a user-defined threshold, $t$, that examines clock periods above the delay of the operation in fixed steps (again, a larger $t$ will mean more CPU time expended in searching). While the quotient set can be as large as required, in practice $d/2$ was found to suffice. For example, if the delay, $d$, of an operation is 10ns, and $t$ is 2ns, then the resulting set contains six clock periods, among them 10ns and 12ns. Each of these clock periods is evaluated using the performance estimation engine and the result is stored in a look-up table (if a clock period has been previously encountered, it is not reevaluated). At this point, it is important to examine whether or not the critical paths have changed (this is most likely to occur in the first iteration). If this is the case, a new ranking has to be performed for the new critical paths and the process repeated. However, this is only done once all the operations on the current critical path have been evaluated. It is typically the case that the optimal clock period will lie within a small number of ranges, and hence, many of the clock periods arising from a new critical path will have been already encountered.
Once the possible critical paths have been examined, a set of clock periods is returned. The set includes the best clock period for the different resource combinations that can arise on the various critical paths. The more rigid the resource information is, the smaller the set of clock periods.
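The candidate generation step can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the step size above the delay is not stated in the text, so `step_ns` is a hypothetical parameter, and the output is not meant to reproduce the exact set of the 10ns example. The names `candidate_periods`, `evaluate_all`, and `toy_estimate` are likewise hypothetical, and `toy_estimate` is a placeholder for the performance estimation engine.

```python
from fractions import Fraction


def candidate_periods(delay_ns, threshold_ns, step_ns, max_divisor=2):
    """Candidate clock periods derived from one operation delay.

    Includes the delay d, its quotients d/2 .. d/max_divisor, and periods
    above d in fixed steps up to d + threshold.
    """
    delay, step = Fraction(delay_ns), Fraction(step_ns)
    candidates = {delay / k for k in range(1, max_divisor + 1)}
    steps = int(Fraction(threshold_ns) / step)
    candidates |= {delay + i * step for i in range(1, steps + 1)}
    return sorted(candidates)


def evaluate_all(periods, estimate, table):
    """Evaluate each period once, caching results in a look-up table."""
    for period in periods:
        if period not in table:  # previously seen periods are skipped
            table[period] = estimate(period)
    return table


calls = []


def toy_estimate(period):
    """Placeholder for the estimation engine; records each evaluation."""
    calls.append(period)
    return float(period) * 1000


table = {}
evaluate_all(candidate_periods(10, 2, 1), toy_estimate, table)
evaluate_all(candidate_periods(5, 1, 1), toy_estimate, table)
print(sorted(table))
```

The second call shares the 5ns period with the first, so the look-up table prevents a repeated evaluation. This mirrors the observation that clock periods arising from a new critical path have often been encountered already.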
Critical Path Analysis
Traditional methods of evaluating schedule performance typically fall within simulation-based techniques or statistical techniques. As the name implies, the first method performs a simulation of a fully functional and correct schedule, and evaluates it on a cycle-by-cycle basis. Obviously, this method is accurate but compute-time intensive. The second method models the schedule mathematically (as a Markov model for example) and derives a measure of its performance by solving a series of equations. This method also depends on a fully functional and correct schedule.
Such methods have at least $O(n^3)$ complexity, where $n$ is the number of states in the state-transition graph. It is typically the case that an aggressively derived schedule for CFI behaviors will have over 200 states.
We use a performance estimation technique first introduced in [12] . The estimation algorithm takes advantage of two observations: (1) for the sake of a performance estimate, a fully functional schedule is not necessary, and (2) CFI behaviors typically exhibit the 90/10 locality rule [13] . This rule states that 90% of a program's execution time is spent in 10% of its code. The algorithm, therefore, attempts to identify and evaluate these threads of execution.
First, the algorithm extracts a phase graph from the behavior. A phase graph is an acyclic graph whose nodes represent phases and whose edges represent transitions from one phase to other phases. A phase, $\Phi_S$, is defined as a set of loops, $S = \{L1, L2, ..., LN\}$, executing simultaneously. When any loop changes state, i.e., falls through, or a new loop begins to execute, a new phase, $\Phi_{S'}$, is entered where a new set, $S'$, of loops is being executed. Each edge, $\Phi \rightarrow \Phi'$, is associated with a transition probability, $p_{\Phi \rightarrow \Phi'}$.
Consider the behavior given in Figure 10, which consists of three loops. Loops L2 and L3 execute in parallel and both depend on loop L1. The phase graph for the example of Figure 10 is given in Figure 11. This graph is used to identify the critical paths in the behavior. This is done by pruning out all paths whose probabilities of execution lie below a user-defined threshold, $p_t$. For example, in Figure 11, if we choose $p_t$ to be 0.05, then the transition into $\Phi_{L3}$, which has a probability of 0.03, is pruned and we only consider the phases enclosed within the dashed line. The performance estimate is then given as the sum of execution times over all phases on the critical path. To calculate the time taken for each phase to execute, we use an engine that may be viewed as a machine with a set of resources on which the behavior is running as a program. This engine evaluates the rate, $r_L$, at which operations execute without performing a full schedule. For behaviors without conditionals, it is found that the rate at which a loop executes is equal to the rate of its slowest operation, $op_n$, or:
Since it is the case with CFI behaviors that many operations do not execute with probability 1 due to branching and nested loops, the above equation can be modified to take this into account:
where $p_i$ is the probability that an operation, $op_i$, will execute. Similarly, the rate of phase $\Phi$ is determined by the slowest loop, and hence we can replace $r_L$ in Equation (2) by $r_\Phi$. The estimated time of execution is then given by:
where $T(\Phi_S)$ is $1/r_{\Phi_S}$, or the time taken for phase $\Phi_S$ to execute, and $p(\Phi_S)$ is the probability of entering phase $\Phi_S$.
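The estimation procedure of this section can be sketched in a few lines. The sketch assumes the following forms for the quantities defined above, which are consistent with the surrounding definitions but are not confirmed by the text: a loop rate $r_L = \min_i (r_{op_i} / p_i)$, a phase rate equal to the rate of its slowest loop, and an estimated time $T = \sum_S p(\Phi_S) \, T(\Phi_S)$ with $T(\Phi_S) = 1/r_{\Phi_S}$. All function names, rates, and the 0.97 transition probability are hypothetical; only the 0.03 transition and the 0.05 threshold come from the example of Figure 11.

```python
from collections import defaultdict


def loop_rate(ops):
    """Rate of a loop from (rate, probability) pairs of its operations."""
    return min(rate / prob for rate, prob in ops if prob > 0)


def phase_time(loops):
    """Time of a phase: reciprocal of the rate of its slowest loop."""
    return 1.0 / min(loop_rate(ops) for ops in loops)


def estimated_time(phases, edges, order, p_t):
    """Sum of p(phase) * T(phase) over phases that survive pruning.

    order lists the phases in topological order, root first. Edges whose
    probability is below p_t are dropped; probabilities are not rescaled.
    """
    kept = {edge: p for edge, p in edges.items() if p >= p_t}
    entry = defaultdict(float)
    entry[order[0]] = 1.0
    for phase in order:
        for (src, dst), p in kept.items():
            if src == phase:
                entry[dst] += entry[phase] * p
    return sum(entry[ph] * phase_time(phases[ph]) for ph in order)


# Toy phase graph shaped like Figure 11: L1, then L2 and L3 in parallel.
phases = {
    "L1": [[(0.004, 1.0), (0.001, 0.5)]],      # one loop, two operations
    "L2,L3": [[(0.01, 1.0)], [(0.005, 1.0)]],  # two loops in parallel
    "L2": [[(0.01, 1.0)]],
    "L3": [[(0.005, 1.0)]],
}
edges = {("L1", "L2,L3"): 1.0, ("L2,L3", "L2"): 0.97, ("L2,L3", "L3"): 0.03}
order = ["L1", "L2,L3", "L2", "L3"]

pruned = estimated_time(phases, edges, order, p_t=0.05)
full = estimated_time(phases, edges, order, p_t=0.0)
print(pruned, full)
```

With the toy numbers, pruning the 0.03 transition changes the estimate by less than one percent, which illustrates why low-probability phases can be ignored when only a performance estimate is needed.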
Experimental Results
In this section, we demonstrate the capabilities of our algorithm by applying it to several example behaviors. The behaviors are characterized by control-flow operations and loops. To evaluate a given clock period, we feed the behavior, clock period, and resource information to an aggressive scheduler [14] . The scheduler performs loop unrolling and parallel loop optimization, and returns a state-transition graph and a performance measure or ENC. The benchmarks used include: MmT, a memory-intensive benchmark from [12] , Greatest Common Divisor (GCD) [15] , a Blackjack dealer process (Dlr) [10] , the send process of the X.25 communications protocol [11] , and Ocn and RyT, which are modified versions of the SPLASH-2 benchmarks Ocean and RayTrace [16] .
The first set of experiments (Table 2 ) examines the set of clock periods obtained using our algorithm and its effect on the ENC and expected execution time (EET) when information about both the number and type of resources available is given to the algorithm. The EET is the product of the ENC and clock period. We compare the results to the maximum operator delay [3] and to the optimal clock period. The maximum operator delay is the delay of the slowest resource used in the behavior and the optimal clock period is found using an exhaustive search. Note that since scheduling is done (to determine the ENC) after the clock selection algorithm has been run, it is possible for our algorithm to yield two clock periods which have the same ENC, e.g., MmT and X.25. Clearly, one of these clock periods is inferior to the other. We see that the optimal clock period is always within the set our algorithm derives.
The second set of experiments (Table 3) demonstrates our algorithm's ability to identify a set of clock periods within which the optimal clock period exists given only the number of resources available without any information about their types. We again compare our results against maximum operator delay and optimal clock period based results. For the optimal clock period search, we need to use information about the types of resources. This is not given to our algorithm, since the goal here is to show that even without this information our algorithm can still identify the set in which the optimal clock period lies.
Next, we show our algorithm's ability to adapt to different resource constraints provided to the behavior (Table 4) . We modify the resource constraints for benchmark MmT and compare the results for the clock period obtained using our algorithm to those obtained using the optimal clock period. In this experiment, we adjust the algorithm to return only a single clock period (this is done by choosing the clock period that produces the best performance estimate). Table 4 shows that with the exception of one example, the optimal clock period was found. The single discrepancy may be attributed to the slight inaccuracy associated with the performance estimation procedure. However, even in this case, the EET is very close to the optimal.
The algorithm was run on an SGI Challenge workstation with 256MB of memory. Typical run times for our algorithm ranged from 40 seconds to 5 minutes, compared to a range of 1.5 hours to 9 hours when performing the optimal clock period search.
Conclusions
In this paper, we presented an algorithm that performs clock selection for CFI behavioral descriptions. The algorithm examines the critical paths in the behavior and chooses a set of clock periods that attempt to utilize the predominant functional units on these paths as efficiently as possible. The algorithm accepts as input either detailed information about the resources allocated for the datapath, if it is known, or simply the number of resources available, in which case it evaluates all possible resource types available. Experimental results demonstrate the algorithm's ability to produce optimal clock periods, while adapting to different branch probabilities and resource constraints.
