This paper addresses the problem of true delay estimation during high level design. The true delay is the delay of the longest sensitizable path in the resulting circuit, as opposed to the topological delay which is the delay of the longest path in the circuit. The existing delay estimation techniques either estimate the topological delay, which may be pessimistic if the longest path is unsensitizable or false, or estimate the true delay using gate-level timing analysis which may be prohibitively expensive.
Introduction
In the process of designing a circuit from a behavioral specification, parameters like area, delay, or power consumption of the final implementation influence many of the high-level trade-off decisions [1]. Exact values of these parameters are attainable only by way of a time-consuming implementation of the design. Hence the need for fast and accurate estimation techniques. In this paper we present a comprehensive analysis of the role of resource sharing during the high level synthesis process in creating false paths, and address the problem of true delay estimation in the presence of false paths created during high level design.
The topological delay of the longest path in the circuit, while simple to estimate, can be overly pessimistic, since many long paths may not be sensitizable [2, 3]. Several gate-level timing analysis techniques have investigated the exact conditions under which a path can affect the clock period of the circuit [2, 3, 4, 5, 6]. We refer to the conditions as sensitization conditions, and the paths which affect the delay of the circuit as sensitizable paths, as explained in Section 2. The sensitizable paths are also referred to in the literature as true paths. Any path which is not sensitizable is termed an unsensitizable or false path. The true delay of a circuit is the delay of the longest sensitizable path in the circuit. A circuit will operate correctly if its clock period is greater than or equal to the true delay of the circuit.
Gate level timing analysis techniques which consider path sensitization for circuit delay estimation have been presented in [2, 3, 4, 5, 6]. As opposed to fast topological delay calculation, the path sensitization based techniques have the potential of calculating accurately the true delay of the circuit. However, since they have to check sensitization conditions for every path, the gate-level timing analysis techniques can be computationally very expensive, and not feasible on large circuits or circuits having arithmetic functions. Specifically, the gate-level timing analysis techniques are too slow to be used during high level synthesis.
High Level Delay Estimation
Delay models and delay estimation techniques for behavioral level synthesis have been presented in [7, 8, 9, 10]. The effect of the controller on the delay of a circuit and estimation of the corresponding delay has been addressed in [11]. These techniques compute estimates of the topological delay of the circuit at the RT-level. Some of the techniques incorporate the effect of layout while computing an estimate for the topological delay [8, 9, 10].
A high level timing model based on the behavioral specification of a circuit has been proposed in [12]. The model is used for estimation of the clock period prior to scheduling and resource sharing. The effect of resource sharing on the delay of the final implementation can be significant, but is not considered. Techniques for clock period estimation while scheduling have been proposed in [13]. However, they do not consider the effect of the delay of the controller part of the circuit.
A delay estimation technique using topological and path sensitization has been proposed in [14]. However, the method relies on true path delay analysis at the gate-level, which is prohibitively expensive for the data path part of the design. Hence, the technique can estimate the true delay of only the control part of the design.
Functional information has been used for a refined timing analysis at a high-level instead of at the gate-level in [15]. The concept of functional false paths which arise due to unused functionalities of chained multi-function ALU units is defined. While computing the delay, the functional false paths are avoided. However, false paths due to resource sharing and the effect of control signals on the datapath delay are not considered.
False paths can exist in control flow graphs (CFG) of behavioral descriptions [16]. It was shown in [16] that the presence of false paths in a CFG can lead to false (unsensitizable) paths in the implementation. Techniques were introduced to identify and eliminate the false paths in CFGs, thereby eliminating unsensitizable paths in the implementation due to false paths in the CFG.
The New True Delay Estimation Approach
In this paper, we address the problem of true delay estimation during high level synthesis of behavioral specifications. We show that even when the specification does not have false paths, resource sharing can introduce false paths in the implementation, as has been earlier observed in [17]. We provide a comprehensive analysis of the sources of false paths during resource sharing and assignment which are different from the sources mentioned in [15]. Based on the analysis, we propose a two-phase approach to true delay estimation at the high-level: (i) partitioning the paths of the delay graph, an abstract representation of the final RT-level implementation, into two sets, a complete determining path set and a non-determining path set, and (ii) estimating the topological delay of the longest path in the complete determining path set, which is shown to be a close estimate of the true delay of the circuit.
We show that the paths in a circuit can be partitioned into two sets, the complete determining path set CDP, and the non-determining path set NDP. Both sets may include sensitizable paths, but we will prove that for any sensitizable path in NDP, there is a longer path in CDP. Thus the delay of the longest path in CDP is greater than or equal to the true delay of the circuit. Consequently, the correct clock period of the circuit can be computed by measuring the topological delay of the paths in CDP.
A CDP and NDP partition is not necessarily unique. In a trivial partition, the set CDP consists of all the paths in the circuit. In this case, the delay of the longest path in CDP is the topological delay of the circuit and may not be a useful estimate of the true delay of the circuit.
In this paper, we show how to identify one possible non-trivial CDP and NDP partition of the paths of the delay graph, using the information about scheduling and resource sharing available during the high level design process. The sets in the partition identified by our method are named $CDP_R$ and $NDP_R$. The set $CDP_R$ consists of register to register paths in the delay graph that execute the sequences of operations in each state of the schedule. The set $NDP_R$, on the other hand, consists of register to register paths in the delay graph which are created due to resource sharing, but which, taken as complete paths, do not execute any sequence of operations in any state of the schedule. The effectiveness of the proposed partitions is demonstrated by the experimental results discussed later, which show not only that a significant percentage of the paths belong to the set $NDP_R$ and hence need not be considered to estimate the true delay, but also that the longest path in $CDP_R$ is considerably shorter than the longest path in $NDP_R$, and that the true delay estimates obtained compare very well with the actual true delay calculated at the gate-level.
The second phase of the true delay estimation approach consists of estimating the topological delay of the longest path in the set $CDP_R$. The delay of the paths in $CDP_R$ can be calculated in two ways: (i) developing and using RT-level delay models for the various components of the delay graph, or (ii) mapping the paths in $CDP_R$ to a set of paths in the corresponding gate-level or transistor-level implementation, and using more accurate delay models, including wiring delay. Since the intended use of the proposed true delay estimation is in an iterative high-level synthesis framework, we take the first approach, which leads to fast estimation at the delay graph level.
The proposed technique for true delay estimation has been incorporated into a high level timing analysis tool FEST. Experimental results on a set of benchmarks reveal the following: about 50% of all paths are in $NDP_R$, which can be ignored for true delay estimation, and the true delay estimates are on average 15% less than the topological delay. The high level true delay estimates are accurate, as verified by comparing with the true delays obtained by gate-level timing analysis on actual implementations. The accuracy of the high level true delay estimates confirms the quality of the $CDP_R$/$NDP_R$ partitioning, as well as the delay modeling done in the second phase. Furthermore, results reveal that high level true delay estimation can be done very fast, even when gate-level true delay computation, using expensive path sensitization techniques, becomes infeasible.
In Section 2, we explain the basic concepts in true delay analysis with respect to simple gates. In Section 3, we briefly describe the high level synthesis process. The motivation for high level true delay estimation is explained in Section 4. Since we intend to perform timing analysis on RT level circuits produced by a high level synthesis process, the timing analysis concepts of Section 2 are extended to complex gates and RT level components in Section 5, and are used to define the properties of the CDP/NDP partitions which allow fast true delay estimation. It is shown that to determine the correct clock period, it is necessary to consider only paths in the set CDP. A detailed analysis of the sources of paths in $NDP_R$ (which is an NDP set) due to resource sharing in high level synthesis is given in Section 6. The analysis is utilized in Section 7 to identify the paths in the $CDP_R$ set using high-level information without using path sensitization. Delay estimation of paths in the set $CDP_R$, based on the underlying implementation, is explained in Section 8. The complete algorithm for generating the set of paths in $CDP_R$ and estimating the true delay of the circuit from the delay of the paths in $CDP_R$ is given in Section 9. Experimental results are presented in Section 10, followed by applications of our delay estimation technique in Section 10.1 and conclusions in Section 11.
Basic Concepts in Timing Analysis
In this section we define certain properties of acyclic combinational circuits. We assume the floating mode of operation in which the initial state of the circuit is unknown. Much of the notation in this section has been taken from [2]. The delays of a gate g and a lead f are denoted by d(g) and d(f). Let $p = (f_0, g_1, f_1, \ldots, g_{m-1}, f_{m-1})$ be a path in the combinational circuit, where $f_i$ is a lead and $g_j$ is a gate. The leads $f_0$ and $f_{m-1}$ are a primary input and output respectively. All inputs to $g_j$ other than $f_{j-1}$ are called side-inputs of gate $g_j$.
A logic value is the controlling value of a gate if the logic value at an input of the gate determines the gate output independently of the other inputs. Otherwise the logic value is called a non-controlling value. The controlling and non-controlling values for a gate g are called c(g) and n(g) respectively. For example, if g is an AND gate, c(g) = 0 and n(g) = 1.
On applying a primary input vector v at time t = 0, eventually the logic value at every node in the circuit will stabilize. The stable logic values under v at any lead f and at the output of gate g are denoted as $sv(v, f)$ and $sv(v, g)$ respectively. The times when these values become stable are denoted as $st(v, f)$ and $st(v, g)$ respectively.
Definition 1 We say that $f_i$ dominates $g_{i+1}$ if any one of the following conditions is true.
Definition 2 A path p is sensitizable if there is at least one input vector v under which every lead $f_i$ on p dominates $g_{i+1}$, $0 \le i \le m - 2$.
Sensitizable paths are often referred to as true paths while unsensitizable paths are referred to as false paths.
Definition 3 The true delay of a circuit is the delay of the longest sensitizable path in the circuit. A clock period of a circuit greater than or equal to its true delay is a correct clock period for the circuit.
High Level Synthesis: An Overview
In this section, we give a brief overview of the high level synthesis process. In the next section, we show how resource sharing in high level synthesis can create a set of paths which cannot affect the clock period of the circuit. High level synthesis starts with a behavioral specification of the target circuit along with constraints imposed by the designer. The goal is to design a circuit which implements the behavior while satisfying the constraints. We explain the three main phases: scheduling, assignment and RT-level circuit structure derivation.
Scheduling a behavioral description consists of clustering the operations of the behavioral description into states such that the operations in each state can be executed in one clock cycle without violating design constraints [18]. Executing the schedule is equivalent to executing the behavior. Consider the control flow graph (CFG) shown in Figure 2(a), which has been extracted from the behavioral description of the dealer process [19]. The resources available after resource allocation are: one (+) adder, one (+/-) ALU, one (<) comparator and one (!=) comparator. A schedule is given in Figure 2(b). Scheduling operation 8 in the same state as operations 1 to 7 with the allocated resources would have created a combinational loop [20]. We first provide the conditions under which two operations can share a resource and then explain why including operation 8 in the same state as operations 1 to 7 will create a combinational loop.
A pair of operations can be assigned to the same resource if, (a) they are never executed in the same clock cycle and, (b) it is possible to execute the operations on the resource. Two operations are never executed in the same clock cycle if (a) they are scheduled in separate states, or (b) they are in the same state but on mutually exclusive paths.
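The two sharing conditions can be sketched as a small predicate. This is only an illustration: the `Op` record, the `mutex` relation and the `unit_kinds` sets are hypothetical names introduced here, and only the facts stated in the text about operations 2, 3, 5 and 8 of the dealer example are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    ident: int   # operation number in the CFG
    kind: str    # operator symbol, e.g. "+" or "<"
    state: str   # state of the schedule the operation is placed in


def never_same_cycle(a, b, mutex):
    """Different states, or same state on mutually exclusive paths."""
    if a.state != b.state:
        return True
    return frozenset((a.ident, b.ident)) in mutex


def can_share(a, b, unit_kinds, mutex):
    """Both operations must run on the unit and never in the same cycle."""
    executable = a.kind in unit_kinds and b.kind in unit_kinds
    return executable and never_same_cycle(a, b, mutex)


# Facts taken from the text: operations 2 and 8 use the (+) unit in
# different states; operations 3 and 5 are mutually exclusive in one state.
ops = {
    2: Op(2, "+", "s0"),
    3: Op(3, "<", "s0"),
    5: Op(5, "<", "s0"),
    8: Op(8, "+", "s1"),
}
mutex = {frozenset((3, 5))}
adder = {"+"}
comparator = {"<"}

share_2_8 = can_share(ops[2], ops[8], adder, mutex)       # different states
share_3_5 = can_share(ops[3], ops[5], comparator, mutex)  # mutually exclusive
share_2_3 = can_share(ops[2], ops[3], adder, mutex)       # "<" not on adder
```

Under these assumptions the first two pairs may share a unit and the third may not.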
Let us assume that all the operations 1 through 8 are in the same state. Since operation 8 is not mutually exclusive to any of the operations 2, 6 or 7, it would have to be exclusively assigned to the adder unit. Operations 2, 6 and 7, being mutually exclusive, can be assigned to the ALU. There is only one comparator. Hence both operations 3 and 5 would be assigned to the comparator. Since operation 3 depends on operation 2 for its data, there would be a path from the ALU to the comparator. Operation 5 decides whether operation 6 or 7 should be executed on the ALU, hence there is a path from the comparator to the ALU. This creates a combinational loop between the comparator and the ALU [20]. To avoid combinational loops, which cannot be handled by synthesis tools, we schedule operation 8 in state $s_1$.
The assignment phase binds or assigns operations to available resources. However, for a fixed resource allocation and schedule, multiple feasible assignments of operations to the resources may exist. One of the possible assignments for our example is shown in Figure 2(b). Mutually exclusive operations have been assigned to the same resource. Note for example that operations 3 and 5 have been assigned to the (<) unit.
An implementation for the given schedule and assignment is shown in Figure 2(c). The implementation consists of registers, functional units, multiplexors and control logic. The registers store the value of variables. The functional units implement the functions in the behavioral specification. The transfer of data between the functional units and the registers is controlled by the multiplexors and control logic. The interconnections in the circuit are defined by data and control dependencies as we illustrate in the following paragraphs.
In the implementation shown in Figure 2(c), there is a path from the registers corresponding to Card and Incr to the (+) unit to which operation 2 has been assigned, since operation 2 gets its data from Card and Incr. Since operation 8 also has been assigned to the (+) unit, multiplexors at the input of the (+) unit are required to select the data inputs depending upon whether operation 2 or operation 8 is executed on the (+) unit. Similarly, there is a path from the (+) unit to the (<) unit, since operation 3, which is assigned to the (<) unit, depends for its data on the result of operation 2, which has been assigned to the (+) unit. Control dependencies arise due to sharing of resources. Operations 3 and 5 have been assigned to the (<) unit. Since operation 1 decides whether operation 3 or 5 should be executed, the output of the (!=) unit to which operation 1 has been assigned controls the muxes at the input of the (<) unit.
In our implementation, the state of the schedule is stored in the state register (Figure 2(c)).
High Level True Delay Estimation: Motivation
High level synthesis can produce false paths in the implementation. Since many decisions in the design process of a circuit are based on the clock period or an estimate of the clock period of the circuit, true delay estimation allows cost effective implementations. However, true delay estimation at the gate level can be prohibitively expensive and is impractical in an iterative high level synthesis framework. This motivates us to investigate the possibility of faster techniques for true delay estimation using high level information. Reconsider the control flow graph (CFG) shown in Figure 2(a). Assume that the user specified clock period constraint is 50 ns. As before, the allocated resources are: one (+) adder, one (+/-) ALU, one (<) comparator and one (!=) comparator. A schedule which satisfies the resource constraints and also minimizes the total number of clock cycles to execute the behavior is shown in Figure 2(b). Derivation of the schedule and the implementation in Figure 2(c) was discussed in Section 3. The longest path p in the implementation, (state, CASE State of logic, (+), (<), ALU, Card), has a delay of 62.5 ns. The circuit will function correctly if clocked at intervals of 62.5 ns or more. However, this is unacceptable since the delay exceeds the specification of 50 ns. We propose alternate solutions which might satisfy the clock period constraint.
The first alternative to be discussed introduces more states in the schedule. Note that operations 2 and 3 in the CFG are both in the same state $s_0$ and have been assigned to the (+) and the (<) unit respectively in Figure 2(c). Since operation 3 depends on operation 2 for its data, there is a chaining between the (+) and the (<) units on p. Scheduling operations 2 and 3 in different states would break this chain and the longest path p. The delay of the longest path in the circuit corresponding to the modified schedule is 40 ns and satisfies the clock period constraint. However, the modified circuit would now require an extra state (clock cycle) to execute some of the CFG paths, for example the path consisting of operations (1, 2, 3, 4, 8). Thus extra clock cycles would be required to execute the specification.
The second alternative, as discussed below, requires more resource units. There is a path from the (<) unit to the (+/-) unit since operations 6 and 7 have been assigned to the (+/-) unit and operation 5, which is assigned to the (<) unit, decides whether operation 6 or 7 should be executed. This path could be broken if operations 6 and 7 were assigned to different functional units. Though this would break p and produce a circuit satisfying the clock period constraint, it would require more resources.
However, in Section 6 we show that resource sharing in behavioral synthesis can create unsensitizable or false paths which do not affect the delay of the circuit. One such unsensitizable path which is created due to resource sharing is the longest path p in Figure 2(c). The longest sensitizable path in the circuit is (state, CASE State of logic, (+), (<), Card). This path has a delay of 49.0 ns, and hence satisfies the clock period constraint. If true delay estimation were done during synthesis, then the implementation in Figure 2(c) corresponding to the assignment in Figure 2(b) would be found to satisfy the clock period constraint. Consequently, neither of the two more expensive solutions discussed above would have to be used to satisfy the clock period constraint.
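The comparison in this section can be replayed numerically. The sketch below uses only the two path delays and the 50 ns constraint quoted above; it assumes, as the text states, that these two paths are the longest path and the longest sensitizable path of the circuit, and the dictionary layout and variable names are illustrative choices.

```python
# (delay in ns, sensitizable?) for the two paths quoted in the text.
paths = {
    "state, CASE State of logic, (+), (<), ALU, Card": (62.5, False),
    "state, CASE State of logic, (+), (<), Card": (49.0, True),
}
CLOCK_CONSTRAINT_NS = 50.0

# Topological delay: longest path, sensitizable or not.
topological_delay = max(delay for delay, _ in paths.values())

# True delay: longest sensitizable path only.
true_delay = max(delay for delay, ok in paths.values() if ok)

meets_with_topological = topological_delay <= CLOCK_CONSTRAINT_NS
meets_with_true = true_delay <= CLOCK_CONSTRAINT_NS
```

With the topological estimate the design appears to violate the constraint, while the true delay shows that it is met.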
It is assumed that a schedule of a behavioral description and an assignment of the operations in the schedule to the available resources is given. The goal is to compute the true delay of the resulting implementation in the presence of false paths. In the next section, Section 5, we extend the concepts of gate level true delay analysis introduced in Section 2 for circuits with complex gates since we will be considering circuits at the RT level. The concepts defined are used to prove properties of circuits that can be used for fast high-level true delay estimation.
Estimating True Delay from Paths Determining Output
The traditional timing analysis concept of sensitization partitions the paths in a circuit into two sets, the set of sensitizable paths, and the set of unsensitizable paths. We define the concept of determining paths which partitions the set of all paths into the complete determining path set or CDP and the non-determining path set or NDP. We prove an important relationship between the sets created by the above two partitioning schemes. The relationship allows true delay estimation of the circuit by measuring the topological delay of the longest path in the CDP set. As explained in subsequent sections, we use high-level information to compute a complete determining path set and use it for fast derivation of an estimate of the true delay of circuits. This avoids the need for expensive techniques like path sensitization for computing the true delay of the circuit.
We begin this section by generalizing the concepts of dominating inputs and path sensitization for circuits with complex modules like adders and multiplexors. We next define and illustrate the concept of determining paths. Finally we prove the relationship between paths in a complete determining path set and the delay of the longest sensitizable path (or the true delay) of the circuit which allows true delay estimation from the paths in the complete determining set. In the following discussion, a gate g may be a simple gate or a complex module.
Definition 4 For a given assignment of values to the inputs of a gate, the controlling set for the gate is a subset of the inputs of the gate which, for the given assignment of values, determine the output of the gate regardless of the values assigned to the remaining inputs of the gate.
Example 1 Consider a simple AND gate with inputs x and y. If $(x, y) = (0, 1)$, $S = \{x\}$ is a controlling set. If $(x, y) = (0, 0)$, then either $S = \{x\}$ or $S = \{y\}$ is a controlling set. If $(x, y) = (1, 1)$, neither input by itself forms a controlling set. The controlling set in this case is $S = \{x, y\}$. In every case, the set of all inputs $S = \{x, y\}$ is a controlling set. □
Example 2 Consider a multiplexor, which is a complex gate. It has a select input sel. When sel is 0, input x is selected, else input y is selected. When $(sel, x, y) = (-, 1, 1)$, $S = \{x, y\}$ is a controlling set since irrespective of the value of the sel signal, the output will be 1. When $(sel, x, y) = (0, 0, 0)$, $S = \{sel, x\}$ and $S = \{x, y\}$ are both controlling sets. However, if $(sel, x, y) = (0, 1, 0)$, then $S = \{x, y\}$ is not a controlling set but $S = \{sel, x\}$ still is. The complete set of inputs, in this case $\{sel, x, y\}$, is always a controlling set for all possible input values. □
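The claims in Examples 1 and 2 can be checked by brute force directly from Definition 4. The following is a minimal sketch, not the paper's procedure: it enumerates every Boolean value of the inputs outside a candidate subset and tests whether the gate output stays fixed. The function names and the modeling of gates as Python functions are assumptions made for illustration.

```python
from itertools import combinations, product


def is_controlling_set(gate, names, values, subset):
    """True if fixing the inputs in `subset` to their given values forces
    the gate output, whatever Boolean values the other inputs take."""
    free = [n for n in names if n not in subset]
    outputs = set()
    for combo in product((0, 1), repeat=len(free)):
        assignment = dict(values)
        assignment.update(zip(free, combo))
        outputs.add(gate(**assignment))
    return len(outputs) == 1


def controlling_sets(gate, names, values):
    """All non-empty controlling sets of the gate under the given values."""
    found = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if is_controlling_set(gate, names, values, set(subset)):
                found.append(frozenset(subset))
    return found


def and_gate(x, y):
    return x & y


def mux(sel, x, y):
    # sel = 0 selects x, otherwise y, as in Example 2.
    return x if sel == 0 else y


and_01 = controlling_sets(and_gate, ("x", "y"), {"x": 0, "y": 1})
and_11 = controlling_sets(and_gate, ("x", "y"), {"x": 1, "y": 1})
mux_000 = controlling_sets(mux, ("sel", "x", "y"), {"sel": 0, "x": 0, "y": 0})
mux_010 = controlling_sets(mux, ("sel", "x", "y"), {"sel": 0, "x": 1, "y": 0})
```

The enumeration is exponential in the number of free inputs, so it is only meant for small gates such as those in the examples.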
Let the arrival time of an input x in a controlling set S be $st(v, x)$. For a simple gate, it is assumed that the delay from any input to the output of the gate is the same. For a complex gate, the delay from an input to the output in general will depend upon the input under consideration. For our implementation of a 2-input multiplexor, the delay from the select input to the output is greater than the delay from a data input to the output. The delay from the input x to the output of the gate is in general given by $d(g, x)$. The delay of the controlling set S at the output of the gate g for input vector v is given by $D(g, v, S) = \max\{st(v, x) + d(g, x) \mid x \in S\}$. However, since there may be more than one controlling set for the same input vector v, the output of the gate settles down to its final value at time $st(v, g) = \min\{D(g, v, S) \mid S \text{ is a controlling set for gate } g \text{ under input } v\}$. Now we are ready to define a dominating input for complex gates. It can be shown that when the gate is a simple gate, the following definition is equivalent to the corresponding definition of a dominating input in Section 2.
Definition 5 For an input vector v, an input x to a gate g is dominating if and only if all three of the following conditions are satisfied. The input x (1) belongs to a controlling set S for g under v, (2) $D(g, v, S)$ has the minimum value over all controlling sets for gate g under vector v, and (3) x is the input in S which maximizes $st(v, x) + d(g, x)$.
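The settling time and Definition 5 translate almost directly into code. The sketch below assumes the controlling sets are already known (here, those of the multiplexor of Example 2 under (sel, x, y) = (0, 0, 0)); the arrival times are hypothetical integers chosen for illustration, and the pin delays of 2 for the select input and 1 for a data input are likewise only an assumed instance of a select-to-output delay that exceeds the data-to-output delay.

```python
def set_delay(S, arrival, pin_delay):
    """D(g, v, S): latest arrival time plus pin delay over the inputs in S."""
    return max(arrival[x] + pin_delay[x] for x in S)


def settle_time(ctrl_sets, arrival, pin_delay):
    """st(v, g): the earliest-settling controlling set fixes the output."""
    return min(set_delay(S, arrival, pin_delay) for S in ctrl_sets)


def dominating_inputs(ctrl_sets, arrival, pin_delay):
    """Inputs meeting conditions (1)-(3) of Definition 5."""
    t = settle_time(ctrl_sets, arrival, pin_delay)
    result = set()
    for S in ctrl_sets:
        if set_delay(S, arrival, pin_delay) != t:
            continue  # condition (2) fails for this controlling set
        # condition (3): inputs in S that attain the maximum
        result |= {x for x in S if arrival[x] + pin_delay[x] == t}
    return result


# Controlling sets of the mux under (sel, x, y) = (0, 0, 0).
ctrl_sets = [{"sel", "x"}, {"x", "y"}, {"sel", "x", "y"}]
pin_delay = {"sel": 2, "x": 1, "y": 1}   # assumed pin-to-output delays

# Hypothetical arrival times: select arrives early.
early_sel = {"sel": 1, "x": 1, "y": 4}
t_early = settle_time(ctrl_sets, early_sel, pin_delay)
dom_early = dominating_inputs(ctrl_sets, early_sel, pin_delay)

# Hypothetical arrival times: select arrives late, so {x, y} settles first.
late_sel = {"sel": 5, "x": 1, "y": 4}
t_late = settle_time(ctrl_sets, late_sel, pin_delay)
dom_late = dominating_inputs(ctrl_sets, late_sel, pin_delay)
```

Integer times are used so that the equality tests are exact; with floating-point delays a tolerance would be needed.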
Definition 6 A path p is sensitizable if there is at least one input vector v under which every lead on p dominates the gate to which the lead is an input.
We next define the concept of determining paths in a circuit corresponding to an input vector v. Intuitively, given the input vector v, one only needs to simulate the paths in the determining set to compute the output of the circuit.
Definition 7 The determining path set for a given input vector v to an acyclic combinational circuit is a non-empty set of paths DP(v) with the following properties:
1. Every gate whose output drives a primary output of the circuit lies on some path in DP(v).
2. For every gate g which is on some path in DP(v), there exists a controlling set of inputs S to g such that each input in S is a lead on some path in DP(v).
Definition 8 A set of paths CDP is a complete determining path set for a circuit if for all input vectors v, $DP(v) \subseteq CDP$. The complementary set NDP is the set of non-determining paths. $CDP = \bigcup_{\forall v} DP(v)$ and $NDP = \{p \mid p \text{ is a path in the circuit}\} - CDP$.
We present two examples to illustrate the concept of determining paths. We show that for every sensitizable or true path in the non-determining path set, there is always a longer path in the determining path set. This relationship, proved in Theorem 1, implies that the delay of the longest path in the determining path set is greater than or equal to the true delay of the circuit, and can be used to estimate the true delay of a circuit.
Example 3 Consider the circuit in Figure 1. It has four paths $p_1 = (i_1, g_3, f, g_4, o)$, $p_2 = (i_1, g_2, e, g_3, f, g_4, o)$, $p_3 = (i_2, g_1, c, g_2, e, g_3, f, g_4, o)$, and $p_4 = (i_2, g_1, g, g_4, o)$. We have seen that $p_2$ is the longest sensitizable path and has a delay of 3 units. Also, $p_3$ is unsensitizable.
Let $v = (i_1, i_2) = (1, 0)$. A possible determining set is $DP(v) = \{p_1, p_2, p_4\}$. For example consider gate $g_2$ on $p_2$. Input $i_1$ is on $p_1$ and qualifies to be in the controlling set S for $g_2$. Input c is on $p_3$, which is not in DP(v), and hence does not qualify to be in the set S. However, since $i_1 = 1$ and $g_2$ is an OR gate, $S = \{i_1\}$ is a controlling set. The output of $g_2$ is 1. For $g_3$, inputs d and e are on $p_1$ and $p_2$ respectively, both of which are in DP(v). Since both the inputs are 1, together they form a controlling set for the AND gate and the output of $g_3$ is 1. For $g_4$, inputs f and g are on $p_1$ and $p_4$ respectively, both of which are in DP(v). Both of them are 1 and control the output to 1. It can be shown that $DP(v) = \{p_1, p_3, p_4\}$ is another determining set for the same input vector $v = (1, 0)$.
Let $v = (i_1, i_2) = (0, 1)$. In this case $DP(v) = \{p_1\}$. Similarly, it can be shown that when $v = (0, 0)$, $DP(v) = \{p_1\}$. When $v = (1, 1)$, $DP(v) = \{p_4\}$.
Since a CDP set is a union of DP(v) sets for all possible values of input v, two possible CDP/NDP partitions for the above circuit are $CDP_1 = \{p_1, p_2, p_4\}$ and $NDP_1 = \{p_3\}$, and $CDP_2 = \{p_1, p_3, p_4\}$ and $NDP_2 = \{p_2\}$.
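The construction of the two partitions can be reproduced mechanically from the DP(v) sets listed in Example 3. The sketch below takes those sets as given rather than deriving them from the circuit of Figure 1; the `partition` helper and the string labels for the paths are illustrative names.

```python
ALL_PATHS = {"p1", "p2", "p3", "p4"}

# One determining set per input vector (i1, i2), as listed in Example 3.
dp_choice_1 = {
    (1, 0): {"p1", "p2", "p4"},
    (0, 1): {"p1"},
    (0, 0): {"p1"},
    (1, 1): {"p4"},
}

# Same as above, but with the alternative determining set for v = (1, 0).
dp_choice_2 = dict(dp_choice_1)
dp_choice_2[(1, 0)] = {"p1", "p3", "p4"}


def partition(dp_by_vector, all_paths):
    """CDP is the union of DP(v) over all v; NDP is its complement."""
    cdp = set().union(*dp_by_vector.values())
    return cdp, all_paths - cdp


cdp_1, ndp_1 = partition(dp_choice_1, ALL_PATHS)
cdp_2, ndp_2 = partition(dp_choice_2, ALL_PATHS)
```

The two results differ only because v = (1, 0) admits two determining sets, which is why the partition is not unique.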
In Figure 3 we illustrate the relationship between the CDP and NDP sets, the set of sensitizable paths SP, and the set of unsensitizable paths UP. For the $CDP_1$/$NDP_1$
In binary representation, $v = (x, y, a, b, c, d) = (10, 11, 01, 00, 01, 10)$. It can be shown that for vector v, the paths that determine the output f are $DP(v) = \{p_1, p_2, p_3, p_4, p_5\}$. In fact, it can be shown that a possible $CDP = \{p_1, p_2, p_3, p_4, p_5\}$ and $NDP = \{p_6\}$.
We next show that for a particular delay assignment, $p_6$ is sensitizable. But as observed in the previous example, when there is a sensitizable path in the NDP set, there is a longer path in CDP. In this case $p_1$ in CDP is longer than $p_6$. Consider the following delay assignment:
The (>) and (++) units have 2 units of delay, the (+) unit has 3, the delay from a data input of a mux to its output is 1 unit, and the delay from its control input to the output is 2 units.
We next prove that for every sensitizable path in an NDP set, there is a longer path in the corresponding CDP set. Thus, the delay of the longest path in CDP is greater than or equal to the delay of the longest sensitizable path, as was shown in Example 3 and Example 4. This property allows us to estimate the true delay of the circuit from the paths in CDP without doing path sensitization. The results proved in Theorem 1 and Corollary 1 confirm the observations in Example 3 and Example 4. The results imply the following. If there is a sensitizable path in the set NDP, there will always be a longer path in the set CDP. Hence the delay of the longest path in CDP is a correct clock period for the circuit (Definition 3). However, since CDP might have unsensitizable paths, if the longest path in CDP happens to be unsensitizable, then the delay determined from the longest path in CDP would lead to a pessimistic (but not incorrect) clock period.
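The bound described above can be summarized as a chain of inequalities. The notation is shorthand introduced here for illustration: $d(p)$ denotes the topological delay of path $p$, $SP$ is the set of sensitizable paths, and $P$ is the set of all paths in the circuit.

```latex
\[
  \underbrace{\max_{p \in SP} d(p)}_{\text{true delay}}
  \;\le\;
  \underbrace{\max_{p \in CDP} d(p)}_{\text{estimate from } CDP}
  \;\le\;
  \underbrace{\max_{p \in P} d(p)}_{\text{topological delay}}
\]
```

The first inequality holds because a sensitizable path is either in CDP itself or, if it is in NDP, is dominated by a longer path in CDP; the second holds because CDP is a subset of all paths. The estimate equals the topological delay only for the trivial partition or when the longest path of the circuit happens to lie in CDP.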
We show in Section 6 that resource sharing in high-level synthesis creates paths which are not required to determine the output of the circuit, and hence are in NDP. We use the knowledge of the source of NDP paths to give an efficient method for identifying a CDP set in Section 7. The set CDP can be identified by using high-level information without explicitly checking the path determining properties of its paths. A brief outline of the proof that the CDP set thus created is indeed a complete determining path set is also given in Section 7. If it is assumed that the only false paths created are due to resource sharing, to compute an estimate of the true delay of the resulting circuit we just have to measure the topological delay of the CDP paths. The significance of the method is that it eliminates the need for checking whether long paths in the circuit can be sensitized, a process which makes gate-level timing analysis computationally expensive and impractical for data-path intensive circuits, as illustrated by experimental results.
Sources of Paths Which Do Not Affect Clock Period
In this section, we provide a comprehensive analysis of sources of paths which do not determine the output of the circuit, and hence are in the set NDP. Many of these NDP paths are false. However, we know from Section 5 that even if an NDP path is sensitizable, there is a longer path in the set CDP. Hence the delay of the longest CDP path is a correct clock period. In this section, we in particular analyze the different ways in which sharing a resource can create paths in NDP. The analysis is used in Section 7 to develop an efficient algorithm for computing the CDP set without doing path sensitization.
False Paths in CFG. A path in a CFG may contain one or more conditional nodes. A CFG path is false iff the logical AND of all the conditions along the path is false for all possible values of the variables of the conditionals [16]. False paths in the CFG may give rise to NDP paths in a circuit implementation. A detailed analysis of such false paths in a CFG and an algorithm for their removal is given in [16]. We assume that false paths in the CFG have been eliminated at the beginning of the high level synthesis process.
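The definition of a false CFG path can be illustrated with a brute-force check over small finite variable domains. This is a toy sketch under stated assumptions, not the algorithm of [16]: the conditions, the variable `a` and its domain are hypothetical, and exhaustive enumeration is practical only for tiny domains.

```python
from itertools import product


def is_false_path(conditions, domains):
    """A CFG path is false iff no assignment of the variables satisfies
    every condition along the path."""
    names = list(domains)
    for values in product(*(domains[n] for n in names)):
        env = dict(zip(names, values))
        if all(cond(env) for cond in conditions):
            return False  # found a witness: the path can be executed
    return True


# Hypothetical 3-bit variable and two hypothetical CFG paths.
domains = {"a": range(0, 8)}

# Path taking the branches (a > 5) and then (a < 3): contradictory.
contradictory = [lambda e: e["a"] > 5, lambda e: e["a"] < 3]

# Path taking the branches (a > 2) and then (a < 5): satisfiable by a = 3.
satisfiable = [lambda e: e["a"] > 2, lambda e: e["a"] < 5]

first_is_false = is_false_path(contradictory, domains)
second_is_false = is_false_path(satisfiable, domains)
```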
Functional False Paths. The concept of functional false paths is presented in [15]. Consider an implementation of a behavioral description which has two ALUs such that the output of one ALU is an input to the other ALU, as in the circuit shown in Figure 5(c). Both ALUs can implement the operations + and −. However, in the behavioral description, no + operation ever uses the output of a previous − operation. In such a case, the functionality + of ALU1 is never simultaneously used with the functionality − of ALU2. The paths corresponding to a simultaneous use of + on ALU1 and − on ALU2 are functional false paths. Some functional false paths can be detected only when considering the sequential behavior of the circuit. Such paths are called sequential functional false paths. Functional false paths belong to the set NDP and can never affect the clock period of the implementation. Both kinds of functional false paths are discussed in [15], along with a technique for timing analysis for clock period determination which ignores the functional false paths. Our delay estimation algorithm can compute the true delay even in the presence of functional false paths, as briefly explained in Section 9.1.
Non-Determining Paths due to Resource Sharing
In [21] we identify the sources of long paths due to resource sharing in circuits implementing behavioral descriptions. Long paths can be created when operations share the same resource across mutually exclusive paths in the same state, or when operations in different states share resources explicitly or implicitly. Some of the long paths created by sharing of resources are not executed completely in the same clock cycle. These paths are not required to determine the output of the circuit and belong to an NDP set which we call NDP_R.
Sharing Mutually Exclusive Operations. For illustration consider the CFG given in Figure 2(a). The schedule and assignment are given in Figure 2(b). The circuit implementation is given in Figure 2(c). There is a path in the circuit from the (+) unit to which operation 2 has been assigned to the (<) unit to which operation 3 has been assigned, since operation 3 uses the output of operation 2. Operations 6 and 7 are assigned to the (+/−) unit. The (<) unit to which operation 5 has been assigned decides whether operation 6 or 7 should be executed by the ALU. Hence there is a path from the (<) unit to the ALU in the circuit. Since [...]
Implicit Sharing across States. Figure 5 shows the CFG, schedule, assignment and a circuit implementation for the UAV benchmark [22]. Note that operation 6 is scheduled in both state s_0 and state s_2. Assigning both occurrences of operation 6 to the same resource unit, ALU2, is termed implicit sharing across states. In the corresponding implementation shown in Figure 5(c), a path (MaxVal, ALU1, ALU2, cmp2, RTI) is created. The part (MaxVal, ALU1, ALU2) is exercised in state s_0 and corresponds to execution of operations 2 and 6 in s_0. In state s_2 the part (ALU2, cmp2, RTI) is exercised and corresponds to execution of operations 6 and 7 in state s_2. Since these two parts are never exercised simultaneously, the complete path is never executed in a clock cycle and is an NDP path. This NDP path is a result of the implicit sharing of operation 6 across states.
[Figure 5, assignment fragment: state s_1: op7 --> cmp2; state s_2: op6 --> alu2, op7 --> cmp2.]
7 Identifying the Set of Determining Paths CDP_R
A CDP and NDP partition is not necessarily unique. In this section, we develop an efficient technique to identify one possible CDP and NDP partition using the information about scheduling and resource sharing available during the high level design process. The sets in the partition identified by our method are named CDP_R and NDP_R. The identification of the paths in CDP_R is done without explicitly sensitizing the paths as done at the gate level. The paths we consider include both the control and data path of the circuit. After identifying the set CDP_R, a simple topological delay analysis on the paths in CDP_R gives the true delay of the circuit.
To generate the set CDP_R, we first define an alternative representation of the schedule (SchedG) in Section 7.1. Since resource sharing results in NDP_R paths, we first create an abstract implementation of the SchedG without doing resource sharing. The implementation is the sensitizable path graph SensPG, which is described in Section 7.2. Since resource sharing has not been done, every path in the SensPG should belong to the CDP_R set of the SensPG. To estimate the true delay without actually implementing the design, we next introduce in Section 7.3 an abstraction of an RT-level implementation with resource sharing, which we call the delay graph or DelayG. Using the paths in the SensPG, the corresponding set of paths for the DelayG is generated. It is shown that this set is a CDP_R set. The paths introduced in the DelayG which do not have corresponding paths in the SensPG are in the set NDP_R. The set CDP_R defined on the DelayG is implementation independent.
A set of rules exists for the derivation of the SchedG, the SensPG, the DelayG and the RT implementation. Instead of enumerating the rules, we illustrate how to construct the graphs through examples.
SchedG - The Schedule Graph
A schedule can be represented in more than one way. We define one representation, called the schedule graph (SchedG). The SchedG corresponding to the schedule in Figure 2(b) is shown in Figure 6. The SchedG explicitly incorporates the state variable and the state transitions (nodes 0, 9, 10, 11, 12, 13), along with the operations to be executed in each state of the schedule.
Consider execution of the path (1, 5, 7, 8) in the CFG of Figure 2 [...] last node. Also, there are no parallel constructs in the SchedG. Hence, if there is a node in the SchedG with more than one successor, the execution of the node is followed by execution of only one of the successors. The semantics of the schedule dictate that in every clock cycle, a subset of the operations between the first node and the last node have to be executed. In Definition 9, we define exactly which of the SchedG operations are executed in a clock cycle.
Definition 9 Let n_1 and n_2 be two nodes in the SchedG where n_1 is the predecessor of n_2. If n_1 is a conditional and n_2 is on the true (false) branch of n_1, then n_2 is enabled if the value of the conditional is 1 (0). If n_1 is a CASE node, then n_2 is defined to be enabled if the value of the expression in the CASE node is the same as the value on the arc from n_1 to n_2. Finally, if n_1 is none of the above, then n_2 is defined to be enabled if n_1 is enabled. The first node of the SchedG is always enabled. In a clock cycle, the operations in every node that is enabled are executed, and the operation in a node is executed only once.
A path in the SchedG is any sequence of nodes (n_1, n_2, ..., n_k) such that there is an arc from n_i to n_{i+1} in the SchedG, $0 < i \le k - 1$. Node n_1 is the first node of the SchedG, which in our example is node 0, and n_k is the last node, which in our example is node 14. A path in the SchedG is executed if every node on it is enabled. The operations on the executed path determine the new values of the SchedG variables.
Lemma 1 In any clock cycle, only one path in the SchedG is executed.
Proof: Let us assume that in one clock cycle, two SchedG paths p_1 and p_2 are simultaneously executed. It should be noted that by definition, all SchedG paths have the same first node, which is the first node of the SchedG. There must be one operation which is on both paths p_1 and p_2, such that the operation has at least two different successors op_1 and op_2, where op_1 lies on p_1 and op_2 lies on p_2. Otherwise p_1 and p_2 would not be different paths as assumed.
From Definition 9, in any clock cycle, only one successor of a node can be enabled. Hence only one of op_1 and op_2 can be enabled in the same clock cycle. Since a path is executed only if every node on it is enabled, only one of p_1 or p_2 can be executed in the same clock cycle.
If the paths are executed in sequence, such that the operations on p_1 are executed followed by the operations on p_2 in the same clock cycle, then the first operation of the SchedG, which is the first operation of both paths, will be executed twice in the same clock cycle. Since Definition 9 states that the same operation cannot be executed twice in the same clock cycle, the paths cannot be executed in sequence in the same clock cycle. □
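Definition 9 and Lemma 1 can be exercised on a toy schedule graph. The sketch below is illustrative only: the graph, its node numbering and the function names are hypothetical (they are not the SchedG of Figure 6), and it assumes that a successor of a conditional or CASE node is enabled only when that predecessor is itself enabled.

```python
def enabled_nodes(graph, first, values):
    """Compute the enabled nodes of a schedule graph for one clock cycle.

    graph: node -> (kind, [(successor, arc_label), ...]) where kind is
           'cond', 'case' or 'plain'; arc_label is True/False for 'cond',
           a case value for 'case', and None for 'plain'.
    values: node -> value of the conditional or CASE expression in that node.
    """
    enabled = {first}            # the first node is always enabled
    worklist = [first]
    while worklist:
        n = worklist.pop()
        kind, arcs = graph[n]
        for succ, label in arcs:
            taken = kind == "plain" or values[n] == label
            if taken and succ not in enabled:
                enabled.add(succ)
                worklist.append(succ)
    return enabled

def all_paths(graph, node, last):
    """Enumerate every path from node to the last node of an acyclic graph."""
    if node == last:
        return [[node]]
    return [[node] + rest
            for succ, _ in graph[node][1]
            for rest in all_paths(graph, succ, last)]

# Hypothetical graph: node 0 is a CASE on the state, node 1 a conditional.
graph = {
    0: ("case", [(1, "s0"), (4, "s1")]),
    1: ("cond", [(2, True), (3, False)]),
    2: ("plain", [(5, None)]),
    3: ("plain", [(5, None)]),
    4: ("plain", [(5, None)]),
    5: ("plain", []),
}
values = {0: "s0", 1: True}

enabled = enabled_nodes(graph, 0, values)
executed = [p for p in all_paths(graph, 0, 5) if set(p) <= enabled]
print(enabled, executed)
```

For any choice of `values`, exactly one of the three paths of this toy graph has all of its nodes enabled, which is the behavior Lemma 1 states.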
It should be noted that though the SchedG is sequential, if there is no dependency between two operations on the same path in a SchedG, then in a circuit implementation of the SchedG the operations may be executed in parallel in the same clock cycle. For example, in state s_0 of [...]
The sensitizable paths graph, SensPG, represents the control and data dependencies between the variables and the operations in the SchedG. The dependencies give rise to paths in the SensPG. The SensPG is an abstraction of an implementation of the SchedG where resource sharing has not been done yet. Corresponding to each path in the SensPG, there exist one or more paths in the circuit implementation. As we show later, these are the only paths in the implementation whose delays have to be determined for clock period calculation. The SensPG can be constructed from the SchedG and the assignment. We explain how to construct the SensPG in Figure 7 from the SchedG and assignment both shown in Figure 2(b).
We explain creation of the nodes of the SensPG first. Consider the nodes on any path p in the SchedG.
(1) If the variable y is being used in the node n_1 on p, create a node r_y in the SensPG. Consider path p = (0, 1, 2, 3, 9, 14) in the SchedG and let n_1 be operation 1. Nodes r_PresentSuit and r_NoSuit are created in the SensPG for the variables PresentSuit and NoSuit.
(2) For every variable v that is assigned inside a node on path p, create a node R_v in the SensPG. For example, variable Card is assigned inside node 2 on p and hence there is a node R_Card in the SensPG.
(3) For every operation op in node n_1 on p, create an operation node (op)_{n_1} in the SensPG. Node 3 in p has the operation (<), for which there is a corresponding node (<)_3.
There are two principal types of arcs in the SensPG. The data dependency arcs are shown as solid lines, the control dependency arcs as dashed lines. We first explain creation of the data dependency arcs. Data dependency arcs arise due to data dependencies between operations in the SchedG.
(4) Consider the (+) operation in node 2 of the SchedG. It has inputs Card and Incr. Hence, in the SensPG, there are data dependency arcs from r_Card to (+)_2 and from r_Incr to (+)_2. Since Card is being assigned to in operation 2 and on p there is no further assignment to Card, there is a data dependency arc from (+)_2 to R_Card.
(5) If there is a control dependency arc between two nodes n_1 and n_2 in a SensPG, then n_1 is a conditional node (including CASE nodes) and n_2 is a node with multiple data dependency arcs such that the output of n_1 decides which data should be an input to n_2. To create these arcs, we need to know the assignment.
(6) Operations 3 and 5 have been assigned to the same comparator, a (<) unit. Since the two operations are on mutually exclusive branches of operation 1, operation 1 decides which operation, 3 or 5, should be executed. In the SensPG, the nodes corresponding to operations 1, 3 and 5 are (!=)_1, (<)_3 and (>)_5. Hence in the SensPG there is a control dependency arc from (!=)_1 to both (<)_3 and (>)_5.
(7) The variable Card is assigned from various sources, as can be seen from nodes 2, 4, 6 and 7 in the SchedG. Along the path (0, 8, 13, 14) in the SchedG, Card is not assigned and remains unchanged. Given the assignment, it can be shown that the conditionals in nodes 0, 1 and 3 of the SchedG decide the source from which Card is assigned. Hence in the SensPG, there are arcs from the (CASE state of) node, the (!=)_1 node and the (<)_3 node to R_Card. Note that there is no control dependency arc from (>)_5 to R_Card even though Card is assigned to on the mutually exclusive branches of operation 5. The reason is that both operations 6 and 7 are assigned to the same ALU. Hence the source of the input is the same for Card irrespective of the result of the conditional operation 5.
Paths in the SensPG corresponding to a path in the SchedG
Given a path p in the SchedG, there exist one or more corresponding paths in the SensPG.
Two of the paths created in the SensPG corresponding to path p = (0, 1, 2, 3, 9) in the SchedG are (r_State, CASE state of, (+)_2, (<)_3, R_Card) and (r_PresentSuit, (!=)_1, (<)_3, R_Card). We explain how the first of the above two paths in the SensPG corresponds to p. Since the CASE operation in node 0 of the SchedG decides whether operation (+)_2 should be executed, this creates the subpath (r_State, CASE state of, (+)_2). Since the operation in node 3 uses the result of the operation in node 2 of the SchedG, there is an arc from (+)_2 to (<)_3 in the SensPG. Also, Card is assigned the output of the '+' operation in node 2 only if the result of the '<' comparison in node 3 is true. Hence there is a control arc from (<)_3 to R_Card in the SensPG.
Let p be any path in the SchedG. The set of paths corresponding to p in the SensPG is denoted by SensPGmap(p), where SensPGmap(p) = {paths in the SensPG which correspond to the path p in the SchedG}.
Lemma 2 In any clock period, let p be the path in the SchedG which is executed for some vector v. Then the paths SensPGmap(p) in the SensPG form a determining path set for vector v.
Proof: Follows from Lemma 1 and the definition of the set SensPGmap. □
7.3 DelayG - The Delay Graph
The delay graph DelayG is derived from the SensPG and the assignment by allowing resource sharing as required by the assignment. It closely resembles a circuit implementation of the behavioral description while hiding the actual implementation details. For example, the DelayG shown in Figure 8 corresponds to the assignment in Figure 2 and the SensPG in Figure 7. Note that the actual circuit implementation of the DelayG is given in Figure 2(c). We prove some key results based on the DelayG, thus making the results independent of the underlying implementation details. Also, since the DelayG is used for the timing estimation, we can make the estimation as accurate as required by controlling the level of implementation detail. In the following discussion, we refer to the DelayG shown in Figure 8.
Every node in the SensPG is mapped to a corresponding node in the DelayG. Every node of type r_x, R_y and CASE in the SensPG has an identical node in the DelayG. The assignment decides the mapping of the operation nodes in the SensPG to the resource unit nodes in the DelayG. For example, (<)_3 and (<)_5 in the SensPG are operation nodes that correspond to operations 3 and 5 in the SchedG. From the assignment, we see that operations 3 and 5 are both assigned to the (<) resource unit. Hence, both these nodes are mapped to the (<) node in the DelayG.
Every resource node and register node of type R_x in the DelayG has a Mux node at each data input. Mux nodes have multiple data inputs, but only a single output. For example, the (<) node has M_4 and M_5 at its two data inputs. For every Mux node, there is a control node, whose output goes to the corresponding Mux node. Control nodes C_4 and C_5 correspond to M_4 and M_5 in our example.
At any time, the Mux node allows only one of its data inputs to be connected to its data output. The decision as to which of its data inputs should be connected to the output is made by the outputs of the control node. Consider nodes C_3 and M_3 corresponding to R_Card. Node M_3 has a data dependency input from the (+) node. This corresponds to the data dependency arc from (+)_2 to R_Card in the SensPG. It can be seen from the SchedG that this assignment is made in state s_0 if the conditional (PresentSuit != NoSuit) evaluates to true. The node C_3 has control dependency arcs from the nodes (CASE state) and (!=), and when the former takes on the value s_0 and the latter takes on the value true, the input from (+) to M_3 is connected to the output of M_3.
Paths in the DelayG corresponding to a path in the SensPG
The mapping function DelayGmap maps SensPG paths to the corresponding paths in the DelayG. More formally, DelayGmap({p | p is a path in the SensPG}) = {paths in the DelayG which correspond to path p in the SensPG}. Consider [...]
The set CDP_R consists of determining paths in the DelayG while the set NDP_R consists of non-determining paths. The latter set arises due to sharing of the same resource amongst multiple SensPG operations, as explained in Section 6.1 and Section 7.3.1. From Lemma 4 and Corollary 1, if p is the longest path in CDP_R, then $\text{topological delay} \ge \text{delay}(p) \ge \text{true delay}$. Hence, the true delay of the circuit can be estimated from the delay of the longest path in CDP_R.
Note that depending upon the delay model used to calculate the delays of the paths, the values of the topological delay, the true delay, and the true delay estimate, delay(p), might change. However, for any given delay assignment, the relationship $\text{topological delay} \ge \text{delay}(p) \ge \text{true delay}$ holds irrespective of the delay values of the individual terms in the relationship.
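The way the CDP_R / NDP_R partition falls out of the path mapping can be sketched on a toy example. Everything named here is an assumption made for illustration: the SensPG paths are hypothetical and only loosely modelled on the implicit-sharing example of Section 6, Mux and control nodes are omitted, and the DelayG is approximated as the union of the arcs of the mapped paths.

```python
def map_path(path, assignment):
    """Map a SensPG path to a DelayG path: operation nodes are replaced by
    the resource unit they are assigned to, other nodes are kept as they are."""
    return tuple(assignment.get(n, n) for n in path)

def delay_graph_arcs(sens_paths, assignment):
    """Build the arcs of a simplified DelayG from the mapped SensPG paths."""
    arcs = {}
    for p in sens_paths:
        q = map_path(p, assignment)
        for a, b in zip(q, q[1:]):
            arcs.setdefault(a, set()).add(b)
    return arcs

def enumerate_paths(arcs, node):
    """Enumerate all paths starting at node in an acyclic graph."""
    succs = arcs.get(node)
    if not succs:
        return [(node,)]
    return [(node,) + rest
            for s in sorted(succs)
            for rest in enumerate_paths(arcs, s)]

def partition(sens_paths, assignment):
    """Return (CDP_R, NDP_R) for the simplified DelayG."""
    arcs = delay_graph_arcs(sens_paths, assignment)
    sources = sorted({p[0] for p in sens_paths})
    all_delayg_paths = {q for s in sources for q in enumerate_paths(arcs, s)}
    cdp_r = {map_path(p, assignment) for p in sens_paths}
    return cdp_r, all_delayg_paths - cdp_r

# Hypothetical SensPG paths: operation 6 occurs in two states and both
# occurrences are assigned to ALU2 (implicit sharing across states).
sens_paths = [
    ("r_a", "op2", "op6_s0", "R_t"),    # exercised in state s0
    ("r_t", "op6_s2", "op7", "R_out"),  # exercised in state s2
]
assignment = {"op2": "ALU1", "op6_s0": "ALU2", "op6_s2": "ALU2", "op7": "cmp2"}

cdp_r, ndp_r = partition(sens_paths, assignment)
print(sorted(cdp_r))
print(sorted(ndp_r))
```

In this toy case the shared ALU2 node stitches the two state-specific paths together, so the long path (r_a, ALU1, ALU2, cmp2, R_out) appears in the DelayG without a SensPG counterpart and lands in NDP_R.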
The delay of the paths in CDP_R can be calculated in two ways: (i) developing and using RT-level delay models for the various components of the delay graph, or (ii) mapping the paths in CDP_R to a set of paths in the corresponding gate-level or transistor-level implementation, and using more accurate delay models, including wiring delay. Since the intended use of the proposed true delay estimation is in an iterative high-level synthesis framework, we take the first approach, which leads to fast estimation at the delay graph level, while making the true delay estimate as accurate and close to the gate-level true delay as possible. The delay calculation process for paths in the delay graph is explained in the next section.
Delay Graph Implementation and Delay Estimation
A path in a DelayG starts at a node of the type r_x, ends at a node R_y, and consists of intermediate nodes. The intermediate nodes can be mux nodes, control nodes or resource unit nodes. Every such path in the DelayG has corresponding paths in an implementation of the DelayG. Given a path in the DelayG, our delay estimator function topo_delay_est computes the delay of the path, such that it is an accurate estimate of the delay of the corresponding path in the implementation. To estimate the delay of a path, the function topo_delay_est uses knowledge of the implementation of the nodes in the DelayG. The nodes on a path in the DelayG can be divided into the following categories.
The first category consists of nodes which implement functions, for example adders, ALUs or comparators, and are mapped to standard RTL library units [23]. The delays of individual resource units as well as the delays of cascades of resource units are precomputed and stored in a table, as in Table 1, for fast lookup. Note that for the (<, =, >) unit, the delay depends upon the function. The delay of a cascade of resource units may be less than the sum of the delays of the individual units [12]. For example, from Table 1 we find that the delay of an 8-bit ALU unit is 16.03 ns. However, the delay of a cascade of two ALU units is 20.41 ns and not 32.06 ns as might be expected. If the path has a cascade of resource units, the delay of the cascade is used rather than the delays of the individual components.
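The cascade lookup described above might be organized as follows. This is a minimal sketch: only the two 8-bit ALU figures quoted in the text are used, the table layout and the function name `resource_chain_delay` are hypothetical, and falling back to the sum of the unit delays when no cascade entry exists is an assumption of this sketch rather than something stated in the paper.

```python
# Delays in ns, keyed by (unit, bitwidth); only values quoted in the text.
UNIT_DELAY_NS = {("alu", 8): 16.03}
# Precomputed cascade delays in ns, keyed by (tuple of units, bitwidth).
CASCADE_DELAY_NS = {(("alu", "alu"), 8): 20.41}

def resource_chain_delay(units, bitwidth):
    """Delay of consecutive resource units on a path.

    Prefers a precomputed cascade entry; otherwise (assumption) sums the
    delays of the individual units.
    """
    key = (tuple(units), bitwidth)
    if key in CASCADE_DELAY_NS:
        return CASCADE_DELAY_NS[key]
    return sum(UNIT_DELAY_NS[(u, bitwidth)] for u in units)

print(resource_chain_delay(["alu"], 8))
print(resource_chain_delay(["alu", "alu"], 8))
```

Using the cascade entry avoids overestimating the two-ALU chain by the 11.65 ns difference between 32.06 ns and 20.41 ns.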
The second category consists of n-input muxes. An n-input mux is implemented as a balanced tree of 2-input muxes. Let the delay from the control input to the output of a 2-input mux be $d(\mathit{mux}, \mathit{cntrl}, \mathit{data\_bitwidth})$. From Table 1, this delay for the data bitwidths 4 and 8 is 2.95 ns and 3.41 ns, respectively. Also, let the delay from the data input to the output of a 2-input mux be $d(\mathit{mux}, \mathit{data}, \mathit{data\_bitwidth})$, which for bitwidths 4 and 8 is 1.94 ns and 2.0 ns for the library modules we use. The delay from the control input to the output of an n-input mux, assuming there are no common data inputs, is given by $d(\mathit{mux}, \mathit{cntrl}, \mathit{data\_bitwidth}) + d(\mathit{mux}, \mathit{data}, \mathit{data\_bitwidth}) \cdot \lceil \log(n) - 1 \rceil$, while the delay from the data input to the output of an n-input mux is given by $d(\mathit{mux}, \mathit{data}, \mathit{data\_bitwidth}) \cdot \lceil \log(n) \rceil$.
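As a worked instance of the two mux formulas, the sketch below evaluates them with the 2-input mux delays quoted from Table 1. The function names are hypothetical, and the logarithm is taken to base 2, which is an assumption that follows from the balanced tree of 2-input muxes; the formulas are meaningful for n of at least 2.

```python
import math

# 2-input mux delays in ns, keyed by data bitwidth (values quoted in the text).
D_CNTRL_NS = {4: 2.95, 8: 3.41}  # control input to output
D_DATA_NS = {4: 1.94, 8: 2.0}    # data input to output

def mux_control_delay(n, bitwidth):
    """Control-to-output delay of an n-input mux built as a balanced tree."""
    if n < 2:
        raise ValueError("an n-input mux needs n >= 2")
    levels_after_select = math.ceil(math.log2(n) - 1)
    return D_CNTRL_NS[bitwidth] + D_DATA_NS[bitwidth] * levels_after_select

def mux_data_delay(n, bitwidth):
    """Data-to-output delay of an n-input mux built as a balanced tree."""
    if n < 2:
        raise ValueError("an n-input mux needs n >= 2")
    return D_DATA_NS[bitwidth] * math.ceil(math.log2(n))

# 4-input, 8-bit mux: two levels of 2-input muxes.
print(mux_control_delay(4, 8))  # 3.41 + 2.0 * 1
print(mux_data_delay(4, 8))     # 2.0 * 2
```

For a 2-input mux the expressions reduce to the library values themselves, which is a quick sanity check on the reconstruction of the formulas.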
The third category consists of the control logic which controls selection of the Mux inputs or which selects functions for multi-function ALUs. In the DelayG of Figure 8, c_7 and c_10 are control nodes, where the latter selects an ALU function. The delay estimator assumes that the paths in each of the control logic nodes will be well-balanced by subsequent logic optimization techniques and mapping tools. The delay due to each block is estimated as the number of levels of logic in each block multiplied by a factor $\delta$, where $\delta$ is the average delay per level of logic. Let there be m inputs to a node of control logic, which controls a Mux node with n unique data inputs. To select each input, we assume that the logic will have delay given by $\delta \lceil \log(m) \rceil$. Since a Mux select signal might be required to select as many as n/2 of the data inputs, there might be a further $\delta \lceil \log(n/2) \rceil$ delay in the control logic. Hence the total delay of the control logic is the sum of the above two delays.
The Case State of logic block is implemented as a decoder and has delay given by $\delta \lceil \log(m) \rceil$, where m is the bit size of the state register. The r_x and R_y nodes of the DelayG are implemented as registers. There is a single register x for the two nodes r_x and R_x. For example, r_Card [...]
For every node that is on a path in the DelayG, the delay of the node as estimated above is added to the delay of the path. We ignore the effect of fanout. This is because the subsequent technology mapping stage inserts buffers to minimize delay due to fanout load. Note that if logic synthesis, including technology mapping, is applied subsequent to high-level synthesis, the structures of the RT-level circuit, such as multiplexors and functional units, may not be maintained in the optimized gate-level circuit. Thus, the resulting gate-level circuit can have path delays different from the estimates obtained by the proposed approach. However, if logic synthesis is applied in a performance-driven mode with the true delay estimate as a constraint, the synthesis steps can only improve the delay of the circuit. Hence the true delay estimate serves as an upper bound of the gate-level true delay, as demonstrated by the experimental results in Section 10. Thus, the proposed true delay estimation technique can be used to explore the design space during high-level synthesis to reliably indicate whether a candidate implementation will meet the given delay constraint.
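The control-logic and decoder estimates can be written out in the same style. This is a sketch under assumptions: the parameter name `delta_ns` stands for the average delay per level of logic and is a name chosen here, its value of 1.0 ns in the example is made up, and the logarithms are taken to base 2.

```python
import math

def control_logic_delay(m, n, delta_ns):
    """Estimated delay of a control node with m inputs driving a Mux node
    with n unique data inputs (requires m >= 1 and n >= 2)."""
    select_one = delta_ns * math.ceil(math.log2(m))       # select each input
    combine = delta_ns * math.ceil(math.log2(n / 2))      # up to n/2 inputs per select
    return select_one + combine

def case_decoder_delay(state_bits, delta_ns):
    """Estimated delay of the CASE-state decoder for a state register of
    the given bit size."""
    return delta_ns * math.ceil(math.log2(state_bits))

# Hypothetical numbers: 4 control inputs, 8 mux data inputs, 1.0 ns per level.
print(control_logic_delay(4, 8, 1.0))  # 2 levels + 2 levels
print(case_decoder_delay(2, 1.0))      # 1 level
```

Because both terms count logic levels, the estimate grows only logarithmically with the number of control inputs and mux data inputs.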
Algorithm for Clock Period Estimation
The clock period estimation algorithm, FEST, finds an estimate of the correct clock period of the circuit implementing a behavioral description. The inputs to the algorithm are the schedule of the behavior in the form of a schedule graph, SchedG, an assignment of the operations in the SchedG to resource units, and a component library such as the one shown in Table 1.
The algorithm first creates the SensPG and the DelayG as outlined in Section 7. It next creates CDP_R, the set of paths in the DelayG which have corresponding paths in the SensPG. The algorithm takes every path which is in CDP_R and finds an estimate of the path delay using the function topo_delay_est. The maximum of the delays of the paths in CDP_R is an estimate of the minimum clock period of a circuit implementation of the DelayG.
The function topo_delay_est estimates the delay of the implementation of a path in the DelayG. In the previous section, we defined our implementation of the DelayG and also gave a brief outline of the process of estimating the delay of a path in the DelayG. An outline of the algorithm FEST is given below.
FEST(SchedG, assignment, library) { [...] }
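Since the body of the outline is not available here, the final step described in the text (take the maximum estimated delay over the paths in CDP_R) can be sketched as follows. This is not the authors' implementation: construction of the SensPG, the DelayG and CDP_R is assumed to have been done already, the node delays are made-up round numbers, and `topo_delay_est` here is a simplified stand-in that just sums per-node delays.

```python
# Hypothetical per-node delays in ns for a toy DelayG (registers cost 0 here).
NODE_DELAY_NS = {
    "r_a": 0.0, "r_t": 0.0, "R_t": 0.0, "R_out": 0.0,
    "ALU1": 10.0, "ALU2": 10.0, "cmp2": 8.0,
}

def topo_delay_est(path):
    """Simplified stand-in: sum of the estimated delays of the nodes on a path."""
    return sum(NODE_DELAY_NS[node] for node in path)

def fest(cdp_r_paths, delay_est):
    """Return (clock period estimate, critical path) over the CDP_R paths."""
    if not cdp_r_paths:
        raise ValueError("CDP_R must contain at least one path")
    critical = max(cdp_r_paths, key=delay_est)
    return delay_est(critical), critical

# Toy CDP_R and NDP_R sets (hypothetical, for illustration only).
CDP_R = [("r_a", "ALU1", "ALU2", "R_t"), ("r_t", "ALU2", "cmp2", "R_out")]
NDP_R = [("r_a", "ALU1", "ALU2", "cmp2", "R_out"), ("r_t", "ALU2", "R_t")]

true_est, critical = fest(CDP_R, topo_delay_est)
topo_est, _ = fest(CDP_R + NDP_R, topo_delay_est)
print(true_est, critical)
print(topo_est)
```

Running the same maximum over CDP_R together with NDP_R gives the topological estimate, which in this toy case is larger because the longest path is a non-determining one.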
Experimental Results
We have implemented the high-level timing analysis tool FEST in C. To evaluate the effectiveness of FEST, we synthesize the following conditional-intensive VHDL descriptions: the dealer process of Blackjack [19], the controller for the AutoPilot of an Unmanned Aerial Vehicle (UAV) [22] and a part of the Vender example from [24]. Table 2 shows the estimation results. Each description is scheduled to satisfy the resource constraints specified in Figure 2 for the dealer process and in Figure 5 for the UAV example. The relevant portions of the CFG for the dealer process are shown in Figure 2, and the mapping of the CFG operations under the assignment is shown in Figure 2(c). The relevant portions of the CFG and the mapping for the UAV AutoPilot process are shown in Figure 5. A netlist is generated for each RT-level circuit using OASIS [25]. The generated netlists are subjected to technology-dependent delay optimization, including fanout optimization, using the SIS technology mapper [26] and the lib2.genlib standard cell SCMOS 2.0 library [27]. The gate count for each circuit as reported by SIS is given under the column Gates. Table 2 shows the results of topological delay (Top Delay) and true delay (True Delay) estimation. To estimate the true delay, FEST considers paths only in CDP_R. To estimate the topological delay, FEST considers the delays of paths in both CDP_R and NDP_R.
The number of paths in each set is given under the column CDP_R/NDP_R. It can be seen that approximately 50% of the RT-level paths are in the set NDP_R. The results highlight that the CDP_R/NDP_R partition we compute is non-trivial. It allows immediate elimination of 50% of the paths in the circuit from consideration in the true delay estimation procedure. The partitioning is especially useful since the NDP_R partition contains long false paths. This is illustrated by comparing the topological delay estimates with the true delay estimates. As an example, the topological delay of the 8-bit UAV example computed by FEST is 50.9 ns while the corresponding true delay estimate is 41.5 ns. On average, the true delay estimate is 15% smaller than the topological delay estimate.
To establish the accuracy of FEST, the actual topological delay was computed by SIS on the gate-level circuit, and is reported under SIS in the column Top Delay. The true delay computed by the gate-level tool gate-TA [5] on the technology-mapped netlist is reported under gate-TA in the column True Delay.
Consider the case of the 4-bit implementation of the UAV example: FEST estimates the topological delay to be 34.6 ns while the actual value computed by SIS is 31.6 ns. The true delay is estimated by FEST to be 28.7 ns while the gate-level timing analyzer computes a value of 28.0 ns. The data in Table 2 show that topological delay can be pessimistic at both the gate level and the high level. They also show that the high-level true delay estimates obtained by FEST compare well with the actual true delay of the gate-level implementation, being within 2.5% to 17% of the gate-level true delay.
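The lower end of the quoted accuracy range can be reproduced from the 4-bit UAV numbers given above. The helper name `relative_error_pct` is hypothetical, and only figures stated in the text are used.

```python
def relative_error_pct(estimate, reference):
    """Signed relative error of an estimate with respect to a reference, in percent."""
    return 100.0 * (estimate - reference) / reference

# 4-bit UAV example, delays in ns (values quoted in the text).
true_err = relative_error_pct(28.7, 28.0)  # FEST vs gate-TA true delay
topo_err = relative_error_pct(34.6, 31.6)  # FEST vs SIS topological delay

print(round(true_err, 1))
print(round(topo_err, 1))
```

Both errors are positive, that is, the high-level estimates lie above the gate-level values for this example.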
The CPU times in seconds taken by FEST and gate-TA on a SPARC-10 are reported under the column cpu. The gate level timing analyzer failed to compute the true delay of the 8 bit circuits for the dealer example even after 24 hours. For the vender, it required more than 12 hours. On the other hand, for all the cases, FEST returned an accurate estimate of the true delay in less than two seconds.
Applications in Efficient Resource Allocation
The techniques presented in this paper can be applied towards efficient resource allocation. If true delay estimation is used instead of topological delay estimation, satisfying a designer-specified timing constraint may require fewer resources. We present results in Table 3 illustrating this phenomenon on two examples, the UAV and the Vender example. For the given resource allocation, a minimum clock period assignment was derived using the ClkMin assignment algorithm [21]. The assignment algorithm used FEST to derive the assignments.
The column Clock Prd Constr gives the clock period constraint that an implementation of the benchmark must satisfy. The column Topo Delay shows estimates of the topological delays of implementations derived by the assignment tool ClkMin under two different resource allocations. The second allocation allows an extra ALU. The column True Delay shows an estimate of the true delay of the implementation computed by FEST with the same resource allocation as in the first column under Topo Delay.
Consider the Vender example and assume that the designer requires a clock period of 50 ns. With the resource allocation of 3 comparators and 2 ALUs shown in Table 3, an assignment which minimizes the clock period is derived by ClkMin [21]. The topological delay of the implementation is 52.3 ns. Since it does not satisfy the clock period requirement, an extra ALU has to be allocated. With the extra resource, the assignment algorithm returns an assignment whose implementation has a topological delay of 42.5 ns, thus satisfying the clock period constraint. However, if FEST is used for delay estimation during assignment, we find that the allocation of 3 comparators and 2 ALUs suffices. This is because the implementation corresponding to the assignment has a true delay of 45.5 ns, satisfying the clock period requirement of 50 ns. Similarly, the other results in Table 3 demonstrate that using true delay estimation exposes a less expensive solution which satisfies the required constraint.
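The allocation decision for the Vender example can be replayed with the numbers quoted above. The data layout and the function name `cheapest_allocation` are hypothetical; the true delay of the allocation with the extra ALU is not stated in the text, so it is left as `None` rather than guessed.

```python
# Vender example, delays in ns, allocations ordered from cheapest to costliest.
ALLOCATIONS = [
    {"name": "3 comparators + 2 ALUs", "topo": 52.3, "true": 45.5},
    {"name": "3 comparators + 3 ALUs", "topo": 42.5, "true": None},  # true delay not given
]

def cheapest_allocation(allocations, estimator, clock_period_ns):
    """Return the name of the first (cheapest) allocation whose delay under
    the chosen estimator meets the clock period, or None if none does."""
    for alloc in allocations:
        delay = alloc[estimator]
        if delay is not None and delay <= clock_period_ns:
            return alloc["name"]
    return None

print(cheapest_allocation(ALLOCATIONS, "topo", 50.0))
print(cheapest_allocation(ALLOCATIONS, "true", 50.0))
```

With the topological estimate the extra ALU appears necessary, while the true delay estimate accepts the smaller allocation for the same 50 ns constraint.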
Conclusions
We have presented FEST, a fast and accurate true delay estimation technique which, unlike existing timing analysis techniques, does not rely on path sensitization. Given the high-level information on scheduling and resource sharing, FEST partitions the paths of the RT-level implementation into two sets: the complete determining path set, CDP_R, and the non-determining path set, NDP_R. It is shown that the delay of the longest path in CDP_R is bounded below by the true delay and above by the topological delay of the implementation. Consequently, an estimate of the true delay can be computed by measuring the topological delay of the longest path in CDP_R. Experimental results demonstrate the ability of FEST to compute the high-level true delay estimates very fast, even when gate-level true delay computation becomes infeasible. The results also show that the high-level true delay estimates computed by FEST compare very well with the true delay of the gate-level implementation, whenever the latter computation is feasible.
The main component of the proposed delay estimation approach, identification of the CDP_R set, is independent of the delay models used. Since the intended use of the proposed true delay estimation technique is in an iterative high-level synthesis framework, RT-level delay models were used to estimate the delay of the paths in CDP_R. However, more accurate delay estimates can be obtained by identifying the CDP_R set at the RT level using the technique outlined in the paper, mapping the paths in CDP_R to the corresponding set of paths in the gate-level or transistor-level implementation, and using more accurate delay models, including wiring delay, to estimate the true delay at the gate level or transistor level.
