Abstract: In high-performance systems, variable-latency units are often employed to improve the average throughput when the worst-case delay exceeds the cycle time. Traditionally, units of this type have been hand-designed. In this paper, we propose a technique for the automatic synthesis of variable-latency units that is applicable to large data-path modules. We define and study an optimization problem, timed supersetting, whose solution is at the kernel of the procedure for automatic generation of variable-latency units. We contribute a new algorithm for solving timed supersetting in the most difficult case, that is, when the timing behavior of the circuit is expressed through an accurate delay model. The proposed solution overcomes the computational limitations of previous approaches and its robustness is experimentally demonstrated by obtaining high-throughput, variable-latency implementations for all the largest circuits in the Iscas '85 and Iscas '89 benchmark suites, as well as for some realistic, high-performance arithmetic units.
INTRODUCTION
As performance constraints become tighter, it is increasingly difficult to speed up combinational units simply by reducing their critical path delays. Variable-latency units (i.e., circuits that take a variable, integer number of clock cycles to complete a computation) are frequently used in high-throughput systems to achieve good common-case performance even when the worst-case delay cannot be accommodated within the cycle time. Floating-point arithmetic units [1] are typical examples of circuits of this kind.
The hand-crafted design of variable-latency units is a difficult task. In this paper, we propose a method for automatically transforming a unit with a fixed one-cycle latency into a variable-latency unit with increased average throughput. We call the product of our transformation a telescopic unit because the new circuit can "stretch" the number of cycles required to complete a computation, depending on the input values.
The throughput optimization paradigm based on telescopic units can be summarized as follows. We start from a single-cycle fixed-latency unit, defined as the logic between two sets of latches. The minimum allowable cycle time, $T$, of the unit is equal to its longest delay. We specify a reduced target cycle time $T^* < T$. Then, the input patterns for which the propagation of the input values through the original logic takes longer than $T^*$ are identified. Whenever the unit receives one of these patterns, it completes execution in two clock cycles; otherwise, it completes in one. As a last step, a combinational block is automatically synthesized and added to the original unit. The task of this block is to generate a handshaking signal, the hold signal $f_h$, whose value informs the environment when the final result is available at the outputs of the unit.
Average throughput is increased if the number of input patterns for which the unit requires two clock cycles to complete execution is small. If this is the case, the unit will almost always compute a new result in time $T^* < T$. In the following, we will show how to determine a bound on the maximum probability of long-propagating input patterns. Another important condition for the applicability of the technique is to control the area, timing, and power overheads caused by the hold function.
A procedure for the automatic synthesis of telescopic units has been proposed in [2]. The main limitation of this procedure is its high computational cost. The algorithmic core of the transformation is a symbolic routine that exploits the expressiveness of algebraic decision diagrams (ADDs) [3] to perform exact circuit timing analysis [4]. Unfortunately, when a complex and realistic model is adopted to describe the gate delays, the algorithm becomes highly memory and time consuming. Therefore, it is usable only for small circuits, i.e., a few hundred gates. To partially alleviate this problem, one may resort to a simpler delay model, e.g., the unit delay model. Even in this case, however, the wall of a few thousand gates can hardly be broken, as demonstrated by the experimental data reported in [2]. This is not surprising, since the ADD-based method exactly solves the false path problem, which is known to be NP-complete [5], independently of the selected delay model.
We propose a new algorithm for the automatic construction of telescopic units that overcomes the computational bottleneck of the ADD-based synthesis procedure. We move from the observation that the automatic generation of telescopic units entails the solution of a general problem that we call timed supersetting (TS for brevity): Find a set of input conditions that includes all values propagating to the outputs with delay longer than a given $T^*$. We study the properties of TS and explore its relationship with classical results in the field of timing analysis. The theoretical investigation leads to the implementation of a core optimization engine that replaces the computationally expensive ADD-based method and enables the synthesis of the hold function for large units (several thousand gates) even when a complex gate delay model is adopted. The technique of [2] computes the true propagation delay for each input pattern; hence, it allows us to determine the minimum set of patterns that solves TS. In contrast, the algorithm of this paper finds a nonminimum solution to TS. Such a solution is conservative, that is, it always includes the minimum one.
The downside of the conservative solution is that the hold logic may be activated for patterns that do not actually violate the cycle time constraint. Thus, the telescopic unit may operate with an average throughput that is inferior to what could theoretically be achieved. Nevertheless, the advantages outweigh the limitations, since the new algorithms for the solution of TS are practical for much larger and more complex units, such as those that can be found in high-performance microprocessors or DSPs.
Throughout the paper, the knowledgeable reader may observe the relationship between TS and the timing analysis problem (i.e., finding the true longest delay of a circuit and a pattern that exercises it). Indeed, a minimum solution to TS and the true longest delay of a circuit can be found by the same ADD-based algorithm; on the other hand, approximate timing analysis methods cannot be directly used for solving TS.
The procedure for automatically synthesizing telescopic units which encompasses the TS solution algorithms of this paper has been benchmarked on the largest Iscas '85 [6] and Iscas '89 [7] examples. Results are satisfactory since an average throughput improvement of 14.1 percent has been achieved at the price of a 6.9 percent average area overhead. In addition, the viability of the presented throughput optimization paradigm has been demonstrated by applying it to real-life, high-performance arithmetic units.
The remainder of this manuscript is organized as follows: Section 2 provides the basic terminology related to timing analysis that will be used throughout the paper. It also recalls the definitions of throughput and latency of a unit, as well as those of some Boolean operators that will be exploited by the algorithms presented in subsequent sections. Section 3 introduces the telescopic units, and briefly summarizes the synthesis procedures proposed in [2] . In Section 4, we formally state the timed supersetting problem and we propose an approximate, yet accurate, algorithm for its solution that can be fruitfully applied for automatically synthesizing large telescopic units. Section 5 reports the experimental results and Section 6 closes the paper with some concluding remarks.
BACKGROUND

Circuits and Delays
A combinational circuit is a feedback-free network of combinational logic gates. If the output of a gate, $g_i$, is connected to an input of a gate, $g_j$, then $g_i$ is a fanin of $g_j$ and gate $g_j$ is a fanout of gate $g_i$. A controlling value at a gate input is the value that determines the value at the output of the gate independent of the other inputs, while a noncontrolling value at a gate input is the value whose presence is not sufficient to determine the value at the output of the gate.
Each gate, $g$, is associated with two delays: $d_r(g)$, the rise delay, and $d_f(g)$, the fall delay. The delay function of gate $g$ is called $d(g, x)$. It equals $d_r(g)$ if $g$ takes value 1 when input vector $x = [x_1, x_2, \ldots, x_{n_i}]$ is applied to the primary inputs of the circuit. Otherwise, $d(g, x) = d_f(g)$.
Given a gate $g$, the arrival time, $AT(g, x)$, is the time at which the output of $g$ settles to its final value if the primary input vector $x$ is applied at time 0. Given a maximum delay constraint, the required time, $RT(g, x)$, is the time at which the output of gate $g$ is required to be stable when the primary input vector $x$ is applied in order for the output to stabilize within the maximum allowed delay. The slack, $SL(g, x)$, of a gate $g$ is the difference between its required time and its arrival time, i.e., $SL(g, x) = RT(g, x) - AT(g, x)$.
A path in a combinational circuit is a sequence of gates, $\pi = (g_1, \ldots, g_m)$, where gate $g_i$ is in the fanin of gate $g_{i+1}$. The length of a path $\pi = (g_1, \ldots, g_m)$ is defined as:
$$d(\pi, x) = \sum_{i=1}^{m} d(g_i, x)$$
An event is a transition $0 \to 1$ or $1 \to 0$ at a gate. Given a sequence of events, $e_1, \ldots, e_m$, occurring at gates $g_1, \ldots, g_m$ along a path $\pi$ such that $e_i$ occurs as a result of event $e_{i-1}$, the event $e_1$ is said to propagate along the path. Under a specified delay model, a path $\pi = (g_1, \ldots, g_m)$ is said to be sensitizable if an event $e_1$ occurring at gate $g_1$ can propagate along $\pi$. A false path is a nonsensitizable path. The critical path of a combinational circuit is the longest sensitizable path under a specified delay model: Its worst-case length, over all input conditions, is the delay, $D$, of the combinational circuit and it is a lower bound on the cycle time $T$, i.e., $D \le T$. For the sake of simplicity, we neglect setup and hold times, and propagation delays through registers. These factors can be easily incorporated into our analysis and synthesis technique.
Topological approximations to arrival times ($AT(g)$), required times ($RT(g)$), slacks ($SL(g)$), and path lengths ($d(\pi)$) can be computed through graph algorithms [8] whose complexity is linear in the number of gates involved. Such approximations have two properties: They are conservative and pattern-independent, that is, the following inequalities hold for all possible input vectors $x$:
$$AT(g) \ge AT(g, x), \quad RT(g) \le RT(g, x), \quad SL(g) \le SL(g, x), \quad d(\pi) \ge d(\pi, x)$$
The topological critical path of the circuit is the path with longest topological length.
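The topological quantities above can be illustrated with a minimal sketch. It assumes a single worst-case delay per gate (rather than separate rise and fall delays), primary inputs arriving at time 0, and a small hypothetical netlist; the names `topological_timing`, `fanins`, and `delay` are illustrative and do not come from the paper.

```python
def topological_timing(fanins, delay, outputs, t_required):
    """Topological arrival times, required times and slacks.

    fanins: gate -> list of fanin signals (primary inputs have no entry).
    delay:  gate -> worst-case gate delay (assumption: one delay per gate).
    """
    arrival, order = {}, []

    def at(node):
        # Longest path from any primary input; inputs arrive at time 0.
        if node not in arrival:
            ins = fanins.get(node, [])
            arrival[node] = delay.get(node, 0) + max((at(i) for i in ins), default=0)
            order.append(node)  # post-order: every fanin precedes its fanouts
        return arrival[node]

    for out in outputs:
        at(out)

    required = {n: float("inf") for n in arrival}
    for out in outputs:
        required[out] = t_required
    # Reverse post-order visits every gate before any of its fanins.
    for node in reversed(order):
        for i in fanins.get(node, []):
            required[i] = min(required[i], required[node] - delay.get(node, 0))

    slack = {n: required[n] - arrival[n] for n in arrival}
    return arrival, required, slack


# Hypothetical four-gate circuit with primary inputs a, b, c.
fanins = {"d": ["a", "b"], "e": ["c"], "f": ["d", "c"], "z": ["f", "e"]}
delay = {"d": 2, "e": 2, "f": 2, "z": 1}
arrival, required, slack = topological_timing(fanins, delay, ["z"], t_required=4)
critical = sorted(n for n, s in slack.items() if s < 0)
```

Gates with negative slack are the ones lying on paths whose topological length exceeds the required time.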
Throughput and Latency
The throughput of a unit is defined as the amount of computation (i.e., the number of times a new output value is produced) carried out per time unit. The latency, $L$, of a digital system is defined as the number of clock cycles required for a computation to complete. A fixed-latency unit with latency $L$ clocked with period $T$ has constant throughput, given by:
$$P = \frac{1}{L \cdot T}$$
For variable-latency units, we consider the average latency $L_{ave}$ over a period of time $T_{tot} \gg T$. The average throughput is simply:
$$P_{ave} = \frac{1}{L_{ave} \cdot T}$$
In the following sections, we use the shorthand notation $L$ and $P$, as opposed to $L_{ave}$ and $P_{ave}$, to denote average latency and throughput, respectively.
Boolean Functions and Operators
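We assume the reader to be familiar with the basic concepts of Boolean functions. In this section, we only review two Boolean operators which are essential for our purposes. Let $x = [x_1, x_2, \ldots, x_{n_i}]$ be a vector of Boolean variables. Given a single-output Boolean function, $f(x)$, the positive and the negative cofactors of $f$ with respect to variable $x_i$ are defined as:
$$f_{x_i} = f(x_1, \ldots, x_{i-1}, 1, x_{i+1}, \ldots, x_{n_i}), \qquad f_{x_i'} = f(x_1, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_{n_i})$$
The existential abstraction of $f$ with respect to $x_i$ is defined as:
$$\exists_{x_i} f = f_{x_i} + f_{x_i'}$$
The Boolean difference of $f$ with respect to $x_i$ is defined as:
$$\frac{\partial f}{\partial x_i} = f_{x_i} \oplus f_{x_i'}$$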
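As an illustration of these operators, the following sketch represents a Boolean function as a Python callable over tuples of 0/1 values and enumerates its ON-set explicitly. This truth-table representation is an assumption made for readability; it stands in for the BDD-based implementation used in practice, and all function names are illustrative.

```python
from itertools import product


def cofactor(f, i, value):
    """Cofactor of f with respect to variable i fixed to `value`."""
    return lambda x: f(x[:i] + (value,) + x[i + 1:])


def exists(f, i):
    """Existential abstraction: OR of the two cofactors."""
    pos, neg = cofactor(f, i, 1), cofactor(f, i, 0)
    return lambda x: pos(x) | neg(x)


def boolean_difference(f, i):
    """Boolean difference: XOR of the two cofactors."""
    pos, neg = cofactor(f, i, 1), cofactor(f, i, 0)
    return lambda x: pos(x) ^ neg(x)


def on_set(f, n):
    """All input vectors of length n for which f evaluates to 1."""
    return {x for x in product((0, 1), repeat=n) if f(x)}


# Two-input NOR: it is sensitive to x0 exactly when x1 = 0.
nor = lambda x: 1 - (x[0] | x[1])
diff_x0 = boolean_difference(nor, 0)
sensitive = on_set(diff_x0, 2)
# Abstracting x1 from the difference leaves the constant-1 function.
abstracted = on_set(exists(diff_x0, 1), 2)
```

The last two lines preview how abstraction enlarges a sensitization condition: once $x_1$ is quantified out, the NOR gate is treated as always sensitive to $x_0$.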
TELESCOPIC UNITS
Consider the problem of increasing the throughput of a combinational unit, such as the one shown in Fig. 1a. This can be done by shortening the cycle time of the unit from its original value, $T$, to $T^* < T$. One possible way of ensuring functional correctness is to extend the unit to provide an additional output signal, $f_h$, which is asserted for all input patterns requiring more than $T^*$ time units to propagate to the outputs of the block (see Fig. 1b).
We call the modified unit a telescopic unit since it may require $L_{max} > 1$ cycles to complete its execution, depending on the specific patterns appearing at its primary inputs. We consider here the situation in which $L_{max} = 2$. In this case, the computation completes in $T^*$ time units for patterns such that $f_h = 0$ and in $2T^*$ time units for patterns such that $f_h = 1$.
Conditions for Throughput Improvement
The average throughput of the original unit is given by:
$$P = \frac{1}{T}$$
Conversely, for the telescopic unit, the lower the probability of the hold signal, $f_h$, to take on the value 1, the larger the overall throughput improvement. In fact, its average throughput, $P^*$, is given by:
$$P^* = \frac{1}{T^* \,(1 - \mathrm{Prob}(f_h)) + 2T^* \,\mathrm{Prob}(f_h)} = \frac{1}{T^* \,(1 + \mathrm{Prob}(f_h))}$$
where $\mathrm{Prob}(f_h)$ is the probability of the hold signal being one. The use of the telescopic unit is therefore advantageous only for some values of $T^*$ and $\mathrm{Prob}(f_h)$, i.e., when $P^* > P$. In particular, we have the following condition for throughput improvement:
$$\mathrm{Prob}(f_h) < \frac{T - T^*}{T^*}$$
It should be noticed that the inequality above is valid only for $T^* \ge T/2$ since we have made the assumption that $L_{max} = 2$.
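The condition for throughput improvement can be checked numerically with a short sketch. It assumes the two-cycle case ($L_{max} = 2$) discussed above; the function names and the sample values of the cycle times and hold probability are illustrative only.

```python
def telescopic_throughput(t_star, p_hold):
    """Average throughput of a two-cycle telescopic unit.

    Average latency is 1 * (1 - p_hold) + 2 * p_hold = 1 + p_hold cycles.
    """
    return 1.0 / (t_star * (1.0 + p_hold))


def improves_throughput(t, t_star, p_hold):
    """True if the telescopic unit beats the original unit (throughput 1/t)."""
    if not (t / 2 <= t_star < t):
        raise ValueError("the two-cycle analysis assumes T/2 <= T* < T")
    return p_hold < (t - t_star) / t_star


# Illustrative numbers: original cycle time 10, reduced cycle time 8.
original = 1.0 / 10
with_rare_hold = telescopic_throughput(8, 0.10)
with_frequent_hold = telescopic_throughput(8, 0.30)
```

With these numbers the bound on the hold probability is $(10 - 8)/8 = 0.25$, so a hold probability of 0.10 pays off while 0.30 does not.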
The extension to $L_{max} > 2$ is conceptually straightforward, but more complex to implement. This is because several hold signals $f_h^k$ are required to make the unit work correctly. Function $f_h^k$ is one for the input patterns that require $k + 1$ cycles to complete execution.
The expression for $P^*$ can obviously be modified to account for values of $T^* < T/2$; in other words, when $L_{max} > 2$, the formula that gives the value of the average throughput of the telescopic unit becomes:
$$P^* = \frac{1}{T^* \sum_{j=0}^{L_{max}-1} (j + 1)\,\mathrm{Prob}(f_h^j)}$$
where $\mathrm{Prob}(f_h^j)$ represents the probability that the unit completes the execution in a time frame between $j\,T^*$ and $(j + 1)\,T^*$. For clarity, in the remainder of this work, we focus on the case $L_{max} = 2$.
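The multi-cycle case amounts to weighting each latency by its probability. The sketch below assumes that `probs[j]` is the probability of completing within the time frame between $j\,T^*$ and $(j+1)\,T^*$, i.e., in $j + 1$ cycles, and that these probabilities sum to one; the function name is illustrative.

```python
def multi_cycle_throughput(t_star, probs):
    """Average throughput when a computation may take 1..len(probs) cycles.

    probs[j] is the probability of completing in j + 1 cycles.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("latency probabilities must sum to 1")
    average_latency = sum((j + 1) * p for j, p in enumerate(probs))
    return 1.0 / (t_star * average_latency)


# With two latency classes the expression reduces to 1 / (T* (1 + Prob(f_h))).
two_cycle = multi_cycle_throughput(8, [0.9, 0.1])
three_cycle = multi_cycle_throughput(4, [0.7, 0.2, 0.1])
```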
In order to automatically synthesize telescopic units, two problems must be solved. First, the hold function (that is, a combinational logic function that detects all input patterns that propagate to the outputs with delay larger than $T^*$) must be computed and synthesized. Second, the controller of the data-path where the telescopic unit is instantiated must be modified since it must be able to synchronize the environment with the telescopic unit by delaying subsequent computations when $f_h = 1$. The following section provides some details on the procedures we have proposed in [2] to synthesize function $f_h$. Controller redesign techniques are not discussed here since they are beyond the scope of this paper (the topic is extensively addressed in [2]).
Synthesis of the Hold Logic
The synthesis of the hold logic critically depends on the capability of finding all input patterns that propagate to the outputs with delay larger than $T^*$. Such patterns must be included in the ON-set of function $f_h$. In the next section we analyze this problem in detail. Here, we assume the availability of a black-box procedure, ComputeF_h($C$, $T^*$), that returns the ON-set of $f_h$. The input parameters of this procedure are the initial specification of the unit, $C$, and the desired cycle time, $T^*$. ComputeF_h solves the timed supersetting problem that was informally introduced in Section 1. In fact, the minimum solution of TS is the ON-set of the hold function $f_h$ that contains all and only those input values that propagate to the outputs of the unit with a delay longer than $T^*$. Ideally, we would like to implement a hold logic that takes value 1 exactly for the input values corresponding to the ON-set of the hold function. In this way, the unit would require two cycles to complete only for patterns that do propagate to the output in a time longer than $T^*$. However, we must also guarantee that the implementation of the hold logic itself has a delay shorter than $T^*$, and this may not always be possible. Thus, the target is to determine an enlarged hold function, $f_h^e \ge f_h$, such that the average performance of the unit is only marginally degraded, but the implementation of $f_h^e$ meets the timing constraint $T^*$ and has well-controlled area and power dissipation.
The heuristic devised in [2] for synthesizing the enlarged function $f_h^e$ starts from the BDD representation of $f_h$. It generates the hold logic following an iterative paradigm. First, the BDD of $f_h$ is mapped onto a multiplexor network. Then, the network is optimized through traditional synthesis techniques; finally, a check is made to find out whether the timing constraint on the hold logic (delay of $f_h$ shorter than $T^*$) is met. If this is not the case, the ON-set of $f_h$ is enlarged, to obtain $f_h^e$, by properly removing some BDD nodes, and the process is repeated.
THE TIMED SUPERSETTING PROBLEM
In this section, we formally state the timed supersetting (TS) problem and one important variation, called minimum timed supersetting (MTS). The practical relevance of TS and MTS for the synthesis of telescopic units has been outlined in the previous section. For the sake of comparison, we briefly describe the algorithm for the solution of MTS (and TS) presented in [2] . We then take a completely different approach and present the key contribution of this paper, namely a robust and widely applicable algorithm for the solution of TS.
Consider a combinational circuit $C$ with primary inputs $x = [x_1, \ldots, x_{n_i}]$ and primary outputs $o$. The timed supersetting problem can be formally stated as follows:
Problem 1. Find a set $X$ of input values $x$ that includes all values which propagate to the outputs $o$ with a delay larger than or equal to a given $T^*$.
Obviously, TS always has the trivial solution $B^{n_i}$, i.e., the complete Boolean space is guaranteed to include all input values with propagation delay larger than $T^*$. We are interested in nontrivial solutions of TS. A theoretically relevant solution is the minimum one. The minimum timed supersetting problem consists of finding the smallest set of input values with propagation delay larger than $T^*$. Formally:
Problem 2. Find the set $X_{min}$ of all and only those input patterns $x$ which propagate to the outputs $o$ with a delay larger than, or equal to, a given $T^*$.
It is quite easy to prove the NP-completeness of MTS. Solving MTS when $T^*$ is equal to the longest propagation delay of $C$ is at least as hard as finding a single pattern with maximum propagation delay. This problem is NP-complete [5].
Observe that $X_{min} \subseteq X$ for every solution $X$ of TS, i.e., every solution of TS is guaranteed to contain the solution of MTS. Among the solutions of TS, we are interested in near-minimum solutions. In more detail, we are looking for approximations of $X_{min}$ that:
1. Include $X_{min}$;
2. Are as close as possible to $X_{min}$;
3. Can be computed in polynomial time and space (in $n_i$).
Before discussing our approximation strategy, we review an algorithm for the exact solution of MTS.
Exact MTS Solution
An ADD-based algorithm for the exact solution of MTS has been presented in [2]. The arrival time ADD for each output $o_i$ of the circuit is first computed using the algorithm of [4]. Such ADDs provide the propagation delay for any possible input vector. The logic function $f_h^{o_i}(x)$, which assumes the value 1 for all input vectors for which the arrival time of $o_i$ is greater than the desired cycle time $T^*$, is then obtained through symbolic ADD operations. Finally, function $f_h(x)$, collecting the set of input conditions for which at least one circuit output $o_i$ has an arrival time greater than $T^*$, is computed by OR-ing together all functions $f_h^{o_i}$. The main limitation of the algorithm is its worst-case exponential time and space complexity. When a complex, load- and path-dependent delay model is used, it is impossible to build the arrival time ADDs even for the outputs of relatively small circuits. The memory requirements for such a construction are simply excessive. Another shortcoming of this approach is that, when building the delay ADDs, complete delay information is computed, even for patterns that propagate much faster than $T^*$. The computation of unneeded information contributes to the memory blow-up problem.
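The structure of the exact computation (threshold each output's pattern-dependent arrival time, then OR over the outputs) can be sketched by explicit enumeration. This is not the ADD-based algorithm of [2]: enumeration replaces the symbolic operations, and the per-pattern delay function `toy_arrival` is a hypothetical model of a carry chain, introduced here only to have some pattern-dependent delay to threshold.

```python
from itertools import product


def exact_hold_on_set(arrival_time, n_inputs, outputs, t_star):
    """ON-set of the exact hold function, by exhaustive enumeration.

    arrival_time(o, x) must return the true arrival time of output o
    under input vector x (assumption: such an oracle is available).
    """
    on = set()
    for x in product((0, 1), repeat=n_inputs):
        # OR over the outputs of the per-output threshold functions.
        if any(arrival_time(o, x) > t_star for o in outputs):
            on.add(x)
    return on


def toy_arrival(o, x):
    """Hypothetical delay model: output bit o waits for a carry that
    ripples through the run of ones starting at the least significant
    bit x[0]; each stage costs one time unit."""
    run = 0
    for bit in x:
        if bit == 0:
            break
        run += 1
    return 1 + min(o, run)


hold = exact_hold_on_set(toy_arrival, n_inputs=3, outputs=[0, 1, 2], t_star=2)
hold_probability = len(hold) / 2 ** 3  # assuming equiprobable input patterns
```

The loop visits all $2^{n_i}$ patterns, which makes the exponential cost of an exact solution explicit.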
Near-Minimum TS Solution
Since the exact solution of MTS is computationally infeasible for large circuits, we resort to algorithms that only solve TS but attempt to find solutions which are as close as possible to the minimum one. Notice the analogy with the approaches used in timing analysis. When exact delay computation is unaffordable, it is possible to resort to safe approximations with various degrees of tightness.
Consider a combinational circuit $C$ with primary inputs $x = [x_1, \ldots, x_{n_i}]$. A gate, $g_i$, of the network is associated with a Boolean function $f_i(y)$, where $y = [y_1, \ldots, y_{n_{g_i}}]$ is the local support of $f_i$. We call $F_i(x)$ the Boolean function associated with gate $g_i$ expressed as a function of the primary inputs (global support).
Let us assume that topological timing analysis has been performed, and that the topological critical path $\pi = (g_1, g_2, \ldots, g_m)$ has been determined. Let us assume also that the topological length of the critical path, $d(\pi)$, violates the desired cycle time, namely $d(\pi) > T^*$. Since we are relying only on topological delay analysis, we conservatively consider the path $\pi$ as a true one. Consequently, all input conditions that activate it must be in the ON-set of the hold function $f_h$.
To find such conditions, we move along the critical path from the primary inputs toward the output. We call critical input $y_c$ of a gate $g_i$ on the critical path the input which connects it with gate $g_{i-1}$. For each gate in the path, we specify the local sensitization function $s_i$ as the Boolean function that takes on the value 1 when gate $g_i$ is sensitive to the value of $y_c$:
$$s_i(y) = \frac{\partial f_i}{\partial y_c} \qquad (1)$$
Expressing $s_i$ over the global support gives a function $S_i(x)$ of the primary inputs, and the product over all gates of the path is:
$$C_{crit} = \prod_{i=1}^{m} S_i(x) \qquad (2)$$
We call $C_{crit}$ the path sensitization conditions. This formula holds because, for a signal to propagate along a path, all gates on the path must be sensitized. We call partial path sensitization conditions $C_{crit,j} = \prod_{i=1}^{j} S_i(x)$ (with $j \le m$) the path sensitization conditions for gates belonging to the path up to level $j$. Clearly, $C_{crit,m} = C_{crit}$. The partial sensitization conditions can be computed with the following recursion, for $j = 1, \ldots, m$:
$$C_{crit,j} = C_{crit,j-1} \cdot S_j(x), \qquad C_{crit,0} = 1 \qquad (3)$$
A property of (3) is that $C_{crit,j} \le C_{crit,k}$ for each $j > k$, that is, the $C_{crit,j}$ are monotonically decreasing (i.e., the ON-set of $C_{crit,j}$ shrinks monotonically) with increasing $j$. Notice that computing the complete $C_{crit}$ is equivalent to testing the viability of path $\pi$. Since this problem is NP-complete, there will be instances for which this computation requires an exponential amount of time or resources. However, the key observation is that we do not have to compute the complete $C_{crit}$ to find a conservative set of input conditions for which the circuit delay violates the timing constraint $T^*$ (i.e., the hold function $f_h$). Any $C_{crit,j}$ is suitable for that purpose, because its ON-set contains that of $C_{crit}$.
Example 1. Consider the circuit of Fig. 2 (taken from [5]) ...
Although it may appear that (3) provides a viable procedure for finding a near-minimum solution of TS, two major problems need to be addressed:
1. The sensitization conditions of (1) are static, that is, they do not consider the dynamic propagation of the events along the paths. It is a well-known fact that the absence of static path sensitization conditions (i.e., $C_{crit} = 0$) is not sufficient to guarantee that a path does not propagate events with delays that violate the timing constraint $T^*$. This phenomenon is known as dynamic sensitization [5]. Notice that every valid solution to TS must include all patterns with propagation longer than $T^*$. Hence, the approach based on simple static sensitization may lead to incorrect implementations and must be augmented by some form of dynamic sensitization test.
2. Equation (3) has been obtained under the assumption that there is only one path violating the timing constraint. This is not generally true. In almost all practical examples, there are multiple critical paths. Moreover, the number of such paths can be exponential in the number of gates in the network. The complexity explosion caused by the number of critical paths must be addressed in a conservative fashion.
In the next two sections, we analyze and solve the above problems. We then describe in detail our strategy for finding a near-minimum TS solution in an efficient and robust way.
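The recursion for the partial path sensitization conditions, and the monotonic shrinking of their ON-sets, can be illustrated with a small sketch of the purely static computation (without the dynamic test called for in the first item above). ON-sets are enumerated explicitly instead of being held as BDDs, and the three-gate path is a hypothetical example, not the circuit of Fig. 2.

```python
from itertools import product


def partial_conditions(n_inputs, path):
    """ON-sets of the partial path sensitization conditions, gate by gate.

    path: list of (gate_fn, side_fns); gate_fn(y, sides) evaluates the gate
    for critical-input value y and a tuple of side-input values, and each
    side function maps a primary-input vector x to the side-input value.
    """
    current = set(product((0, 1), repeat=n_inputs))  # start from constant 1
    partial = []
    for gate_fn, side_fns in path:
        def sensitized(x):
            sides = tuple(s(x) for s in side_fns)
            # Boolean difference with respect to the critical input.
            return gate_fn(1, sides) != gate_fn(0, sides)
        current = {x for x in current if sensitized(x)}
        partial.append(current)
    return partial


# Hypothetical path over inputs x = (x0, x1, x2); x0 drives the path.
path = [
    (lambda y, s: y & s[0], [lambda x: x[1]]),      # AND, side input x1
    (lambda y, s: y | s[0], [lambda x: x[2]]),      # OR, side input x2
    (lambda y, s: y & s[0], [lambda x: 1 - x[1]]),  # AND, side input NOT x1
]
partial = partial_conditions(3, path)
sizes = [len(c) for c in partial]
```

In this example the complete static condition is empty, while stopping after the first or second gate still yields a valid, larger set; any prefix of the recursion is a conservative stand-in for the full product.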
Accounting for Dynamic Sensitization
In order to derive sensitization conditions which are correct and conservative, let us consider a gate $g_i$ on a critical path (i.e., a path whose topological length exceeds $T^*$). Let $AT(y_c)$ be the topological arrival time at the critical input and $SL(g_i)$ the (negative) slack of the gate. Let $W = \{w_1, \ldots, w_p\}$ denote the set of side inputs, and $AT(w_i)$, $i = 1, \ldots, p$, their topological arrival times. We present the following safe conditions for declaring a path as false.
Theorem 1. Given the topological arrival times, required times, and slacks for all gates belonging to path $\pi$, the static sensitization conditions of (1) are correct if, for all gates $g_i$ of $\pi$:
$$AT(y_c) + SL(g_i) > AT(w_i), \quad \forall w_i \in W \qquad (4)$$
Proof. The topological arrival time is an upper bound to the actual arrival time; then, the topological slack is always more conservative (i.e., smaller) than the actual slack. Hence, all transitions on $y_c$ that take place before time $t_{min} = AT(y_c) + SL(g_i)$ cannot arrive late at the outputs. If all side inputs $w_i$ are early enough and stabilize before $t_{min}$, any transition on $y_c$ that could arrive late at the outputs does find the side inputs already stable at their final value. Thus, static sensitization can be used to assess if the values of the side inputs filter out the propagation on the critical path.
This criterion may be extremely conservative in some cases because it prevents us from using the sensitization conditions for a gate $g_i$ if any of its side inputs do not satisfy (4). In the vast majority of cases, only some of the inputs violate the inequality. When the inputs that satisfy the inequality have controlling value, the gate still filters out events on the critical input. Therefore, we can relax the conditions stated by (1). This can be done by exploiting some of the results available in the literature on timing analysis.
A well-known criterion which is particularly suitable for a BDD-based symbolic implementation is the one introduced by Brand and Iyengar [11] . In that work, the sensitization conditions (1) are overestimated by abstracting a set of the local gate inputs.
The key point with this approach is selecting which and how many inputs should be abstracted. Inequality (4) provides the criterion to do that. If we call $\tilde{W} = \{\tilde{w}_1, \ldots, \tilde{w}_k\}$ the set of side inputs that do not satisfy (4), the comprehensive criterion for robustly and correctly detecting a false path, at a gate $g_i$, becomes:
$$\sigma_i(y) = \exists_{\tilde{W}} \frac{\partial f_i}{\partial y_c} \qquad (5)$$
Similarly to the static sensitization conditions, we can extend $\sigma_i(y)$ to the global support of the circuit, and compute $\Sigma_i(x)$ as a function of the primary inputs $x$. The sensitization conditions for the entire path are then given by the intersection of all sensitization conditions of the gates $g_1, \ldots, g_m$. In formula:
$$\Sigma_{crit} = \prod_{i=1}^{m} \Sigma_i(x) \qquad (6)$$
Inequality (4) is used to decide which side inputs arrive too late and should be quantified out from the sensitization conditions. Clearly, the procedure does not solve MTS exactly since conservative and pattern-independent topological estimates of the arrival times and slacks are used. In other words, (4) is a sufficient, but not necessary, condition for deciding whether a side input stabilizes before the critical input arrives.
Example 2. Consider again the circuit of Fig. 2. In Example 1, we have found that the static sensitization conditions for the critical path that starts at a primary input and traverses gates $d$, $f$, and $z$ are null (i.e., $C_{crit} = 0$). However, the path is not false. This can be verified by inspection of the circuit: The output stabilizes after 5 time units when the three primary inputs (named as in Fig. 2) have the following transitions: $1 \to 0$, $1 \to 0$, and $0 \to 0$. Hence, the circuit is a counter-example that proves the insufficiency of static sensitization to solve TS. We set the required time to $T^* = 4$. The arrival time of the inputs is 0. The arrival times of the outputs of the gates are: $AT(d) = 2$, $AT(f) = 4$, $AT(e) = 2$, and $AT(z) = 5$. The slacks on the critical inputs are all equal to $-1$, including $SL(f) = -1$ and $SL(d) = -1$. Now, we apply our technique for computing the path sensitization conditions $\Sigma_{crit}$. We start with the NOR gate with output $d$. Its critical input is the primary input at the head of the critical path, and its local sensitization function is the Boolean difference of $d$ with respect to that input. Inequality (4) is not satisfied for the side input of the gate, because the left-hand side evaluates to $0 + (-1) = -1$ while the arrival time of the side input is 0. Thus, we must quantify the side input out from the sensitization conditions of the gate, which become identically 1. Inequality (4) is satisfied for the remaining two gates on the critical path. They contribute to the path sensitization conditions with their Boolean differences $\partial f / \partial d$ and $\partial z / \partial f$. The final path sensitization conditions $\Sigma_{crit}$, the product of these three terms, are obviously not null.
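The two ingredients used in the example, the timing test of (4) and the abstraction of late side inputs, can be sketched for a single gate as follows. The sketch works on the local support only; the side-input name `b` and the helper names are hypothetical, and the numbers mirror the NOR gate discussed above (critical input arriving at time 0, slack $-1$, side input arriving at time 0).

```python
from itertools import product


def late_side_inputs(at_critical, slack_gate, at_sides):
    """Side inputs that violate (4), i.e. may still change at t_min or later."""
    t_min = at_critical + slack_gate
    return [w for w, at in at_sides.items() if not (t_min > at)]


def relaxed_sensitization(gate_fn, side_names, late):
    """Boolean difference w.r.t. the critical input, with the late side
    inputs existentially abstracted. Returns a predicate over a dict that
    assigns a value to every side input."""
    def sigma(assign):
        early = {w: assign[w] for w in side_names if w not in late}
        for values in product((0, 1), repeat=len(late)):
            full = dict(early, **dict(zip(late, values)))
            sides = tuple(full[w] for w in side_names)
            if gate_fn(1, sides) != gate_fn(0, sides):
                return True
        return False
    return sigma


nor = lambda y, sides: 1 - (y | sides[0])

# Late side input: it is abstracted, so the gate is never assumed to filter.
late = late_side_inputs(at_critical=0, slack_gate=-1, at_sides={"b": 0})
sigma_late = relaxed_sensitization(nor, ["b"], late)

# Early side input (hypothetical timing): plain static sensitization applies.
early_case = late_side_inputs(at_critical=3, slack_gate=-1, at_sides={"b": 1})
sigma_early = relaxed_sensitization(nor, ["b"], early_case)
```

When the side input is late, `sigma_late` holds for every assignment, matching the constant-1 condition obtained for gate $d$ in the example; when it is early, a controlling value on the side input is allowed to block the path.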
Dealing with Multiple Paths
So far, we have described a robust, yet simple, algorithm for finding a near-minimum TS solution which is applicable to the cases where a single critical path is present in the circuit. In this section, we present an algorithm (see the pseudocode in Fig. 5) to find a near-minimum solution to TS (i.e., the hold function $f_h$ for a telescopic unit) in the case of multiple critical paths. The procedure receives, as inputs, circuit $C$ and the desired cycle time $T^*$, given as an absolute time value or as a percentage of the actual critical delay. It initially performs (Line 1) static timing analysis, computing arrival times, required times, and slacks for each gate. Then, the network is levelized (Line 2), that is, the gates are grouped into the list Levels according to their topological level, starting from the primary inputs, which are assumed to be at level 0. Starting from level 0, the critical gates (i.e., gates with negative slack) at each level are processed (Line 4), and a Boolean function $PAF(x)$ (Path Activation Function) is computed as follows: At each gate, the function is obtained by summing, over its critical fanins, the product of two quantities: 1) the path activation function of the $i$th fanin ($PAF_i(x)$ in Line 7); 2) the sensitization conditions $\Sigma_i(x)$, computed with (6). Clearly, the PAF for each primary input is assumed to be 1. The output of the procedure is a near-minimum solution of TS or, equivalently, the hold function $f_h$ of a telescopic unit. It is computed in Lines 8 and 9, by accumulating the PAFs of all critical gates that are connected to an output.
The rationale of the algorithm is that every critical gate filters the activation conditions of a critical input $i$ by ANDing the $\Sigma_i$ of the input to the conditions for which an event propagates up to input $i$ (i.e., $PAF_i$). If a gate has more than one critical input, its PAF is the sum of the filtered PAFs of its critical inputs.
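The gate-by-gate accumulation of path activation functions can be sketched as follows. This is an illustrative reading of the procedure described above, not the pseudocode of Fig. 5: functions are stored as explicit sets of input vectors instead of BDDs, each gate has a single delay, gates are visited in topological order rather than by level, and a fanin is treated as critical when it has negative slack and its connection violates the required time of the gate. The netlist is hypothetical.

```python
from itertools import product


def sensitized(fn, vals, k, late):
    """True if some assignment to the late side inputs makes fn depend on
    its k-th fanin (Boolean difference with late inputs abstracted)."""
    for assign in product((0, 1), repeat=len(late)):
        v = list(vals)
        for j, b in zip(late, assign):
            v[j] = b
        v[k] = 0
        low = fn(tuple(v))
        v[k] = 1
        if low != fn(tuple(v)):
            return True
    return False


def compute_f_h(inputs, gates, outputs, t_star):
    """gates: name -> (fn, fanins, delay), inserted in topological order."""
    space = list(product((0, 1), repeat=len(inputs)))
    value = {x: dict(zip(inputs, x)) for x in space}  # global functions
    at = {i: 0 for i in inputs}
    for g, (fn, fanins, d) in gates.items():
        at[g] = d + max(at[i] for i in fanins)
        for x in space:
            value[x][g] = fn(tuple(value[x][i] for i in fanins))
    rt = {n: float("inf") for n in at}
    for o in outputs:
        rt[o] = t_star
    for g in reversed(list(gates)):
        _, fanins, d = gates[g]
        for i in fanins:
            rt[i] = min(rt[i], rt[g] - d)
    slack = {n: rt[n] - at[n] for n in at}

    paf = {i: set(space) for i in inputs if slack[i] < 0}  # PAF = 1 at inputs
    for g, (fn, fanins, d) in gates.items():
        if slack[g] >= 0:
            continue
        paf[g] = set()
        for k, y in enumerate(fanins):
            if y not in paf or at[y] + d <= rt[g]:
                continue  # not a critical connection
            t_min = at[y] + slack[g]
            late = [j for j, w in enumerate(fanins) if j != k and not t_min > at[w]]
            paf[g] |= {x for x in paf[y]
                       if sensitized(fn, [value[x][i] for i in fanins], k, late)}
    f_h = set()
    for o in outputs:
        f_h |= paf.get(o, set())
    return f_h


nor2 = lambda v: 1 - (v[0] | v[1])
inv = lambda v: 1 - v[0]
and2 = lambda v: v[0] & v[1]
gates = {"d": (nor2, ["a", "b"], 2), "e": (inv, ["c"], 2),
         "f": (nor2, ["d", "c"], 2), "z": (and2, ["f", "e"], 1)}
f_h = compute_f_h(["a", "b", "c"], gates, ["z"], t_star=4)
```

For this netlist the hold function covers the patterns with $c = 0$: the late side input of gate `d` is abstracted, while the early side inputs of `f` and `z` filter out the patterns with $c = 1$.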
An important feature of the algorithm is that it is based on a traversal of the critical gates and not of the critical paths. In fact, the number of (critical) paths can be exponential in the number of gates in the network, whereas the number of critical gates is guaranteed to be smaller than the number of gates.
Note that the algorithm relies on topological timing analysis. It is a well-known fact that such estimates can be very conservative. In the limiting case, if the topological delay is longer than the true delay and the true delay is shorter than $T^*$, we may actually synthesize useless hold logic. This is due to the fact that our procedure is conservative and it may actually flag as belonging to $f_h$ some input conditions that do not propagate any perturbation to the output. Observe, however, that the accuracy of the procedure can be improved if more powerful algorithms for the computation of the arrival times are used (see, for example, [12], [13]). The modification of the pseudocode in Fig. 5 is straightforward: It is sufficient to replace the StaticTimingAnalysis call with the call to an advanced timing analysis procedure. On the other hand, the computational burden of obtaining accurate delay information for all gates in the network may be substantially higher than that required by simple static timing analysis. In summary, the StaticTimingAnalysis should be replaced by the procedure that is used for timing analysis in the design flow.
Cutting Heuristics
Although the algorithm of Fig. 5 does not suffer from the computational bottleneck of the exact method of [2], there may be circuits for which constructing the BDDs of the sensitization functions is still not feasible. In these cases, an approximate solution is required that allows us to work with partial timing information.
A simple solution may be that of stopping procedure ComputeF_h after a desired number of levels or, alternatively, when the sizes of the BDDs grow beyond a given threshold. Unfortunately, this would result in incomplete timing information since some critical paths could be incorrectly left out of the computation. In fact, computing the hold function by levels does not necessarily take into account all critical paths unless we guarantee that the last level (i.e., the primary outputs) is reached.
The observation above suggests a criterion for computing the timing information incrementally. The key to such a criterion is to progressively select sets of critical gates, hereafter called cuts, such that the gates in a set cut all critical paths. If we can compute the BDDs (in the global support) of the path activation functions of all gates in a cut $q$, a solution of TS (i.e., a valid $f_h$) is simply the Boolean sum of those functions:
$$f_h = \sum_{k \in q} PAF_k(x)$$
A good cutting heuristic is obviously essential for an effective realization of the $f_h$ computation algorithm. The one we propose starts from the critical inputs (cut $q_0$) and consists of the repeated application of three phases until no gate in the combinational circuit is left:
1. From a cut $q_i$, we reach the critical gates in the fanout of any gate in $q_i$. Only critical connections (i.e., connections from the output of a critical gate to the critical input of another critical gate) are explored. A newly reached gate is marked as belonging to the new cut $q_{i+1}$ only if all its critical fanins belong to a previous cut $q_j$, $j = 0, 1, \ldots, i$, or to cut $q_{i+1}$ itself.
2. If at least one gate has been marked, we check whether all critical fanins of some additional gates reached from $q_i$ have been reached. If this is the case, such gates are marked as belonging to $q_{i+1}$. This step is repeated until no new gate is marked. In other words, all gates for which all the critical fanins belong to $q_j$, $j \le i+1$, are marked.
3. The remaining critical gates reachable from $q_i$ do not belong to $q_{i+1}$ and are discarded. However, to guarantee that all critical paths are cut, we insert in $q_{i+1}$ all critical fanins of the discarded nodes which belong to previous cuts (or to cut $q_{i+1}$ itself).
The set of gates $q_{i+1}$ is the new cut. Notice that if an output is reached during traversal at cut $q_j$, such output is inserted in all successive cuts $q_k$, $k > j$. In addition, it can be easily observed that, in general, successive cuts are not disjoint.
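The three phases can be sketched as follows. This is one possible reading of the heuristic, under two assumptions: all gates and connections passed in are critical, and the "additional gates" of phase 2 are taken from the direct fanout of the current cut. The function name `generate_cuts` is illustrative, and the small netlist is hypothetical: it is chosen only to be consistent with the narrative of Example 3, since Fig. 6 is not reproduced here.

```python
from typing import Dict, List, Set


def generate_cuts(fanins: Dict[str, List[str]], inputs: Set[str],
                  outputs: Set[str]) -> List[Set[str]]:
    """Successive cuts of a network in which every gate and connection is critical."""
    fanouts: Dict[str, List[str]] = {g: [] for g in fanins}
    for gate, preds in fanins.items():
        for p in preds:
            fanouts[p].append(gate)

    cut = set(inputs)                       # cut 0: the critical inputs
    done = set(inputs)                      # gates marked in some cut so far
    seen_outputs = cut & outputs            # outputs are carried into later cuts
    cuts = [set(cut)]
    while True:
        reached = {s for g in cut for s in fanouts[g]} - done
        if not reached:
            return cuts
        new: Set[str] = set()
        changed = True
        while changed:                      # phases 1 and 2: mark until a fixpoint
            changed = False
            for g in reached - new:
                if all(p in done or p in new for p in fanins[g]):
                    new.add(g)
                    changed = True
        if not new:
            raise ValueError("no gate could be marked; is the network acyclic?")
        discarded = reached - new           # phase 3: keep their marked fanins
        carried = {p for g in discarded for p in fanins[g] if p in done or p in new}
        seen_outputs |= new & outputs
        done |= new
        cut = new | carried | seen_outputs
        cuts.append(set(cut))


# Hypothetical netlist consistent with the description of Example 3.
example_fanins = {"a": [], "b": [], "c": [], "d": [],
                  "e": ["a", "b"], "f": ["c", "e"], "g": ["e"], "h": ["d", "g"]}
example_cuts = generate_cuts(example_fanins, {"a", "b", "c", "d"}, {"f", "h"})
```

On this netlist the sketch yields three cuts: the four inputs, then e, f, and d, and finally g, h, and f, with d and f being the gates carried over to keep every path cut.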
After the computation of $q_{i+1}$, the path activation functions of its gates and $f_h$ are computed. The termination conditions of the traversal algorithm are the following:
- If the BDD of a PAF for a gate in $q_{i+1}$ blows up, the computation is aborted and the BDD of the $f_h$ of the previous cut is returned.
- Once all PAFs have been computed, $f_h$ is obtained by taking the Boolean sum of all PAFs. If the BDD of $f_h$ blows up during the Boolean sum, the computation is aborted and the BDD of the $f_h$ of the previous cut is returned.
- If the computation of $f_h$ in the global support succeeds and the cut is the last one, $f_h$ is returned. Conversely, if the cut is not the last one, $f_h$ is stored and the next cut is generated.
The $f_h$ for $q_0$ is obviously the most conservative TS solution, that is, $f_h = 1$. In the worst case, if the PAF or $f_h$ computation fails at the first cut, the value of $f_h$ returned is the tautology. Hence, the procedure is guaranteed to return a valid solution to TS, but it may return the trivial one.
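The fallback behavior implied by these termination conditions can be sketched as below. The sketch abstracts away the BDD package: functions are truth-table bitmasks, and a blow-up is modeled either by the hypothetical exception `BddBlowUp` raised while building a PAF or by a caller-supplied size check `too_big` applied during the Boolean sum. All names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


class BddBlowUp(Exception):
    """Signals that a function representation exceeded the available resources."""


def f_h_by_cuts(cuts: Sequence[Sequence[str]],
                paf: Callable[[str], int],
                tautology: int,
                too_big: Callable[[int], bool]) -> int:
    """Return the f_h of the last cut whose computation completed."""
    f_h = tautology                          # cut 0 gives the trivial solution f_h = 1
    for cut in cuts[1:]:
        try:
            acc = 0
            for gate in cut:
                acc |= paf(gate)             # building a PAF may raise BddBlowUp
                if too_big(acc):             # blow-up during the Boolean sum
                    raise BddBlowUp(gate)
            f_h = acc                        # store and move on to the next cut
        except BddBlowUp:
            return f_h                       # fall back to the previous cut
    return f_h


# Toy data over two input variables (truth tables are 4-bit masks).
toy_cuts: List[List[str]] = [["a", "b"], ["g1"], ["g2"]]
toy_pafs = {"g1": 0b1110, "g2": 0b1000}


def failing_paf(gate: str) -> int:
    if gate == "g2":
        raise BddBlowUp(gate)
    return toy_pafs[gate]


full = f_h_by_cuts(toy_cuts, toy_pafs.__getitem__, 0b1111, lambda f: False)
partial = f_h_by_cuts(toy_cuts, failing_paf, 0b1111, lambda f: False)
trivial = f_h_by_cuts(toy_cuts, toy_pafs.__getitem__, 0b1111, lambda f: True)
```

Whatever happens, the returned function is the hold function of some complete cut, so it remains a valid, if possibly conservative, TS solution.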
Example 3. Assume that all gates in Fig. 6 are critical.
Initially, $q_0 = \{a, b, c, d\}$. For generating $q_1$, the fanouts of the gates in $q_0$ are explored (notice that here all connections are critical because all gates are critical), i.e., gates $\{e, f, h\}$. First, only e is marked because all its critical fanins belong to $q_0$. Then, f is marked because its other critical fanin belongs to $q_0$ and e was previously marked. Gate h is discarded. The new cut is then $q_1 = \{e, f, d\}$. Gate d is included in $q_1$ to guarantee that all critical paths are cut. Notice that $q_0 \cap q_1 \neq \emptyset$. The $f_h$ for $q_1$ is $f_h = PAF_e + PAF_f + PAF_d$. The third and last cut, $q_2$, is finally computed and it consists of gates g, h, and f. Gate f is included in $q_2$ because it is a primary output.
The algorithm for near-minimum TS solution described in Section 4.2.2 is modified by replacing a level-based traversal with a cut-based traversal. In this way, ComputeF_h is guaranteed to always return a valid solution to TS, even in the case of BDD blow-up. 
EXPERIMENTAL RESULTS
We have implemented the algorithms for TS solution described in Section 4 within the tool for telescopic units synthesis of [2] . The logic synthesis framework we have exploited is SIS [14] which uses CUDD [15] as the underlying BDD package. Experiments have been run on a DEC AXP 1000/400 with 256 MB of main memory. We applied our technique to standard benchmarks as well as realistic high-performance arithmetic units. The results of our experiments are summarized in the next two sections.
Standard Benchmarks
We have considered all circuits in the Iscas '85 [6] suite with more than 1,000 gates. Since only six examples were available, we have also experimented with the combinational logic of the 12 largest Iscas '89 [7] (addendum included) benchmarks.
The library used for mapping consisted of 2- to 4-input NAND and NOR gates, plus inverters and buffers, each of which had five different driving strengths. The gates are nonsymmetric, that is, they have different pin-to-pin delays, as well as different rise and fall delays. The delay model used is the SIS real delay.
The original circuits were first optimized for speed using a modified version of script.delay, from which the full_simplify and, in some cases, the rr commands were removed to allow the optimization to complete on the large examples; they were then mapped for speed with either map -m1 or map -n1 -AFG. Table 1 reports the experimental data. Columns Circuit, I, O, G, $T$, and $P$ give the name, the number of inputs, outputs, and gates, the static delay (in nsec), and the throughput of the original circuit. Column $Prob(f_h)$ shows the probability of $f_h$, column $G^*$ gives the total number of gates of the telescopic unit, column $T^*$ reports the cycle time (in nsec) at which the telescopic unit is clocked to achieve the increased throughput of column $P^*$, and column $T(f_h)$ gives the (static) arrival time of the hold signal (in nsec). Columns $\Delta P$ and $\Delta G$ give the throughput improvement and the area overhead (in terms of gates) of the telescopic unit. The area overhead refers only to the additional logic implementing $f_h$; it does not consider the circuitry required to control the operations of the telescopic unit, which depends on the specific context in which the unit will be instantiated. Finally, column Time reports the CPU time (in sec) required to perform the automatic synthesis of $f_h$ for a given $T^*$. A symbol * beside the circuit name indicates that the heuristics of Section 4.2.3 were required to complete the calculation of $f_h$. This happened only on example s38417, where the computation of $f_h$ stopped after 21 cuts (out of 28).
Only two benchmarks are missing from the table: c6288 and s38584. The former is a 16 × 16-bit multiplier for which it is well known that the computation of the BDDs for all outputs is infeasible [16]. The application of the algorithm of Fig. 5 therefore failed; we thus resorted to the cutting heuristics. In this case too, however, the result was negative, since the computation of $f_h$ stopped after 41 cuts (out of 103) with $T^* = 0.95 \cdot 70.40$ nsec and $Prob(f_h) = 0.99999$. The application of our algorithm to the latter example, on the other hand, has not been tried since a mapped version of it could not be obtained for the selected gate library.
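A short calculation shows why such a result counts as negative. The sketch below assumes, as in the basic telescopic-unit scheme, that a computation takes two cycles when the hold signal is asserted and one cycle otherwise, so that the average throughput is $1 / (T^* (1 + Prob(f_h)))$; this formula is not stated in this section and is used here as an assumption. The function names are illustrative, and the c6288 figures are read as $T^* = 0.95 \cdot 70.40$ nsec and $Prob(f_h) = 0.99999$.

```python
def telescopic_throughput(t_star: float, prob_hold: float) -> float:
    """Average throughput, assuming one extra cycle whenever the hold signal is 1."""
    return 1.0 / (t_star * (1.0 + prob_hold))


def improvement(t: float, t_star: float, prob_hold: float) -> float:
    """Relative throughput change with respect to the original unit clocked at t."""
    return telescopic_throughput(t_star, prob_hold) * t - 1.0


# c6288 figures as read from the text above (70.40 nsec static delay).
c6288_gain = improvement(70.40, 0.95 * 70.40, 0.99999)

# Largest hold probability for which a 5 percent shorter cycle still pays off.
break_even = 1.0 / 0.95 - 1.0
```

Under this assumption the unit would lose roughly 47 percent of its throughput, because almost every input pattern requires the second cycle; a 5 percent shorter cycle pays off only when the hold probability stays below about 0.053.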
The results are quite satisfactory since an average throughput improvement of 14.1 percent has been achieved with an average area penalty of 6.9 percent. The proposed approach thus demonstrates its scalability and applicability to the largest available benchmarks. Needless to say, most of the circuits examined here are well beyond the capability of the exact MTS solution algorithm of [2], which, for this library and delay model, fails for circuits larger than a few hundred gates. On small circuits, for which exact MTS solution is possible, our tool still achieves improvements around 10-15 percent, while the exact minimum solution allows average improvements around 27 percent [2].
Arithmetic Units
In this section, we study the application of our technique to two complex arithmetic units. The purpose of this analysis is to show that our automatic transformation is applicable and useful not only on standard benchmarks, whose functionalities and architectures are uncertain, but also for carefully designed units that are used in real-life systems. We have considered two units belonging to the advanced mathematical library of Synopsys' DesignWare components, namely, a combinational multiplier-adder module (called DW02_prod_sum1) and a combinational sine function module (called DW02_sin). These components are hand-coded in synthesizable HDL (Verilog or VHDL) by expert designers and can be instantiated as black boxes in register-transfer level descriptions. Clearly, such library components are specifically designed for high performance, hence, they represent a good test for assessing the effectiveness of our paradigm in pushing throughput beyond the possibilities of standard synthesis techniques.
The multiplier-adder module implements the function $A * B + C$, where the width of the operands can be chosen at instantiation time. Furthermore, the unit has a control input TC, whose function is to select two's complement versus sign-magnitude representation for the data. The internal architecture of the multiplier is based on a fast carry-save array. For our experiment, we selected a 16-bit width for operands $A$ and $B$ and a 32-bit width for operand $C$ and for the output.
The sine module implements the function $C = \sin(A)$. Input angle $A$ is treated as a binary fixed-point number, which is converted to radians when multiplied by $\pi$. When $A$ is interpreted as unsigned, the input angle is a binary subdivision of the range $0 \le A < 2$. When $A$ is interpreted as signed (two's complement), the range is $-1 \le A < 1$. The value of the sine function is computed with a linear, quadratic, or cubic interpolation scheme, depending on the value of $A$. The bit widths of both the angle and the sine values are parameterized and are chosen at instantiation time. For our experiment, we set both widths to 16 bits.
The gate-level netlists of the two arithmetic units have been generated from the corresponding HDL descriptions using Synopsys DesignCompiler and then translated into the blif format. Two technology-dependent implementations for each unit have been obtained through logic optimization and technology mapping using SIS. In particular, circuits DW02_prod_sum1.a and DW02_sin.a are minimum area realizations, while circuits DW02_prod_sum1.d and DW02_sin.d are obtained from the min-area descriptions by applying delay optimization under area constraints (we allowed a 2X area increase).
Each description has been transformed into a telescopic unit, and Table 2 collects the results of the experiments. Throughput improvements range from 8.1 percent to 19.6 percent, while the area overheads are between 3.7 percent and 10.2 percent.
By direct inspection of the data in the table, it can be seen that telescopic units provide an area-effective way of improving system throughput. As an example, consider the telescopic version of circuit DW02_prod_sum1.a. Its throughput is approximately the same as that of the reference circuit DW02_prod_sum1.d (0.00715 versus 0.00760), but its area is substantially smaller (3,801 versus 5,201 gates).
CONCLUSIONS
We have addressed the timed supersetting problem and we have contributed an algorithm for its solution which is well-suited for the automatic synthesis of large telescopic units in the cases where complex and realistic gate delay models are adopted. Results obtained on the largest benchmarks available in the literature (i.e., the Iscas '85 and the Iscas '89 circuits), as well as on realistic, high-performance arithmetic units, are quite satisfactory and confirm that the use of telescopic units represents a robust and flexible alternative for improving the performance of delay-critical digital applications.
Antonio Lioy received the DrEng degree in electrical engineering from the Politecnico di Torino, Italy, in 1982, and the PhD degree in computer engineering from the Politecnico di Torino in 1987. From 1987 through 1989, he was a research assistant at the Politecnico di Torino, from 1990 through 1992, he was an assistant professor at the same institution, and, in 1993, he became an associate professor at the Università di Parma, Italy. Currently, he is an associate professor at the Politecnico di Torino. His research interests include simulation and testing of digital circuits and systems, as well as advanced networking technologies and computer security.
Enrico Macii received the DrEng degree in electrical engineering from the Politecnico di Torino, Italy, in 1990, the DrSc degree in computer science from the Università di Torino in 1991, and the PhD degree in computer engineering from the Politecnico di Torino in 1995. From 1991 through 1994, he was an adjunct faculty member at the University of Colorado at Boulder. Currently, he is an associate professor at the Politecnico di Torino. His research interests include synthesis, verification, simulation, and testing of digital circuits and systems. He received the Best Paper Award at the European Design Automation Conference in 1996. He was the technical program co-chair of the 1999 IEEE Alessandro Volta Memorial Workshop on Low Power Design. He is an associate editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Giuseppe Odasso received the DrEng degree in electrical engineering from the Politecnico di Torino, Italy, in 1994. Currently, he is working toward his PhD in computer engineering at the Politecnico di Torino. His research interests include synthesis, verification, simulation, and testing of digital circuits, with special emphasis on low-power and high-performance systems.
Massimo Poncino received the DrEng degree in electrical engineering in 1989 and the PhD degree in computer engineering in 1993, both from the Politecnico di Torino, Italy. From 1993 through 1994, he was a visiting faculty member at the University of Colorado at Boulder. Currently, he is an assistant professor at the Politecnico di Torino. His research interests include synthesis, verification, simulation, and testing of digital circuits and systems.
