The maximum mean weight cycle problem is well known: given a digraph G with weights c : E(G) → R, find a directed circuit in G whose mean weight is maximum. Closely related is the minimum balance problem: find a potential π : V(G) → R such that the numbers slack(e) := π(w) − π(v) − c((v, w)) (e = (v, w) ∈ E(G)) are optimally balanced: for any subset of vertices, the minimum slack on an entering edge should equal the minimum slack on a leaving edge. Both problems can be solved by a parametric shortest path algorithm.
Introduction
In this paper we consider the maximum mean cycle problem in digraphs, the minimum balance problem, and generalizations. We show how these problems apply to a major problem in the design of very large scale integrated (VLSI) circuits. We encounter the rare case that the original practical problem (with all constraints and without any simplification!) can be solved optimally by efficient algorithms.
A main goal in the design of a logic chip is maximizing its frequency. All computations on a chip are synchronized by certain storage elements which receive a clock signal periodically. The period is called the cycle time; its inverse is the frequency of the chip.
In each period each storage element (latch) stores one bit. In the next period a fixed logical function is evaluated; the inputs are the bits stored in the previous period and some external inputs. Some of the output bits of the logical function are forwarded to the exterior, the others are stored in the latches; they will be input to the function in the next period.
The beginning and end of a period for a certain latch are determined by a periodic clock signal arriving at that latch. The output bit of the function must arrive at this latch before the clock signal, and the propagation of this bit for computation in the next period begins when the clock signal has arrived.
The computation is done by a network of logical gates through which signals are propagated. If the clock signals arrive at all latches simultaneously, then the minimum possible cycle time is determined by the slowest computation, i.e. the longest path in the network. Here the length of a path is its propagation delay; this depends, among other things, on the number and type of gates and their positions. In this paper we assume these lengths to be fixed (already optimized).
Consider the very primitive example of Fig. 1: we have four latches (A, B, C, D) and seven paths between pairs of latches, each containing one or two gates. In this example there are no external inputs and outputs. For our purposes, the relevant information is shown in Fig. 2: the latches, the paths, and their lengths (which reflect the propagation time along the paths). From these numbers we see that the cycle time must not be shorter than 14 time units: this is the time the signal from D to A needs. So we have simultaneous clock signals at time 0, 14, 28, 42, and so on.
However, if we allow individual clock arrival times at the latches, we can do better: by having the clock signal arrive at latch B one unit earlier and at latch D four units earlier (than at A and C), we can achieve a cycle time of 10:
• Clock signal at latch A: 0, 10, 20, ...
• Clock signal at latch B: −1, 9, 19, ...
• Clock signal at latch C: 0, 10, 20, ...
• Clock signal at latch D: −4, 6, 16, ...
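The schedule above can be checked mechanically. The following minimal sketch uses only the two path lengths that the text quotes (D to A: 14, D to B: 11); the remaining five paths of Fig. 2 are not reproduced here, and the function name `late_slack` is an illustrative choice, not notation from the paper.

```python
# Cycle time and clock arrival offsets of the schedule listed above.
T = 10
x = {"A": 0, "B": -1, "C": 0, "D": -4}

# Only the two path lengths quoted in the text; the other five paths of
# Fig. 2 are deliberately left out because their lengths are not given here.
paths = {("D", "A"): 14, ("D", "B"): 11}


def late_slack(v, w, delay, x, T):
    """Time to spare when a signal launched at latch v must reach latch w
    before w's clock signal of the next period (hypothetical helper)."""
    return x[w] + T - x[v] - delay


slacks = {(v, w): late_slack(v, w, d, x, T) for (v, w), d in paths.items()}
print(slacks)
```

A nonnegative value means the signal arrives in time. With simultaneous clock signals (all offsets zero) and the same cycle time of 10, the path from D to A would be four units late.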
One easily checks that all data signals arrive in time: for example, the path from D to B has length 11, and the allowed time is 9 − (−4) = 13 units.
Observe that there exists a circuit in the "latch graph" of Fig. 2 which has mean weight 10. Indeed, the maximum mean weight of a circuit equals the optimum cycle time.
So the problem reduces to a maximum mean cycle problem: given a digraph G with edge weights, find a directed circuit in G whose mean weight (total weight divided by number of edges) is maximum. Karp [6] showed that this problem can be solved in O(nm) time, where n and m denote the number of vertices and edges, respectively. Although this is still the best theoretical bound, other algorithms such as the O(nm + n^2 log n) parametric shortest path algorithm of Young et al. [22] are faster in practice.
However, the above solution has a serious disadvantage: four out of seven paths have zero slack. If any of these propagation times is larger than estimated, the chip will no longer function correctly. To make the design more robust we prefer to have as large slacks as possible on as many paths as possible. For example, if we change the clock signal arrival times at latch C to −4, 6, 16, ..., the solution remains valid, but the paths incident to C now all have slack at least 4. We of course prefer such a solution.
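For readers who want to experiment with the maximum mean cycle problem stated above, here is a minimal sketch of Karp's recurrence, written for the maximum rather than the minimum. The example graph is hypothetical and is not the latch graph of Fig. 2; the function name and the vertex labelling 0..n-1 are assumptions made for illustration.

```python
from fractions import Fraction


def max_mean_cycle(n, edges):
    """Karp-style O(nm) computation of the maximum mean weight of a circuit.

    n     -- number of vertices, labelled 0..n-1
    edges -- list of (tail, head, weight) with integer weights
    Returns a Fraction, or None if the digraph is acyclic.
    """
    # F[k][v] = maximum weight of a walk with exactly k edges ending at v
    # (None if no such walk exists). Walks may start at any vertex.
    F = [[None] * n for _ in range(n + 1)]
    F[0] = [0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if F[k - 1][u] is not None:
                cand = F[k - 1][u] + w
                if F[k][v] is None or cand > F[k][v]:
                    F[k][v] = cand
    best = None
    for v in range(n):
        if F[n][v] is None:
            continue
        # Karp's formula with max and min exchanged for the maximum version.
        worst = min(Fraction(F[n][v] - F[k][v], n - k)
                    for k in range(n) if F[k][v] is not None)
        if best is None or worst > best:
            best = worst
    return best


# Hypothetical 4-vertex example: the circuit 0 -> 1 -> 0 has mean (9 + 11) / 2.
example = [(0, 1, 9), (1, 0, 11), (1, 2, 3), (2, 3, 5), (3, 0, 14), (3, 1, 11)]
print(max_mean_cycle(4, example))
```

Exact rational arithmetic is used so that the result can be compared without rounding issues.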
There are two ways of formalizing this concept: Let G be a digraph and c : E(G) → R. First, one could consider the vector of all slacks in nondecreasing order. We prefer one solution to another if this vector is lexicographically greater. In other words, we look for a potential π : V(G) → R such that the vector of slacks slack(e) := π(w) − π(v) − c((v, w)) (e = (v, w) ∈ E(G)) in nondecreasing order is lexicographically maximum.
ARTICLE IN PRESS
The second way of formalizing the concept is to consider the following necessary condition for a solution to be called optimal: for any latch, and in fact for any group of latches, the minimum slack on an entering edge should equal the minimum slack on a leaving edge:
\[ \min_{e \in \delta^-(X)} \mathrm{slack}(e) = \min_{e \in \delta^+(X)} \mathrm{slack}(e) \quad \text{for all } \emptyset \ne X \subset V(G), \tag{1} \]
where δ^-(X) and δ^+(X) denote the sets of edges entering and leaving X, respectively.
If this were not true, we could improve the solution by changing the arrival time of all latches in a group X violating (1) by some constant.
In fact, the two models are equivalent. If G is strongly connected, (1) is satisfied by a unique π : V(G) → R (up to addition of a constant); see [16]. Hence for this π the vector of slacks in nondecreasing order is lexicographically maximum. This problem is known as the minimum balance problem [16, 22].
We shall extend the above observations to a very general model comprising all situations on practical chips. In the general model minimization of the cycle time is not a pure maximum mean cycle problem, but the following more general problem:
given a digraph G with edge weights, and a partition of the edge set into red and green edges, find a directed cycle in G maximizing the total weight divided by the number of red edges. This is a special case of the maximum ratio cycle problem (all edge times are 0 or 1), which can be solved in O(min{n^3 log^2 n, n^3 log n + mn log^2 n log log n}) time [11].
Moreover, the slack balancing problem is not a pure minimum balance problem, but the following (more general) problem: given a digraph G with edge weights, and a partition of the edge set into red and green edges, find a potential π : V(G) → R such that each green edge has nonnegative slack and the vector of slacks of red edges, in nondecreasing order, is lexicographically maximal.
We show that these problems (and extensions discussed in Section 6) can also be solved with a parametric shortest path algorithm in O(nm + n^2 log n) time.
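The balance condition from the introduction (for every proper nonempty vertex set, the minimum slack on an entering edge equals the minimum slack on a leaving edge) can be tested by brute force on tiny instances. The sketch below is only meant to make the definition concrete; it enumerates all subsets, assumes a strongly connected digraph without parallel edges, and uses a hypothetical three-vertex example.

```python
from itertools import combinations


def slacks(edges, pi):
    """slack(e) = pi(w) - pi(v) - c((v, w)) for every edge e = (v, w)."""
    return {(v, w): pi[w] - pi[v] - c for v, w, c in edges}


def is_balanced(vertices, edges, pi):
    """Brute-force check of the balance condition (exponential in |V|).

    Assumes the digraph is strongly connected, so every proper nonempty
    subset has at least one entering and one leaving edge.
    """
    s = slacks(edges, pi)
    for size in range(1, len(vertices)):
        for subset in combinations(vertices, size):
            X = set(subset)
            entering = [s[v, w] for v, w, _ in edges if v not in X and w in X]
            leaving = [s[v, w] for v, w, _ in edges if v in X and w not in X]
            if min(entering) != min(leaving):
                return False
    return True


# Hypothetical example: a directed triangle with weights 1, 2, 3.
triangle = [(0, 1, 1), (1, 2, 2), (2, 0, 3)]
print(is_balanced([0, 1, 2], triangle, {0: 0, 1: -1, 2: -1}))  # all slacks equal
print(is_balanced([0, 1, 2], triangle, {0: 0, 1: 0, 2: 0}))
```

On the triangle the balanced potential gives every edge the same slack, which is the negative of the mean weight of the circuit.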
This paper is organized as follows. After introducing the problem of finding an optimum clocking schedule with a very simple model in Section 2, a general formulation is developed in Section 3. In Section 4, it is shown that the problem reduces to a maximum mean weight cycle problem in a digraph. This can be solved by the algorithm of Young et al. [22], of which an outline is given.
Then we develop an algorithm which takes additional objectives into account. In Section 5 we first consider slacks on signal paths. We obtain a solution where as few signal paths as necessary are critical. In Section 6 it is shown how this algorithm can be modified to increase the slacks on the clocktree paths. Moreover, slacks on signal paths and clocktree paths can be maximized simultaneously.
Computational results with recent IBM processor chips in Section 7 demonstrate the power of our method. The cycle time of the chips is improved by between 2.5 and 5.5%. Moreover, the number of critical signal and clocktree paths (with zero or small positive slack) decreases substantially. The running time of the algorithm is reasonable even for latch graphs with several million edges.
Let G be the graph with V(G) = P ∪ Q ∪ S whose edges correspond to signal paths: there is an edge (v, w) if the output w of B depends on the input v, i.e. if B(z)_w ≠ B(z')_w for some vectors z, z' ∈ {0, 1}^(P∪S) with z_u = z'_u for all u ∈ (P ∪ S) \ {v}. (Subscripts denote component vectors.)
To be precise, there might be a path from v to w although w does not depend on v. Such so-called false paths can be ignored if they are detected. However, it is often impossible to detect all false paths (this is a coNP-hard problem).
For each path from v to w the propagation time of the signal from v to w may vary due to process variations, temperature etc. So we have bounds t^min_vw and t^max_vw for the minimum and maximum propagation time over all paths from v to w. Since the network without the latches is acyclic, the graph G and the bounds on the propagation times can be computed by simple forward propagation in topological order.
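The forward propagation mentioned above can be sketched as follows. This is an illustrative sketch, not the timing analysis program used by the authors: the data layout (a dictionary of per-edge minimum and maximum delays), the function name `propagate` and the tiny gate network are all assumptions. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def propagate(delays, sources):
    """Forward propagation in topological order on an acyclic gate network.

    delays  -- dict mapping an edge (u, v) to (d_min, d_max)
    sources -- set of start nodes (primary inputs and latch outputs)
    Returns a dict mapping (source, node) to (t_min, t_max) over all paths.
    """
    preds = {}
    for u, v in delays:
        preds.setdefault(v, set()).add(u)
        preds.setdefault(u, set())
    reach = {}  # node -> {source: (t_min, t_max)}
    for node in TopologicalSorter(preds).static_order():
        best = {node: (0, 0)} if node in sources else {}
        for u in preds[node]:
            d_min, d_max = delays[u, node]
            for s, (lo, hi) in reach[u].items():
                new = (lo + d_min, hi + d_max)
                old = best.get(s)
                best[s] = new if old is None else (min(old[0], new[0]),
                                                   max(old[1], new[1]))
        reach[node] = best
    return {(s, node): t
            for node, by_source in reach.items()
            for s, t in by_source.items() if s != node}


# Hypothetical network: input "a", gates "g1" and "g2", sink "out".
delays = {("a", "g1"): (1, 2), ("a", "g2"): (1, 1),
          ("g1", "g2"): (1, 3), ("g2", "out"): (1, 1)}
bounds = propagate(delays, {"a"})
print(bounds[("a", "out")])
```

Each node keeps, per source that reaches it, the smallest and largest accumulated delay, so one pass over the nodes in topological order suffices.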
The chip works correctly if every signal arrives in time:
\[ x_v + t^{\max}_{vw} \le x_w + T \quad \text{for all } (v, w) \in E(G), \tag{2} \]
and no signal arrives too early (i.e. during the previous cycle):
\[ x_v + t^{\min}_{vw} \ge x_w \quad \text{for all } (v, w) \in E(G). \tag{3} \]
Here T is the cycle time and x_s (s ∈ S) are variables, while the x_v for v ∈ P ∪ Q are given constants.
In future technologies further disadvantages of the zero-skew approach will also become important. If all storage elements switch at the same time, a very large capacitance has to be loaded simultaneously. The effect is that the supply voltage can fluctuate considerably. Moreover, crosstalk on parallel wires, especially in the clocktree, is a more serious problem with the zero-skew approach. Both effects lead to unpredictable timing behaviour and can cause the chip to fail.
How can we choose the numbers x_s optimally? Let us ignore (3) for a moment.
Then minimization of the cycle time T is quite easy: contract the set P ∪ Q in G to a special vertex r. In the resulting graph G' we define edge weights c((v, w)) := t
Proof. Let T and x_s (s ∈ S) be a solution of (2), and let C be any circuit in G', say with vertices v_0, v_1, ..., v_k = v_0 in this order. Setting x(r) := 0 we have
\[ \sum_{i=1}^{k} c((v_{i-1}, v_i)) \le \sum_{i=1}^{k} \bigl( x(v_i) - x(v_{i-1}) + T \bigr) = kT, \]
so the mean weight of C is at most T.
On the other hand, if T is the maximum mean weight of a circuit in G', define c'(e) := T − c(e). There is no negative circuit in G' with respect to c', so we can compute a shortest path potential :
for all latches v we obtain a solution satisfying (2).
C. Albrecht et al. / Discrete Applied Mathematics ( )
So the problem reduces to a maximum mean cycle problem, which is well solved [6, 22].
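The constructive half of this reduction can be sketched in a few lines: for a given cycle time T, run Bellman-Ford with respect to the costs T − c(e) and derive arrival times from the resulting potential. This is one possible way to obtain a feasible solution, under the assumption that the late-mode requirement for an edge (v, w) of weight c reads x_v + c ≤ x_w + T; the function name and the two-latch example are hypothetical.

```python
def arrival_times(vertices, edges, T):
    """Return x with x[v] + c <= x[w] + T for every edge (v, w, c),
    or None if the cycle time T is too small.

    Bellman-Ford from a virtual source at distance 0 to every vertex,
    using the edge costs T - c.
    """
    pi = {v: 0 for v in vertices}
    for _ in range(len(vertices)):
        changed = False
        for v, w, c in edges:
            if pi[v] + (T - c) < pi[w]:
                pi[w] = pi[v] + (T - c)
                changed = True
        if not changed:
            # pi[w] <= pi[v] + T - c holds for all edges, so x = -pi
            # satisfies x[v] + c <= x[w] + T.
            return {v: -pi[v] for v in vertices}
    # Still relaxing after |V| rounds: negative circuit w.r.t. T - c.
    return None


# Hypothetical latch graph: one circuit of mean weight (9 + 11) / 2 = 10.
latches = ["A", "B"]
signal_paths = [("A", "B", 9), ("B", "A", 11)]
print(arrival_times(latches, signal_paths, 10))
print(arrival_times(latches, signal_paths, 9))
```

A cycle time below the maximum mean weight of a circuit makes some circuit negative with respect to T − c, which is exactly the case in which the function returns None.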
The general problem
To formulate the general problem we have to take a closer look at the storage elements. The clock input of a storage element s ∈ S has the value 1 in the time from a_s to b_s, then again from a_s + T to b_s + T, from a_s + 2T to b_s + 2T and so on. In the remaining time it has the value 0.
When the clock input has the value 1, the storage element is open, i.e. it stores the value currently seen at the data input. When the clock input has value 0, the stored bit remains unchanged. (Sometimes the roles of 0 and 1 are interchanged, but this does not matter.) At any time, the stored bit is available at the data output for subsequent computations.
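As a small illustration of the periodic clock window described above, the following sketch decides whether a storage element is open at a given time. The function name is hypothetical, and the sketch assumes 0 ≤ b − a < T.

```python
def clock_is_open(t, a, b, T):
    """True if the clock input is 1 at time t, i.e. if t lies in one of
    the windows [a + k*T, b + k*T] for some integer k.

    Assumes 0 <= b - a < T. Python's % returns a value in [0, T) for T > 0,
    also for negative arguments.
    """
    return (t - a) % T <= b - a


# Hypothetical element with window [0, 3] and cycle time 10.
print([t for t in range(-10, 15) if clock_is_open(t, 0, 3, 10)])
```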
To model this situation we need two variables per storage element instead of just one. We may shift the clock input for each s ∈ S by some value y_s, meaning that the clock input has the value 1 in the time intervals [a_s + y_s, b_s + y_s], [a_s + y_s + T, b_s + y_s + T], [a_s + y_s + 2T, b_s + y_s + 2T], and so on.
Since the shifting times y_s have to be realized by a clocktree, it is reasonable to impose a lower bound l_s and an upper bound u_s on y_s for each storage element s:
\[ l_s \le y_s \le u_s \quad (s \in S). \tag{4} \]
Moreover, we have a variable x_s for the time when the data signal is valid at the data input of s. Of course, we must have
and the data signal should remain valid within the whole interval [x_s, b_s + y_s]. For primary inputs and outputs v we set y_v = 0 and a_v = b_v = x_v (this value corresponds to the given constant for v in the previous section).
A data signal might encounter more than one storage element per cycle (for example in designs using transparent latches). So for each signal path from v to w it has to be specified whether a signal starting at v within the time interval [a_v, b_v] must arrive before b_w (i.e. in the same cycle) or before b_w + T (i.e. in the next cycle). In the first case the constant associated with the path (v, w) is set to 0, in the second case it is set to 1.
Then the late mode constraints read as follows:
where G is defined as in the previous section. For the early mode constraints we have to take into account the whole intervals in which the storage elements are open:
This describes a quite general model. In the simple model of Section 2 we considered only the case where a_s = b_s for all s ∈ S and where the constant defined above equals 1 for all (v, w) ∈ E(G). Although it is technically impossible to generate a clock signal which is 1 at a single point of time only, the simple model has some practical relevance: a commonly used storage element is the so-called master-slave latch. Fig. 3 shows two master-slave latches with a signal path in between. A master-slave latch has two clock inputs, one data input and one data output. It works as if it consisted of two simple storage elements, where the data output of the first one is connected to the data input of the second one. Moreover, the clock input of the second part is roughly the inverse of the clock signal for the first part (see Fig. 4).
As long as the first clock signal C_1 is 1, the value arriving at the data input is stored. If the first clock signal is 0, it does not change anymore. Now if the second clock signal C_2 is 1, the stored bit is visible at the data output, and the new computations begin. If the falling edge of C_1 arrives at the same time as the rising edge of C_2, then such a master-slave latch behaves like a simple latch which is open only at one point of time. However, it is also possible to have two independent clock signals arriving at the latch. This may of course lead to better solutions. Moreover, it is possible to have combinational logic (without storage elements) in between; this common technique is known as cycle stealing [14, 19, 10, 9].
In the above general model a master-slave latch can be represented simply by two simple storage elements s and s'. Then we shall usually have t^min_ss' = t^max_ss' = 0. This may be used to eliminate the variable x_s' (set it to max{x_s, a_s' + y_s'}). However, this increases the number of constraints. It is not a priori clear whether this substitution is computationally favorable or not; indeed this depends on the structure of the design.
In the next section we show how to efficiently solve the linear program defined above, i.e. minimize the cycle time T subject to constraints (4), (5), (6) and (7). After that we shall distribute slacks on signal and clock paths optimally.
Computing the optimal cycle time
Observe that each of the inequality constraints (4), (5), (6) and (7) has one or two x- or y-variables and in addition possibly the special variable T. If a constraint has two x- or y-variables, they have opposite sign. To have exactly two variables per inequality we introduce an artificial variable z_0 (corresponding to r in Section 2) which we assume to have value zero. For technical reasons we substitute λ := −T and obtain a linear program of the following very special type:
\[
\begin{aligned}
\max\ & \lambda \\
\text{s.t.}\ & z_i + c_{ij} \ge z_j && \text{for } (i, j) \in E_1, && (8) \\
& z_i + c_{ij} - \lambda \ge z_j && \text{for } (i, j) \in E_2, && (9)
\end{aligned}
\]
where λ and z_0, z_1, z_2, ..., z_n are variables, the c_ij are constants and E_1, E_2 ⊂ {0, ..., n} × {0, ..., n}. Each constraint (4) corresponds to two elements (i, 0), (0, i) of E_1, each constraint (5) corresponds to one element of E_1, and each of the constraints (6) and (7) corresponds to an element of E_1 or E_2, depending on the 0/1 constant of the signal path.
Note that assuming z_0 = 0 causes no loss of generality since adding a constant to all variables z_i does not affect feasibility.
Now we translate our linear optimization problem to a network problem. Given the above linear program we construct a directed graph G = (V, E) as follows: for each variable z_i there is a vertex v_i. For a constraint of type (8) we have a directed edge from vertex v_i to vertex v_j of cost c_ij. For a constraint of type (9) we also have an edge from v_i to v_j, but the cost is c_ij − λ. Such an edge is called parameterized: the cost of the edge depends on the parameter λ.
As an example, Fig. 5 shows the vertices and edges of G for two master-slave latches with a signal path in between (from left to right). Each master-slave latch consists of two simple latches, and for each simple latch we have an x-variable (on top) and a y-variable. So there are four variables (vertices) for each master-slave latch; the artificial variable z_0 is represented by the vertex at the bottom. There are eight edges for constraints of type (4), eight for type (5), three edges for type (6) and three for type (7). Three of the edges are parameterized.
We are looking for the maximum value for λ such that values for the variables z_i exist which fulfill all constraints. It is easy to see that such values exist if and only if the digraph G does not contain a directed circuit of negative cost (negative circuit, for short). This is proved similarly to Proposition 2.1; see [18, 17, 19, 2].
In fact, our problem can be formulated as follows: given a digraph G with edge weights, and a partition of the edge set into red and green edges, find a directed circuit in G minimizing the total weight divided by the number of red edges.
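To make the parametric formulation concrete, here is a minimal sketch that finds the largest parameter value for which the digraph has no negative circuit, using bisection with a Bellman-Ford test. This is deliberately the simple approach, not the parametric shortest path algorithm of Young et al.; the edge representation (a boolean flag marking parameterized edges), the function names, the tolerance and the example are assumptions made for illustration.

```python
def has_negative_circuit(n, edges, lam):
    """Bellman-Ford test. edges are (i, j, c, parameterized); a
    parameterized edge has cost c - lam, any other edge has cost c."""
    dist = [0.0] * n  # virtual source at distance 0 to every vertex
    for _ in range(n):
        changed = False
        for i, j, c, parameterized in edges:
            cost = c - lam if parameterized else c
            if dist[i] + cost < dist[j] - 1e-12:
                dist[j] = dist[i] + cost
                changed = True
        if not changed:
            return False
    return True


def max_parameter(n, edges, lo, hi, tol=1e-9):
    """Bisection for the largest lam without a negative circuit.

    Requires: no negative circuit at lam = lo, a negative circuit at lam = hi.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if has_negative_circuit(n, edges, mid):
            hi = mid
        else:
            lo = mid
    return lo


# Hypothetical example: one circuit with two parameterized (red) edges of
# total weight 8, so the circuit cost 8 - 2*lam is nonnegative iff lam <= 4.
red_green = [(0, 1, 3, True), (1, 0, 5, True)]
print(round(max_parameter(2, red_green, -8.0, 9.0), 6))
```

The returned value is an approximation of the minimum ratio of total weight to number of red edges over all circuits; the cycle time is its negative.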
The above LP with constraints (8) and (9) might be infeasible in some cases (if there is a negative circuit consisting of unparameterized edges only). Our algorithm detects infeasibility and returns the negative circuit(s) causing the problem. Usually one can cope with this by omitting some early mode constraints: these can be met by inserting buffers (increasing the delay of the path). In fact it is usually not good to take all early mode constraints into account because this might increase the optimum value of the LP (hence the cycle time); one usually prefers inserting a buffer. As we describe in Section 7, we incorporate early mode constraints only after having determined the best possible cycle time. In the following we assume that the LP is feasible.
Previous authors solved this LP either by linear programming [4, 14] or by binary search with a subroutine testing for a negative circuit [18, 17, 19, 2, 13]. However, due to its special structure the problem can be solved more efficiently by a direct combinatorial algorithm of Young et al. [22]. We briefly describe their algorithm (which was originally designed for parametric shortest paths) since we shall extend it in Sections 5 and 6. For a detailed description and an efficient implementation see also
The algorithm computes a sequence −∞ = λ_0 ≤ λ_1 ≤ ··· ≤ λ_k of values for the parameter λ and a sequence T_1, ..., T_k of shortest paths trees in G from a specified vertex r, the root (in our case we can take the vertex corresponding to the artificial variable z_0: all vertices are reachable from this vertex), such that T_i is a shortest paths tree for all parameters λ with λ_{i−1} ≤ λ ≤ λ_i (i = 1, ..., k). The last value λ_k will be the solution of the linear program.
We start by computing a shortest paths tree for λ = −∑_{e∈E} |c(e)|. This value of λ is small enough such that no negative circuit (i.e. directed circuit of negative total cost) exists. The resulting tree is T_1.
Assuming that T_i is already computed we show how to compute λ_i and T_{i+1}. Let P_rv be the path from r to v in T_i. We check for each edge e = (u, v) whether the path P_ru + e contains more parameterized edges than P_rv. If so, P_ru + e is a potential pivot path and e is a potential pivot edge. For some value λ_e the path P_ru + e will be shorter than P_rv. λ_i is the minimum value λ_e over all potential pivot edges e. One edge e = (u, v) with the minimum value λ_e becomes the pivot edge. We perform a pivot step by deleting the edge with head v from T_i and inserting edge e. The resulting tree is T_{i+1}.
If adding the pivot edge e results in a directed circuit, the algorithm stops. The cost of this directed circuit is zero for λ_e, and λ_e = λ_k is the maximum value of λ such that G contains no negative circuit.
The last tree T_k also provides a solution for the variables z_i: one can set z_i to the cost of the path P_{r v_i} in T_k for λ = λ_k. This solution is also called a shortest paths potential.
The worst-case running time of this algorithm (with an efficient implementation) is O(nm + n^2 log n) where n = |V(G)| and m = |E(G)|. However, it is much faster in practice, as the experimental results will demonstrate.
We showed how to determine the clocking schedule with the optimum cycle time. However, the solution obtained so far has a serious drawback. Many inequalities of the linear program (in particular all whose corresponding edges belong to the tree T_k) are satisfied with equality. If such a tight inequality corresponds to a signal path, then this path will be critical, i.e. the slack is zero. In the next section we show how the slack can be increased for many critical signal paths. In Section 6 we show how to increase slack on clocktree paths optimally.
We should note that the above method can also be used for static timing analysis with transparent latches, without changing clock arrival times:
Balancing slacks on signal paths
Having computed the optimal cycle time T subject to the constraints described in Section 3, we now increase the slack on the signal paths. This is very important: first of all, at the time when the clock schedule is optimized, the propagation delays can only be estimated. Positive slacks on most paths make the chip less sensitive to routing detours, process variations and manufacturing skew. Moreover, if one tries to optimize the cycle time further (e.g. by different logic implementation or placement), only few paths have to be considered.
The task is to maximize the late-mode and early-mode slack variables for as many signal paths as possible such that there is a solution for the linear inequalities (4), (5), (6') and (7') for a given cycle time T.
To make this precise we introduce the following partial order relation on the set of all solutions:
Definition 5.1. Let (a_1, ..., a_k) and (b_1, ..., b_k) be two vectors (of slack variables). Let σ be a permutation on {1, ..., k} such that a_σ(1) ≤ a_σ(2) ≤ ··· ≤ a_σ(k) and τ be a permutation such that b_τ(1) ≤ b_τ(2) ≤ ··· ≤ b_τ(k). We say that solution (a_1, ..., a_k) is better than solution (b_1, ..., b_k) if (a_σ(1), ..., a_σ(k)) is greater than (b_τ(1), ..., b_τ(k)) in lexicographic order, i.e. if there exists an l, 1 ≤ l ≤ k, such that a_σ(i) = b_τ(i) for i < l and a_σ(l) > b_τ(l).
The slack balancing problem consists of finding a solution of (4), (5), (6') and (7') such that the vector of all slack variables is best possible.
In other words, we look for a solution of (4), (5), (6') and (7') such that the vector of all slack variables in nondecreasing order is lexicographically maximal.
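The comparison of Definition 5.1 is easy to state in code, because Python compares tuples lexicographically. The sketch below applies it to the slack values of the four-path example given in this section; the helper names are illustrative.

```python
def sorted_slacks(late, early):
    """All slacks of a solution in nondecreasing order, regardless of
    whether a slack belongs to a late-mode or an early-mode constraint."""
    return tuple(sorted(late + early))


def is_better(a, b):
    """True if sorted slack vector a is lexicographically greater than b."""
    return a > b  # tuple comparison in Python is lexicographic


# Slack values of the four-path example in this section.
first = sorted_slacks([0, 1, 3, 0], [2, 3, 1, 0])
second = sorted_slacks([0, 2, 2, 0], [5, 2, 1, 3])
print(first, second, is_better(second, first))
```

The two vectors agree in their first two entries and differ in the third, where the second solution has the larger slack.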
The following example illustrates this definition: suppose we have four signal paths, one solution with late-mode slacks 0, 1, 3, 0 and early-mode slacks 2, 3, 1, 0, and another solution with late-mode slacks 0, 2, 2, 0 and early-mode slacks 5, 2, 1, 3. We sort both solutions by increasing slack, regardless of the slack being for late mode or early mode: (0, 0, 0, 1, 1, 2, 3, 3) for the first solution and (0, 0, 1, 2, 2, 2, 3, 5) for the second solution. The second solution is better.
It will be shown that any optimum solution for the slack balancing problem has the same vector of slack variables. We now describe an algorithm which finds this solution. It proceeds as follows:
The slacks of all signal paths are increased simultaneously until they cannot be increased anymore, i.e. some constraints, which form a directed circuit, are already tight. Then we take the subset of all signal paths on which the slack can still be increased and continue to increase the slack on these signal paths.
The same digraph G = (V, E) as in Section 4 is constructed, but the costs are different. The optimal cycle time T is already computed and should not change; it becomes part of the cost of the respective edges.
The parameterized edges are now those which correspond to constraints containing a late-mode or early-mode slack variable (i.e. (6') and (7')); all other edges are not parameterized. The parameter λ now represents the slack of all signal paths.
We first compute again the maximum value λ such that G contains no negative circuit, using the parametric shortest path algorithm described in Section 4. This value is zero (if T was the optimal cycle time), and so is the slack of all signal paths for which the corresponding edges belong to the zero cost directed circuit C found by the algorithm. Increasing the parameter of any of the parameterized edges on C is impossible, because it would result in a negative circuit. All edges on C lose their parameter; only the parameter of all other edges is increased.
C is contracted, and the costs of the edges leaving and entering C are adjusted: let z be the vertex to which C is contracted, and let w be the vertex of C nearest to the root r in the last tree T_k computed. For a vertex v of C denote by c(P_wv) the cost of the path from w to v in T_k for parameter λ_k.
For an edge e = (v, u) leaving C, i.e. v belongs to C but u does not, the corresponding new edge e' = (z, u) after the contraction gets the cost c(e') = c(e) + c(P_wv). For an edge e = (u, v) entering C, the corresponding new edge e' = (u, z) gets the cost c(e') = c(e) − c(P_wv).
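The cost adjustment for the contraction can be sketched as follows. The sketch works on plain numeric costs (evaluated at the current parameter value) and omits the bookkeeping of which edges are still parameterized; dropping every edge with both ends on the circuit is an assumption of this sketch, and all names are illustrative.

```python
def contract_circuit(edges, circuit, path_cost, z):
    """Contract the vertices of a zero-cost circuit C into a new vertex z.

    edges     -- list of (tail, head, cost)
    circuit   -- set of vertices on C
    path_cost -- path_cost[v] = c(P_wv), cost of the tree path from w to v
    z         -- name of the new vertex
    """
    new_edges = []
    for tail, head, cost in edges:
        tail_in, head_in = tail in circuit, head in circuit
        if tail_in and head_in:
            continue  # assumption: edges inside C vanish with the contraction
        if tail_in:   # edge leaving C: c(e') = c(e) + c(P_wv)
            new_edges.append((z, head, cost + path_cost[tail]))
        elif head_in:  # edge entering C: c(e') = c(e) - c(P_wv)
            new_edges.append((tail, z, cost - path_cost[head]))
        else:
            new_edges.append((tail, head, cost))
    return new_edges


# Hypothetical example: circuit on {"w", "v"} with c(P_ww) = 0, c(P_wv) = 2.
edges = [("w", "v", 2), ("v", "w", -2), ("v", "u", 5), ("r", "v", 7), ("r", "u", 1)]
contracted = contract_circuit(edges, {"w", "v"}, {"w": 0, "v": 2}, "z")
print(contracted)
```

In the example the path from r via v to u costs 7 + 5 = 12 before the contraction and 5 + 7 = 12 afterwards, so a path that enters and leaves the circuit at the same vertex keeps its cost.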
The algorithm continues to increase the parameter and to change the tree such that it remains a shortest paths tree until the next directed circuit of zero cost is found. The value of the parameter at this state is again the slack of the signal paths for which the corresponding edges belong to the directed circuit.
We can now prove that the solution computed by this algorithm is the optimum solution with respect to Definition 5.1:
Theorem 5.2. The algorithm described above finds an optimum solution to the slack balancing problem. Moreover, any optimum solution for the slack balancing problem has the same vector of slack variables.
Proof. Whenever the algorithm finds a directed circuit of zero cost, the parameters of all parameterized edges of the circuit have the same value and so all the slacks of the corresponding signal paths are equal. Increasing the slack of one of these signal paths is only possible by decreasing the slack of another signal path, but this would result in a solution which is worse.
The algorithm presented here is a modification of the algorithm for the minimum balance problem described by Young et al. [22]. For the minimum balance problem all edges of the directed graph G are parameterized. Here only some edges are parameterized, namely those edges which correspond to constraints of signal paths with slack variables.
This is also the reason why we speak of balancing the slacks: the slacks are increased and distributed "equally" on the signal paths.
With an efficient implementation the worst-case running time of the algorithm described above is O(nm + n^2 log n).
But note that it is not necessary to run the algorithm to the very end. The algorithm can be stopped at any time, e.g. when a certain value of the parameter is reached. Then the slacks of the signal paths are only increased up to this value. Since slacks exceeding a certain amount are usually not interesting, this option is used in practice; see Section 7.
times are much easier to realize and a smaller cycle time can be achieved (see also [13]).
Clocktrees with prescribed skews can be designed by basically the same algorithm as zero-skew clocktrees. Although this can be done quite efficiently, prescribed skews (zero or not) make detours in the clocktree wiring necessary. If one has intervals for the arrival times of the clock signals, one can design clocktrees with significantly smaller wirelength; see e.g. [5]. We show how to achieve large intervals for as many latches as possible.
The problem is solved similarly to the slack balancing problem for signal paths (Section 5), but some modifications are needed since inequality (7 ) contains the slack variables of two different latches v and w which are connected by a signal path.
For each constraint (7 ) we introduce an additional variable m_vw and split the inequality into two:
Now each inequality contains at most one slack variable, and if it contains the slack variable of a latch s, then it also contains the corresponding variable y_s.
As before we construct a digraph G with a vertex for each variable and an edge for each inequality. An edge is parameterized if and only if the corresponding inequality contains a slack variable. But in contrast to the problem of balancing slacks on signal paths (described in Section 5) we now have several inequalities with the same slack variable.
We use a similar algorithm to that of the previous section. Now it might happen that when a directed zero cost circuit C is found and contracted there is an edge e on C and an edge f not belonging to C with the same slack variable. In this case the value of this slack variable is already determined; the parameter of f must not increase anymore. Edge f (and other edges with the same slack variable) get the cost which they have with the current value of the parameter; they are no longer parameterized.
Observe that all edges which lose their parameter have at least one vertex belonging to C, namely the vertex corresponding to the variable y_s. The costs of these edges are adjusted with the contraction of the directed circuit, hence the running time of the algorithm does not change.
In Section 5 we have balanced the slacks on the signal paths, in this section those on clocktree paths. We finally show that it is also possible to balance the slacks on signal paths and clocktree paths simultaneously. Constraint (7) is substituted by
We look for a solution of constraints (4), (5 ), (6 ) and (7 ). For each constraint (7 ) we introduce two additional variables m_vw and m'_vw and replace the constraint by the following three inequalities:
Then we apply the same algorithm as above. We obtain:
Theorem 6.1. The slack balancing problem for signal and clocktree paths (constraints (4), (5 ), (6 ) and (7 )) has a unique optimum solution which can be computed in time O(nm + n^2 log n), where n is the number of primary inputs, primary outputs and latches and m is the number of signal paths.
Proof. It can be derived in the same way as in the proof of Theorem 5.2 that the optimum solution is unique and that the algorithm finds it.
In order to see that the running time is of the given order, observe that the new vertices added for the constraints have only one incoming edge. Such an edge can never be exchanged during the parametric shortest path algorithm unless it is contracted.
By the results of Young et al. [22] it is sufficient to show that the total number of pivot steps is O(n^2). During the algorithm let X be the set of all original vertices in G (not those resulting from subdividing edges) plus all vertices which have emerged by contraction. For each vertex v ∈ X consider φ(v) = 5|X| − p(v), where p(v) is the number of parameterized edges on P_rv. φ(v) is positive and bounded by O(n), as is |X|.
At each pivot step there is at least one vertex v ∈ X for which p(v) increases, hence ∑_{v∈X} φ(v) strictly decreases.
Finally, we show that ∑_{v∈X} φ(v) does not increase due to contraction. If a directed circuit is contracted, then p(v) can decrease by at most two (for incoming and leaving edges of the directed circuit) plus three times the number by which |X| decreases (the number of vertices in X on the directed circuit minus one). This is compensated by the term 5|X| in φ(v). Moreover, for each new vertex z which enters X by the contraction there is at least one vertex v with φ(z) ≤ φ(v) which leaves X.
Hence the expression ∑_{v∈X} φ(v) never increases and strictly decreases with each pivot step. Since it is O(n^2) initially and nonnegative throughout, the number of pivot steps is O(n^2).

Computational results
We have implemented the algorithm in C; all runs are on an IBM RISC System/6000 Model 595. Our algorithm has been applied, among others, to the G3 series of IBM S/390 processor chips (L2 and PU) and the latest follow-ups (MBA). For details of the design system see Koehl et al. [8] and Kick et al. [7]. Table 1 shows the different chips with target cycle time, number of circuits, nets, pins and primary inputs and outputs (IOs, some of which are bidirectional) and the number of signal paths. See also Fig. 10 for a placement of the MBA chip.

In addition to the constraints described so far, further technical restrictions had to be taken into account. For example, for some master-slave latches it is required that the data signal arrives at the latch before the rising time of the clock signal of the slave latch (end-of-cycle test). In this case one has a constraint of the form … vertices and edges of the graph G in Table 2 to the numbers which one would expect by Table 1.

Our program consists of two main parts. First, the constraints are generated by simple forward propagation for late mode and early mode constraints. During the propagation we store at each circuit the set of all primary inputs and latches from which this circuit can be reached, along with the maximum propagation delay for late mode and the minimum propagation delay for early mode, respectively. Table 3 shows the running times for the generation of all constraints for the three different chips. A detailed description of the timing analysis program can be found in Schietke [15]. The running times for simple propagation (computation of arrival times and slacks only) are shown for comparison.

The second step consists of the main optimization algorithm. Rather than increasing the slacks on late mode and early mode constraints simultaneously, we first increase the slacks on late mode constraints up to a certain value $\delta_{\text{late}}$ while ignoring all early mode constraints.
Then we add all early mode constraints and increase the slack on these constraints up to $\delta_{\text{early}}$. For this the late mode constraints are added with unparameterized edges, such that the slack of late mode constraints with slack smaller than $\delta_{\text{late}}$ does not decrease and the slack of all other late mode constraints remains at least $\delta_{\text{late}}$. The reason for treating late and early mode constraints differently is that early mode problems can usually be fixed quite efficiently by inserting a buffer (= 2 inverters).

Finally, we balance the slacks on clocktree paths up to a value $\delta_{\text{clock}}$ as described in Section 6. Again it is assured that the slacks of late and early mode constraints do not decrease below the value to which they were optimized.
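The first part of the program, the forward propagation that records for each circuit the reachable sources together with the maximum (late mode) and minimum (early mode) delay, can be illustrated by a minimal Python sketch. This is not the authors' C implementation: the function name `propagate_sources`, the edge-triple input format and the toy netlist are hypothetical, and the sketch assumes an acyclic netlist in which sources have no incoming edges.

```python
from collections import defaultdict, deque


def propagate_sources(edges, sources):
    """Forward propagation over an acyclic netlist (illustrative sketch).

    edges:   iterable of (u, v, delay) triples (hypothetical input format)
    sources: primary inputs and latches where propagation starts;
             assumed to have no incoming edges
    Returns {node: {source: (max_delay, min_delay)}}.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set(sources)
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))

    # reach[v][s] = (longest, shortest) delay over all paths from source s to v
    reach = {v: {} for v in nodes}
    for s in sources:
        reach[s][s] = (0.0, 0.0)

    # Kahn's algorithm: a node is processed only after all its predecessors.
    queue = deque(v for v in nodes if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        for v, d in succ[u]:
            for s, (late, early) in reach[u].items():
                old_late, old_early = reach[v].get(
                    s, (float("-inf"), float("inf"))
                )
                reach[v][s] = (max(old_late, late + d), min(old_early, early + d))
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return reach


# Toy netlist: latch "a" and input "b" feed gate "g1"; "a" also reaches
# gate "g2" both directly and via "g1".
toy_edges = [("a", "g1", 1.0), ("b", "g1", 2.0), ("g1", "g2", 1.5), ("a", "g2", 0.5)]
result = propagate_sources(toy_edges, ["a", "b"])
```

In the toy netlist, `result["g2"]["a"]` is `(2.5, 0.5)`: the longest path from `a` to `g2` runs through `g1`, the shortest is the direct edge. Each such pair would yield one late mode and one early mode constraint for the signal path from `a` to `g2`.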
Obviously, the result of the optimization depends on the length of the interval given by $l_s$ and $u_s$ specifying by how much the clock signal arrival time can be shifted from …

Table 4 shows the worst slack of all late mode constraints before and after optimization with respect to the target cycle time mentioned in Table 1. This means that the L2 could run with a cycle time of 6.5 ns + 0.048 ns = 6.548 ns before the optimization and with a cycle time of 6.5 ns − 0.313 ns = 6.187 ns after optimization: the cycle time was improved by 5.5%. Similarly, the cycle time of the PU was improved by 2.5% and the cycle time of the MBA by 3.7%.

Table 4 also shows the number of signal paths with late mode slack smaller than −0.2 ns, −0.1 ns, … . This demonstrates how dramatically the number of critical paths decreases.

Table 5 shows the running times of the algorithm for the three chips, both for computing the optimal cycle time and for balancing slacks in different scenarios with $\delta_{\text{late}}$, $\delta_{\text{early}}$ and $\delta_{\text{clock}}$. Again, all late mode slacks are taken with respect to the target cycle times.

Fig. 6 shows a frequency distribution of the slacks of all late mode constraints for the MBA for the case that no optimization is possible (i.e. $l_s = u_s = 0.0$ ns for all latches $s$). The first column shows the different intervals, the second column gives the number of signal paths whose slack is within the interval and the third column gives a graphical representation of this number by a proportional number of stars. For example, it can be read from Fig. 6 that the MBA without optimization has 101 signal paths whose late mode slack is at least −0.100 ns and less than −0.050 ns. Fig. 7 shows the corresponding frequency distributions of late mode constraints after optimization for a target value of 0.2 ns.

Figs. 8 and 9 show the same for all early mode constraints. Increasing the slacks on late mode constraints makes the worst slack of all early mode constraints worse, but nevertheless the total number of early mode constraints with negative slack decreases considerably.

Fig. 11a and b show the effect of slack balancing at a glance. Each line connects the endpoints of a critical path, with respect to the placement shown in Fig. 10. The colours have the following meaning:
• red lines represent signal paths with a negative late mode slack;
• yellow lines represent signal paths with a late mode slack between 0.0 and 0.2 ns;
• blue lines represent signal paths with a negative early mode slack;
• green lines represent signal paths with an early mode slack between 0.0 and 0.2 ns.
The left-hand side is the situation before optimization; the right-hand side shows that after optimization only very few critical areas remain. Fig. 13a and b are the analogous pictures for the PU, with respect to the placement shown in Fig. 12.

Even though the slacks on late mode and early mode constraints have already been increased up to 0.2 ns, the clock signal for most of the latches still does not have to arrive at exactly the prescribed time.
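The cycle-time figures quoted for the L2 in the discussion of Table 4 can be re-derived with a few lines of Python. The helper names `achievable_cycle_time` and `improvement_percent` are hypothetical; the only assumption is the convention used in the text, namely that a worst late mode slack $s$ measured against a target cycle time $T$ permits a cycle time of $T - s$.

```python
def achievable_cycle_time(target_ns, worst_late_slack_ns):
    """Cycle time (ns) at which the worst late mode slack becomes zero."""
    return target_ns - worst_late_slack_ns


def improvement_percent(before_ns, after_ns):
    """Relative reduction of the cycle time, in percent."""
    return 100.0 * (before_ns - after_ns) / before_ns


# L2 chip: target 6.5 ns, worst late mode slack -0.048 ns before
# and +0.313 ns after optimization (values quoted in the text).
before = achievable_cycle_time(6.5, -0.048)
after = achievable_cycle_time(6.5, 0.313)
gain = improvement_percent(before, after)
print(f"{before:.3f} ns -> {after:.3f} ns, gain {gain:.1f}%")
# prints: 6.548 ns -> 6.187 ns, gain 5.5%
```

The improvement is measured relative to the cycle time achievable before optimization (6.548 ns), not relative to the 6.5 ns target, which is what reproduces the stated 5.5%.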
