The application of general clock skew scheduling is practically limited due to the difficulties in implementing a wide spectrum of dedicated clock delays in a reliable manner. This results in a significant limitation of the optimization potential. As an alternative, the application of multiple clocking domains with dedicated phase shifts that are implemented by reliable, possibly expensive design structures can overcome these limitations and substantially increase the implementable optimization potential of clock adjustments. In this paper we present an algorithm for constrained clock skew scheduling which computes for a given number of clocking domains the optimal phase shifts for the domains and the assignment of the individual registers to the domains. For the within-domain latency values, the algorithm can assume a zero-skew clock delivery or apply a user-provided upper bound. Our experiments demonstrate that a constrained clock skew schedule using a few clocking domains combined with small within-domain latency can reliably implement the full sequential optimization potential to date only possible with an unconstrained clock schedule.
realize several clocking frequencies and also to address specific timing requirements. For example, a special clocking domain that delivers a phase-shifted clock signal to the registers close to the chip inputs and outputs is regularly used to achieve timing closure for ports with extreme constraints on their arrival and required times. In principle, a multi-domain approach could also be used to realize larger clock latency variations for all registers. In combination with a within-domain clock skew scheduling algorithm, such domains could implement an aggressive sequential optimization that would be impractical with individual delays of register clocks. The motivation behind this approach is that large phase shifts between clocking domains can be implemented reliably by using dedicated, possibly expensive circuit components such as "structured clock buffers" [4], adjustments to the PLL circuitry, or simply by deriving the set of phase-shifted domains from a higher frequency clock using different tapping points of a shift register.
In our terminology, we use the term clock latency of a register to denote its clock arrival time relative to a common origin of time. Note that the origin can be chosen arbitrarily: different origins simply correspond to different offsets added to all register latencies. Clock skew refers to the relative difference of the clock latencies of registers. We use the term clock phase shift of a domain to denote an offset of the latency common to all registers of that domain. The within-domain latency is defined as the difference between the clock latency of a register and the phase shift of its domain. Thus a zero within-domain latency means that all register latencies of a domain are equal to the phase shift of the domain.
In current design methodologies, the specification of multiple clocking domains is mostly done manually as no design automation support is available. In this paper we present an algorithm for constrained clock skew scheduling which computes for a user-given number of clocking domains the optimal phase shifts for the domain clocks and the assignment of the circuit registers to the domains. For the clock distribution within a domain, the algorithm can assume a zero-skew clock delivery or apply a user-provided upper bound for the within-domain latency. Our experiments demonstrate that a clock skew schedule using a few domains combined with a small within-domain latency can reliably implement the full optimization potential of an unconstrained clock schedule.
Our algorithm is based on a branch-and-bound search for the assignment of registers to clocking domains. We apply a satisfiability (SAT) solver based on a problem encoding in conjunctive normal form (CNF) to efficiently drive the search and compactly record parts of the solution space that are guaranteed to contain no solutions better than the current one. The combination of a modern SAT solver [5] with an underlying orthogonal optimization problem provides a powerful mechanism for a hybrid search that has significant potential for other applications in many domains.
For simplicity, our description will be based on circuits which have initially a single clocking domain and include only registers that are triggered at the same clock edge. However, all presented concepts can be extended to more general cases including circuits which have initially multiple, possibly uncorrelated clocking domains and also include level-sensitive latches.
Unconstrained Clock Skew Scheduling
In this section, we revisit the algorithmic base for unconstrained clock skew scheduling which is extended to the constrained case in the following section. Given a sequential circuit, the objective of generic clock skew scheduling is to determine an assignment of latencies to registers in order to minimize the clock period, while avoiding clocking hazards [1].
Let $G = (V, E_{setup}, E_{hold})$ denote the timing graph for a sequential circuit. The set of vertices $V$ corresponds to the registers in the circuit and includes a single vertex for all circuit ports. The sets $E_{setup} \subseteq V \times V$ and $E_{hold} \subseteq V \times V$ denote the setup edges and hold edges, respectively. $E_{setup}$ contains for each set of combinational circuit paths between registers (or a port) $u$ and $v$ a directed edge $e = (u,v)$ with weight $w(u,v) = T_{cycle} - d_{max}(u,v) - d_{setup}(v)$, where $d_{max}(u,v)$ represents the longest combinational delay among all paths between $u$ and $v$, $d_{setup}(v)$ denotes the setup time at $v$, and $T_{cycle}$ is the cycle period. $E_{hold}$ consists of a set of reversed edges $e = (v,u)$ with weight $w(v,u) = d_{min}(u,v) - d_{hold}(v)$, where $d_{min}(u,v)$ is the shortest combinational delay among all paths between $u$ and $v$ and $d_{hold}(v)$ denotes the hold time at $v$. By construction $G$ is strongly connected and contains at least one setup edge. We assume that all weights of hold edges are nonnegative, i.e., $\forall e \in E_{hold} : w(e) \ge 0$. This restriction simplifies the presentation; however, all algorithms can be extended easily for a relaxed condition that just prohibits negative hold time cycles.
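As a minimal sketch of how the edge weights could be derived from path delays, the following Python fragment builds both edge sets. The function name `build_timing_graph`, the layout of the `paths` dictionary, and the example delays are illustrative assumptions. The hold weight $d_{min}(u,v) - d_{hold}(v)$ used here is the standard hold condition and is likewise an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (source, target, weight)


def build_timing_graph(
    paths: Dict[Tuple[str, str], Tuple[float, float]],
    t_cycle: float,
    d_setup: Dict[str, float],
    d_hold: Dict[str, float],
) -> Tuple[List[Edge], List[Edge]]:
    """paths maps (u, v) to (d_min, d_max) over all combinational paths u -> v."""
    setup_edges: List[Edge] = []
    hold_edges: List[Edge] = []
    for (u, v), (d_min, d_max) in paths.items():
        # Setup edge (u, v): w = T_cycle - d_max(u, v) - d_setup(v)
        setup_edges.append((u, v, t_cycle - d_max - d_setup.get(v, 0.0)))
        # Hold edge is reversed, (v, u): w = d_min(u, v) - d_hold(v) (assumed form)
        hold_edges.append((v, u, d_min - d_hold.get(v, 0.0)))
    return setup_edges, hold_edges


# Hypothetical two-register loop with zero setup and hold times.
paths = {("r1", "r2"): (4.0, 9.0), ("r2", "r1"): (1.0, 3.0)}
setup_edges, hold_edges = build_timing_graph(paths, 10.0, {}, {})
```

Only the setup weights depend on the cycle period, which is why lowering the period tightens setup edges while leaving hold edges unchanged.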
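Let $l : V \to \mathbb{R}$ assign a clock latency to each register and $E = E_{setup} \cup E_{hold}$. We want to determine an optimal clock skew schedule $l(v), v \in V$ such that:

$T_{cycle} \to \min$, subject to $\forall (u,v) \in E : l(u) - l(v) \le w(u,v)$ (1)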
The computed values $l$ give for each register the additional delay (or advance if $l < 0$) of its clock signal such that the circuit can be clocked with the minimum cycle period $T_{cycle}$. Note that condition (1) ensures that the setup and hold constraints are satisfied as modeled by the edges $E_{setup}$ and $E_{hold}$, respectively. Figure 1(a) gives an example of a circuit; the corresponding timing graph is given in part (b). The setup and hold times of registers and ports are assumed to be 0. The solid and dashed arcs correspond to the setup edges $E_{setup}$ and hold edges $E_{hold}$, respectively.

Computation of the optimal clock schedule is closely related to detection of the critical cycle, which is the structural cycle with the maximum value for total-delay/num-registers (ignoring hold edges). Detecting the critical cycle is equivalent to computing the maximum mean cycle (MMC) of a weighted cyclic graph. Our approach is mainly based on Burns' work [6, pp. 42-56], which is to our knowledge one of the fastest practical algorithms for the MMC computation.
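The link between the critical cycle and the minimum period can be illustrated with a deliberately simple stand-in for the MMC computation: bisection on $T_{cycle}$ combined with a Bellman-Ford feasibility check. This is not Burns' algorithm. The sketch assumes the edge convention $l(u) - l(v) \le w(u,v)$; the function names and the two-register example are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (u, v, constant part of the weight, True if the weight also includes T_cycle)
Edge = Tuple[str, str, float, bool]


def feasible_schedule(vertices: List[str], edges: List[Edge],
                      t_cycle: float) -> Optional[Dict[str, float]]:
    """Return latencies with l(u) - l(v) <= w(u, v) for every edge, or None."""
    lat = {v: 0.0 for v in vertices}
    for _ in range(len(vertices) + 1):
        changed = False
        for u, v, const, is_setup in edges:
            w = const + (t_cycle if is_setup else 0.0)
            if lat[v] + w < lat[u] - 1e-12:  # constraint violated: lower l(u)
                lat[u] = lat[v] + w
                changed = True
        if not changed:
            return lat
    return None  # still relaxing after |V| + 1 passes: a negative cycle exists


def min_period(vertices: List[str], edges: List[Edge],
               t_upper: float, iterations: int = 60) -> float:
    """Bisection between 0 and a period t_upper that is known to be feasible."""
    lo, hi = 0.0, t_upper
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if feasible_schedule(vertices, edges, mid) is not None:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical loop: r1 -> r2 has d_max = 9, d_min = 4; r2 -> r1 has d_max = 3, d_min = 1.
vertices = ["r1", "r2"]
edges = [
    ("r1", "r2", -9.0, True),   # setup: T - 9
    ("r2", "r1", -3.0, True),   # setup: T - 3
    ("r2", "r1", 4.0, False),   # hold edge for the path r1 -> r2
    ("r1", "r2", 1.0, False),   # hold edge for the path r2 -> r1
]
t_min = min_period(vertices, edges, t_upper=9.0)
schedule = feasible_schedule(vertices, edges, t_min)
```

In this example the only setup cycle has total delay 12 over two registers, so the period converges to the mean cycle value of 6 instead of the zero-skew period of 9.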
Algorithm 1 describes an adaptation of Burns' iterative MMC computation for the given problem. The basic idea is to iteratively decrease $T_{cycle}$ and compute the corresponding clock schedule $l$ at each step until a critical cycle is discovered. First, the algorithm initializes the schedule with all latencies set to 0 and $T_{cycle}$ to the ... respectively.
Similar to the unconstrained case, constraint (2) ensures that all setup and hold time constraints are satisfied and furthermore that all registers assigned to a domain do not exceed the specified maximum within-domain latency. Condition (3) specifies that each register has to be assigned to exactly one domain.
Base Algorithm
The problem formulation for constrained clock skew scheduling presented in the previous section establishes a Mixed Integer Linear Program (MILP). Unfortunately, the size of practical problem instances involving thousands of registers makes their solution intractable for generic MILP solvers.
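For intuition about the problem size, a tiny instance can be solved by exhaustive enumeration of all register-to-domain assignments. The sketch below is only a reference for very small graphs and is not the method proposed here. It assumes that the within-domain constraint has the form $0 \le l(v) - \phi(d) \le \delta$, that phase shifts are ordered by domain index, and it uses bisection with a Bellman-Ford check in place of the critical cycle analysis; all names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, float, bool]  # (u, v, constant weight, add T_cycle?)


def feasible(vertices: List[str], edges: List[Edge], t_cycle: float) -> bool:
    """True if latencies with l(u) - l(v) <= w(u, v) exist for all edges."""
    lat = {v: 0.0 for v in vertices}
    for _ in range(len(vertices) + 1):
        changed = False
        for u, v, const, is_setup in edges:
            w = const + (t_cycle if is_setup else 0.0)
            if lat[v] + w < lat[u] - 1e-12:
                lat[u] = lat[v] + w
                changed = True
        if not changed:
            return True
    return False


def min_period(vertices: List[str], edges: List[Edge],
               t_upper: float, iterations: int = 50) -> float:
    lo, hi = 0.0, t_upper
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if feasible(vertices, edges, mid):
            hi = mid
        else:
            lo = mid
    return hi


def brute_force(registers: List[str], edges: List[Edge], num_domains: int,
                delta: float, t_upper: float
                ) -> Tuple[float, Optional[Dict[str, str]]]:
    domains = [f"d{i}" for i in range(num_domains)]
    vertices = registers + domains
    # Assumed ordering of phase shifts: phi(d_i) <= phi(d_{i+1}).
    order = [(domains[i], domains[i + 1], 0.0, False)
             for i in range(num_domains - 1)]
    best_t, best_assign = t_upper, None
    for choice in product(domains, repeat=len(registers)):
        cond: List[Edge] = []
        for v, d in zip(registers, choice):
            cond.append((v, d, delta, False))  # l(v) - phi(d) <= delta
            cond.append((d, v, 0.0, False))    # phi(d) - l(v) <= 0
        t = min_period(vertices, edges + order + cond, t_upper)
        if best_assign is None or t < best_t - 1e-9:
            best_t, best_assign = t, dict(zip(registers, choice))
    return best_t, best_assign


# Hypothetical loop: r1 -> r2 (d_max 9, d_min 4), r2 -> r1 (d_max 3, d_min 1).
registers = ["r1", "r2"]
edges = [("r1", "r2", -9.0, True), ("r2", "r1", -3.0, True),
         ("r2", "r1", 4.0, False), ("r1", "r2", 1.0, False)]
t_one, _ = brute_force(registers, edges, 1, 0.0, 9.0)
t_two, assign_two = brute_force(registers, edges, 2, 0.0, 9.0)
```

The enumeration visits $|D|^{|V|}$ configurations, which is exactly the growth that a pruned search has to avoid on realistic circuits.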
Our objective is to efficiently solve the constrained clock skew scheduling problem for a small number of domains. We use a hybrid approach combining a CNF-based SAT solver with a modified version of the scheduling algorithm used in the unconstrained case. We use the SAT solver for enumerating the assignments of registers to domains based on the presented encoding with the Boolean variables $x$. Boolean constraints are applied to restrict the search to valid assignments according to condition (3) and to incrementally record parts of the solution space that do not contain solutions that are better than the best found thus far. This recording is done by adding conflict clauses to the SAT problem which prevent the solver from revisiting symmetric parts of the solution space.
The basic flow of our approach is shown in Algorithm 2. After initialization on lines 1 and 2, an empty CNF formula $\Phi$ is created with a set of variables for the registers and clocking domains.
The procedure INITIALCONSTRAINTS then adds an initial set of Boolean constraints to $\Phi$ that encode valid register-to-domain assignments and represent necessary conditions for the optimization problem. Next, the SAT solver is called iteratively to find a complete satisfying assignment $x_{SAT}$ with respect to $\Phi$. For each generated satisfying assignment, one of the following applies: (1) if the minimum possible period for the configuration is greater than the current best value for $T_{cycle}$, then this can be detected by a negative cycle in the graph configured by $x_{SAT}$, or (2) if there are no negative cycles, then $T_{cycle}$ can be further improved using Burns' algorithm. In the first case the procedure NEGCYCLECONSTRAINTS learns the negative cycles by adding corresponding CNF constraints to $\Phi$. In the second case the modified critical cycle analysis shown in Algorithm 3 is invoked to further improve $T_{cycle}$ until a tighter critical cycle is reached. Following this optimization step, the procedure TIGHTENINGCONSTRAINTS adds a set of new CNF constraints to $\Phi$ which encode the critical cycles in $G$ and other conditions that are necessary for improving the solution.
The negative and critical cycle constraints jointly ensure that no configuration with previously encountered cycles is revisited. The iteration between the SAT solver and the critical cycle analysis is continued until no new solution can be found. At this point, the last values of $T_{cycle}$ and $l$ represent the optimal solution for the constrained clock skew scheduling problem. To simplify the presentation of the algorithmic flow, we show all register latencies initialized to 0 and $T_{cycle}$ set to the maximum combinational delay each time Algorithm 3 is invoked. This ensures a valid starting point for Burns' algorithm. Furthermore, CONDITIONALSCHEDULE is only applied if $G$ does not contain any negative cycle for the current $T_{cycle}$; thus it is guaranteed that a schedule with an equal or smaller value for $T_{cycle}$ can be found.
In the actual implementation, the detection of negative cycles on line 8 of Algorithm 2 and the computation of valid register latencies for the given best $T_{cycle}$ are combined using a single analysis run. This provides a good starting point for tightening the critical cycle and thus avoids unnecessary iterations of Burns' algorithm.
Algorithm INITIALCONSTRAINTS
There are two sets of initial constraints for the SAT solver. The first set ensures that each register is assigned to exactly one domain. This is encoded by the following set of CNF clauses for all $v \in V$ and all $d_i, d_j \in D$, $i \neq j$:
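One common way to express an exactly-one condition in CNF is an at-least-one clause per register plus pairwise at-most-one clauses. The sketch below generates such clauses with DIMACS-style integer literals; the variable numbering and the function name `exactly_one_clauses` are assumptions made for illustration and may differ from the encoding actually used.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def exactly_one_clauses(registers: List[str], num_domains: int
                        ) -> Tuple[Dict[Tuple[str, int], int], List[List[int]]]:
    """Map x(v, d) to positive integers and emit exactly-one clauses per register."""
    var: Dict[Tuple[str, int], int] = {}
    for v in registers:
        for d in range(num_domains):
            var[(v, d)] = len(var) + 1
    clauses: List[List[int]] = []
    for v in registers:
        # At least one domain: x(v, d_1) or ... or x(v, d_|D|)
        clauses.append([var[(v, d)] for d in range(num_domains)])
        # At most one domain: not x(v, d_i) or not x(v, d_j) for every i < j
        for di, dj in combinations(range(num_domains), 2):
            clauses.append([-var[(v, di)], -var[(v, dj)]])
    return var, clauses


var, clauses = exactly_one_clauses(["r1", "r2"], 3)
```

With this pairwise form, each register contributes one clause of length $|D|$ and $|D|(|D|-1)/2$ binary clauses, which stays small for the few domains considered here.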
To avoid visiting symmetric domain assignments, one can either encode a corresponding set of CNF constraints that exclude these cases, or define a total ordering of the phase shifts of the individual domains such that:
In our approach we chose the latter method, which can be enforced by adding an edge $(d_i, d_{i+1})$ to the timing graph with weight $w(d_i, d_{i+1}) = 0$. Algorithm 4 summarizes the generation of initial constraints.
In the actual implementation, the edge weights are set to a slightly tighter value $w(d_i, d_{i+1}) = -\delta$, excluding "overlapping" solutions which can occur due to the within-domain latency of up to $\delta$. However, using negative weights for the domain-to-domain edges requires special care for the initialization of the schedule $l$ for Burns' algorithm.
Algorithm NEGCYCLECONSTRAINTS
Algorithm NEGCYCLECONSTRAINTS is invoked if the graph currently configured cannot implement the best cycle time $T_{cycle}$ found thus far. This situation is detected by finding a cycle in $G$ that contains at least one setup edge and has a non-positive cycle weight. Clearly, any such cycle must contain at least one pair of "active" conditional edges from $E_{cond}$. This is because a negative cycle just consisting of edges from $E_{setup} \cup E_{hold}$ constrains the minimum value of $T_{cycle}$ independently of the domain assignment and hence would have been detected earlier.
The negative or zero weighted cycles are encoded as CNF conflict clauses and added to $\Phi$. For example, if a cycle contains the two conditional edges $(v_1,d_1)$ and $(v_2,d_2)$, the clause $(\neg x(v_1,d_1) \vee \neg x(v_2,d_2))$ is added, which ensures that in the future both edges are not activated at the same time. Since the number of cycles is generally exponential, our implementation uses a greedy heuristic which considers all cycles with up to four conditional edges. Our experiments show that this scheme provides an efficient means to keep the number of learned clauses small and at the same time ensure quick convergence. Algorithm 5 summarizes the learning of negative cycle constraints.
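The clause construction described above can be sketched in a few lines: every conditional edge on an offending cycle contributes one negated literal. The variable numbering, the helper names, and the two sample assignments are hypothetical.

```python
from typing import Dict, List, Tuple

CondEdge = Tuple[str, str]  # (register, domain)


def conflict_clause(cycle_cond_edges: List[CondEdge],
                    var: Dict[CondEdge, int]) -> List[int]:
    """Clause that forbids all conditional edges of one cycle being active together."""
    return [-var[e] for e in cycle_cond_edges]


def satisfies(clause: List[int], assignment: Dict[int, bool]) -> bool:
    """True if at least one literal of the clause holds under the assignment."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


# Assumed numbering of the Boolean variables x(v, d).
var = {("v1", "d1"): 1, ("v1", "d2"): 2, ("v2", "d1"): 3, ("v2", "d2"): 4}
clause = conflict_clause([("v1", "d1"), ("v2", "d2")], var)

blocked = {1: True, 2: False, 3: False, 4: True}   # v1 in d1 and v2 in d2
allowed = {1: True, 2: False, 3: True, 4: False}   # v1 in d1 and v2 in d1
```

Shorter cycles give shorter clauses, and a short clause excludes a larger part of the search space, which is consistent with limiting the heuristic to cycles with few conditional edges.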
Algorithm TIGHTENINGCONSTRAINTS
If no negative cycles are encountered, algorithm CONDITIONALSCHEDULE is invoked to improve the clock period $T_{cycle}$ and calculate a corresponding schedule $l$. After this computation, a set of constraints encoding the zero-weight critical cycles is added, which prevents revisiting a configuration with an identical critical cycle. Because of the assumed ordering of domains this inequality can be learned through the following set of clauses generated for all $d_i, d_j \in D$, $i < j$:
These clauses effectively capture the constraint that any satisfying configuration $x_{SAT}$ can only allow assignments $x(u,d_i)$ and $x(v,d_j)$ where $i \le j$. The condition can be applied more generally by including any path from $u$ to $v$ formed by edges of $E_{setup} \cup E_{hold}$ with negative path weight. When $T_{cycle}$ is decreased, all edges in $E_{setup}$ decrease in weight. The precedence constraint can then be implied on a subset of paths in $G = (V, E_{setup}, E_{hold})$ whose weights become negative. Again, overlapping solutions can be avoided by tightening these constraints by the sum of the bounds on the within-domain latencies. For an efficient generation of precedence constraints, an incremental all-pairs shortest path algorithm [7] is used to update the shortest path delays between any pair of nodes in $G$ whenever $T_{cycle}$ is improved.

Figure 2 shows the multi-domain timing graphs for two configurations for the example of Figure 1 with two clocking domains and within-domain latency $\delta = 0$. The minimum period with two domains ($T_{cycle} = 7$) is achieved by configuration (b). Note that with three domains the minimum clock period is 6, which is just the solution for the unconstrained case as derived in Figure 1. Indeed, the optimum clock period achieved in the unconstrained case provides a lower bound for the optimum period when the number of domains is constrained.
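The precedence constraints derived from negative-weight paths can be illustrated with a plain Floyd-Warshall pass, used here as a non-incremental stand-in for the incremental all-pairs shortest path algorithm [7]. The sketch assumes the convention $l(u) - l(v) \le w(u,v)$, so a negative path from $u$ to $v$ forces $u$ into a domain no later than that of $v$. The period value 6.5, the graph, and all names are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]


def precedence_pairs(vertices: List[str], edges: List[Edge]) -> List[Tuple[str, str]]:
    """Pairs (u, v) whose shortest path u -> v has negative weight."""
    inf = float("inf")
    dist: Dict[Tuple[str, str], float] = {
        (a, b): (0.0 if a == b else inf) for a in vertices for b in vertices
    }
    for u, v, w in edges:
        dist[(u, v)] = min(dist[(u, v)], w)
    # product() varies its first element slowest, so k is the outermost loop.
    for k, i, j in product(vertices, repeat=3):
        if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
            dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
    return [(u, v) for u in vertices for v in vertices
            if u != v and dist[(u, v)] < 0]


def forbidden_assignments(pairs: List[Tuple[str, str]], num_domains: int):
    """For u before v: u in d_j together with v in d_i is excluded for all i < j."""
    return [((u, j), (v, i)) for u, v in pairs
            for i, j in combinations(range(num_domains), 2)]


# Hypothetical loop evaluated at T_cycle = 6.5 (setup weights are T - d_max).
edges = [("r1", "r2", 6.5 - 9.0), ("r2", "r1", 6.5 - 3.0),
         ("r2", "r1", 4.0), ("r1", "r2", 1.0)]
pairs = precedence_pairs(["r1", "r2"], edges)
forbidden = forbidden_assignments(pairs, 2)
```

Each forbidden combination corresponds to one binary clause over the assignment variables, so the number of learned clauses grows with the number of negative-weight pairs as the period decreases.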
Example
For the constrained clock skew scheduling example in Figure 2, there are at most $|D|^{|V|} = 2^5 = 32$ different configurations to explore in order to compute the smallest period with two domains. The key for efficiently pruning the search is based on the observation that the period of a particular configuration is limited only by the subset of the register-domain assignments that correspond to critical cycles in the timing graph. For example, after the SAT solver generates the configuration in Figure 2(a), we can avoid any other configuration with either the assignments $x(v_1,d_1) = x(v_2,d_1) = 1$ or $x(v_3,d_2) = x(v_4,d_2) = 1$, since the corresponding critical cycles always limit $T_{cycle}$ to 8. This is encoded by adding the following two CNF conflict clauses to $\Phi$: $(\neg x(v_1,d_1) \vee \neg x(v_2,d_1))$ and $(\neg x(v_3,d_2) \vee \neg x(v_4,d_2))$.
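The pruning effect of these two learned clauses can be checked by direct counting. The sketch below enumerates all 32 configurations and filters them against the two conflict clauses. The label `v5` for the fifth vertex is an assumption, since only $v_1$ to $v_4$ are named in the text.

```python
from itertools import product

vertices = ["v1", "v2", "v3", "v4", "v5"]  # "v5" is an assumed label
domains = ["d1", "d2"]

# Each learned clause lists assignments that must not all hold at once.
learned = [
    [("v1", "d1"), ("v2", "d1")],
    [("v3", "d2"), ("v4", "d2")],
]


def allowed(config: dict) -> bool:
    """True if no learned clause has all of its assignments active."""
    return all(
        not all(config[v] == d for v, d in clause) for clause in learned
    )


configs = [dict(zip(vertices, c)) for c in product(domains, repeat=len(vertices))]
remaining = [c for c in configs if allowed(c)]
```

Under these assumptions the two binary clauses already remove 14 of the 32 configurations, leaving 18 candidates after a single evaluated configuration.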
When the configuration in Figure 2(b) is visited, $T_{cycle}$ is updated to 7 and the corresponding critical cycles are learned. In this manner, the algorithm continuously generates valid configurations, prunes the remaining search space by learning critical cycles, and improves $T_{cycle}$ until the SAT solver is unable to find another satisfying register-domain assignment.
Further Algorithmic Improvements
The base algorithm works efficiently for larger circuits with up to three clocking domains. However, in the case of more clocking domains, the exponential nature of the problem may cause long runtimes. Note that the search can be interrupted at any point; all encountered solutions are valid, so the last one can serve as a suboptimal schedule.
We observed that the runtime can be reduced significantly when the search is composed of the following three phases: (1) initial estimation of a good solution based on binning of the unconstrained clock schedule, (2) gradual improvement of this solution based on a limited search space that preserves the ordering of the unconstrained schedule, and (3) final full search with the temporary limitation removed. When artificially over-constraining the search during the first two phases, the solver converged significantly faster. Furthermore, many negative cycle and tightening constraints can be added for the final full search, which in turn improves its run time. Algorithm 7 gives an overview of this refined algorithm; the following sections elaborate on the details of the first two phases.

[Algorithm 7, closing steps only: FULLSOLUTION($G$, $T_{cycle}$); return $l$, $T_{cycle}$]
Initial Solution
A simple approach to derive a good initial value for $T_{cycle}$ is to solve the unconstrained clock skew scheduling problem for $G$ using Algorithm 1 and then distribute the resulting latencies greedily into $|D|$ bins of size $(l_{max} - l_{min})/|D|$, where $l_{max}$ and $l_{min}$ represent the maximum and minimum latency of the unconstrained schedule, respectively. The actual clock period for this solution is computed by translating the latency binning into corresponding register-domain edges in $G$ followed by a single run of Algorithm 3.
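The binning step can be sketched as follows, assuming equal-width bins that span the range between the minimum and maximum unconstrained latency. The function name `bin_latencies` and the sample latencies are hypothetical, and the exact greedy rule used in the implementation may differ.

```python
from typing import Dict


def bin_latencies(latency: Dict[str, float], num_domains: int) -> Dict[str, int]:
    """Assign each register to a domain index by equal-width binning of its latency."""
    l_min = min(latency.values())
    l_max = max(latency.values())
    width = (l_max - l_min) / num_domains
    assignment: Dict[str, int] = {}
    for v, l in latency.items():
        if width == 0.0:
            assignment[v] = 0  # all latencies equal: a single domain suffices
        else:
            # The register with the maximum latency falls into the last bin.
            assignment[v] = min(int((l - l_min) / width), num_domains - 1)
    return assignment


unconstrained = {"a": 0.0, "b": 1.0, "c": 3.9, "d": 4.0}
initial_domains = bin_latencies(unconstrained, 2)
```

The resulting assignment only fixes the register-to-domain mapping; the period it achieves still has to be evaluated on the configured timing graph.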
Furthermore, the best solution for $|D| - 1$ domains provides an upper bound for $T_{cycle}$ with $|D|$ domains. Since the algorithm runs significantly faster for fewer clocking domains, a previously computed solution for fewer domains can be used as an alternative starting point if its value for $T_{cycle}$ is smaller than the one from binning.
Partial Ordering Heuristic
After the initialization step, we can introduce a set of partial ordering constraints on the domain assignments of registers. The partial ordering helps in trimming the search space, but may in turn also exclude the optimum solution. The heuristic assumes that if in the unconstrained skew schedule register $u$ has a latency greater than that of register $v$, then there exists an optimum constrained skew schedule that has $u$ assigned to a domain equal to or higher than $v$. The constraint generation for this heuristic is detailed in Algorithm 8. The SAT-based search is then applied to this overconstrained problem. The resulting clock period is a good starting point for the final run of the solver to compute the exact optimum.
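A possible clause generation for this heuristic is sketched below: whenever $u$ has a larger unconstrained latency than $v$, every combination that places $u$ in a lower domain than $v$ is excluded. This is an illustration under assumed variable numbering and is not a transcription of Algorithm 8.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def ordering_clauses(latency: Dict[str, float], num_domains: int,
                     var: Dict[Tuple[str, int], int]) -> List[List[int]]:
    """Binary clauses that keep the domain order consistent with the latency order."""
    clauses: List[List[int]] = []
    for u in latency:
        for v in latency:
            if latency[u] > latency[v]:
                # Exclude u in d_i together with v in d_j for every i < j.
                for i, j in combinations(range(num_domains), 2):
                    clauses.append([-var[(u, i)], -var[(v, j)]])
    return clauses


latency = {"a": 0.0, "b": 2.0, "c": 2.0}
# Assumed numbering: x(a,0)=1, x(a,1)=2, x(b,0)=3, x(b,1)=4, x(c,0)=5, x(c,1)=6
var = {(r, d): 2 * k + d + 1 for k, r in enumerate(latency) for d in range(2)}
clauses = ordering_clauses(latency, 2, var)
```

Registers with equal latency, such as `b` and `c` here, remain unordered with respect to each other. The number of clauses grows quadratically with the number of registers, so a practical version would likely restrict the pairs considered.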
The partial ordering heuristic appears to be exact for small circuits; however, one can show that the ordering constraints may exclude better solutions, as illustrated by a simple counter-example given in Figure 3. For this graph, the optimum cycle is 4 for the unconstrained case. The latencies at each vertex to achieve this period are shown in the figure. Note that the constrained version of the problem will require at least 8 clocking domains to achieve this period.
Let $d_v$ denote the domain that vertex $v$ is assigned to. Allowing only two clocking domains and zero within-domain latency (i.e., ...

They extend the basic framework to balance slacks on all circuit paths in order to restrict uncertainties in the implementation of delay buffers and clock tree synthesis. To our knowledge, the practically fastest algorithm applicable for unconstrained clock skew scheduling has been published by Burns [6]. Here, the computation of the optimal clock schedule is related to the detection of the critical cycle. Burns' algorithm provides a fast method to identify the critical cycle and distribute the smallest amount of latency necessary for the registers to attain the optimal cycle period. It is this algorithm that we use in the inner loop of our approach for constrained clock skew scheduling.
There is no work that proposes a solution to the constrained clock skew scheduling problem that is considered in this paper. ... on generating a feasible clock tree as opposed to finding an optimal solution for the cycle time in the clock skew scheduling problem. The work that comes closest to our problem of constrained clock skew scheduling was published by Singh and Brown [11]. The authors consider the problem of clock skew scheduling using a fixed, small set of clocking domains with pre-determined phase shifts to be implemented in FPGAs. The solution is a slight modification of the unconstrained clock skew scheduling method and uses an iterative Bellman-Ford algorithm. However, the quality of their results is only as good as the predefined phase shift values. In contrast, our work considers the more general problem of multi-domain clock skew scheduling where the phase shifts of the domains can be modified for optimal performance.
Experimental Results
To evaluate the algorithm and observe its performance on practical designs, we have created a prototype implementation using the presented methods on top of the SAT solver Chaff [5]. Our benchmark suite consisted of the 31 ISCAS89 sequential circuits and 8 industrial designs. The ISCAS benchmarks were technology mapped through SIS [13] using the library lib2.genlib. The industrial circuits were generated by a commercial logic synthesis tool using industrial ASIC libraries. We applied the REFINEDCONSTRAINEDSKEWSCHEDULING algorithm to determine the minimum feasible clock period with up to four clocking domains and a within-domain latency of up to 10% of the initial cycle period corresponding to the longest combinational delay including setup time. The experiments were conducted on a Pentium III 2GHz processor with 2GB RAM running Linux. The results are reported in Tables 1 and 2. Table 3 presents the run times and the number of SAT solver iterations for the industrial circuits.
Columns 2 and 3 in Tables 1 and 2 give the number of vertices and edges in the timing graph. Column 4 reports the optimal clock period $T_{cycle}$ achievable through clock skew scheduling with an unconstrained number of domains. This is a lower bound. Column 5 shows the initial cycle time for the circuit corresponding to a zero-skew schedule, which is simply the longest combinational path delay. This is an upper bound and corresponds to a configuration with one domain and zero within-domain latency, denoted as $T_{cycle}^{1,0}$. The subsequent columns report the optimum clock period computed by our algorithm for a bounded number of domains and within-domain latency of 0%, 5%, and 10% of $T_{cycle}^{1,0}$. The numbers reported in a column with a label of $T_{cycle}^{x,y}$ indicate the optimum cycle time for $x$ clock domains and a within-domain latency of $\delta = y\% \cdot T_{cycle}^{1,0}$. We highlighted all dominating solutions, i.e., the non-bold entries reflect solutions for which there exists an equivalent or better one with fewer domains or a smaller value for the within-domain latency.
The algorithm easily optimized all ISCAS benchmarks; for a majority of instances, the optimum was achieved with fewer than three domains. The total run time on the first 21 ISCAS benchmarks was less than a minute. The last four circuits took only slightly longer. The results reported in Table 2 indicate a considerable cycle time improvement in most of the industrial circuits. Even with two domains and a within-domain latency of $\delta = 5\% \times T_{cycle}^{1,0}$, the industrial benchmarks achieved on average 90% of the optimum cycle time ($T_{cycle}$) possible. With three domains and $5\% \times T_{cycle}^{1,0}$ latency, these benchmarks come as close as 95% of the optimum solution. In fact, for six of the eight industrial benchmarks, we achieve the lowest clock period possible through clock skew scheduling with four domains; four among these reached the optimum with three domains. The run times were reasonable, given the high complexity of the problem. For design D2, with four domains and no within-domain skew, we terminated the algorithm after 20 hours; it had achieved a cycle time of 15.89 as shown. We re-ran that case with a tight initial guess (from a previous run) and the algorithm terminated in 17 hours with the optimum cycle time, which for that case is 15.41.

Figure 4 tracks the progress of the three phases of the algorithm over time for seven industrial designs constrained by four clocking domains and zero within-domain latency. Circuit D4 is not included because the optimum period was trivially computed and there was no iterative improvement. The execution time and clock period have been normalized; 100% corresponds to the clock ...

Conclusions
In this paper we presented an algorithm for constrained clock skew scheduling which computes for a fixed number of clocking domains the optimal phase shifts for the domains and the assignment of the individual registers to the domains. For the within-domain latency values, the algorithm can assume a zero-skew clock delivery or apply a user-provided upper bound. Our algorithm is based on a branch-and-bound enumeration of the register-to-domain assignments. We apply a CNF-based SAT solver for the enumeration process and use learning of CNF constraints to prevent invalid register assignments and to record sets of inferior solutions which should not be revisited. The actual evaluation of each assignment is performed by an incremental maximum mean cycle analysis on the constraint graph.
Our experiments indicate that, despite the potential complexity of the enumeration process, the presented algorithm is efficient for modestly sized circuits and works reasonably fast even for circuits with several thousand registers. Furthermore, our results show that a constrained clock skew schedule with few clocking domains and zero or 5% within-domain latency can in most cases achieve the optimal cycle time dictated by the critical cycle of the circuit. The resulting multi-domain solution provides a significant advantage over the corresponding unconstrained clock skew schedule, which typically has large variations of register latencies and thus cannot be implemented in a reliable manner.
