It is known that clock skew can be exploited as a manageable resource to improve circuit performance. However, due to the limitation of race conditions, optimal clock skew scheduling does not achieve the lower bound of the clock period. In this paper, we propose a polynomial-time algorithm, which incorporates optimal clock skew scheduling and delay insertion, for the synthesis of non-zero clock skew circuits. Our algorithm has two main advantages. First, it is guaranteed to achieve the lower bound of the clock period. Second, it also tries to minimize the required inserted delays under the lower bound of the clock period. Experimental data show that, even though we only use the buffers in a standard cell library to implement the delay insertion, our approach still works well.
INTRODUCTION
Although optimal clock skew scheduling [1,2] can enhance circuit performance, it does not achieve the lower bound of the clock period. In fact, the maximum delay-to-register ratio of the cycles in the circuit gives a lower bound of the clock period. No sequential timing optimization technique can derive a sequential circuit that works with a clock period less than this lower bound.
Previous works [3, 4] have tried to improve optimal clock skew scheduling by combining it with retiming transformation. However, retiming transformation is likely to have the following effects in the design flow: (1) extra circuitry may be needed for initial states; (2) forward retiming may increase the number of registers; and (3) verification issues may arise. This paper investigates the clock period minimization problem from a different standpoint. The main distinction of our approach is that it uses delay insertion instead of retiming transformation. The proposed algorithm is guaranteed to obtain, in polynomial time, a non-zero clock skew circuit that works with the lower bound of the clock period. Moreover, since our algorithm also tries to minimize the required inserted delays, it makes the permissible range of delay insertion quite large. As a result, even though we only use the buffers in a standard cell library to realize the delay insertion (i.e., the amount of inserted delay is not continuous), a feasible solution is easily found. Thus, the proposed approach can work well in practice.
PRELIMINARIES
A sequential circuit can be modeled as a circuit graph $G(V,E)$, where $V$ is the set of vertices and $E$ is the set of directed edges. Each vertex $R_i \in V$ represents a register. A special vertex called the host is introduced for synchronization with primary inputs and primary outputs. A directed edge $(R_i,R_j)$ represents a data path from register $R_i$ to register $R_j$. Each directed edge $(R_i,R_j)$ is associated with a weight $(T_{PD_{ij}(\min)}, T_{PD_{ij}(\max)})$, where $T_{PD_{ij}(\min)}$ and $T_{PD_{ij}(\max)}$ are the minimum delay and the maximum delay, respectively. We use the sequential circuit ex given in Figure 1 as an example. Given a circuit graph $G$ and clock period $P$, we can model the constraints of clocking hazards by a constraint graph $G_{cg}(G,P)$.
The constraint graph $G_{cg}(G,P)$ is a directed graph in which each vertex corresponds to a register and each directed edge corresponds to a constraint. Each directed edge $(R_i,R_j)$ in the circuit graph $G$ is replaced by a D-edge $e_d(R_i,R_j)$ and a Z-edge $e_z(R_j,R_i)$ in the corresponding constraint graph $G_{cg}(G,P)$. The D-edge $e_d(R_i,R_j)$, which is associated with a weight $-T_{PD_{ij}(\min)}$, corresponds to the double clocking constraint $T_{C_i} - T_{C_j} \ge -T_{PD_{ij}(\min)}$, where $T_{C_i}$ is the clock arrival time at register $R_i$. The Z-edge $e_z(R_j,R_i)$, which is associated with a weight $T_{PD_{ij}(\max)} - P$, corresponds to the zero clocking constraint $T_{C_j} - T_{C_i} \ge T_{PD_{ij}(\max)} - P$.
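The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the dictionary representation, the function names `constraint_edges` and `clock_schedule`, and the two-register circuit `toy` are all assumptions introduced here (the host vertex is not modeled, and `toy` is not the paper's circuit ex). Each constraint-graph edge (u, v, w) encodes $T_{C_u} - T_{C_v} \ge w$, and a clock schedule is obtained by longest-path relaxation, which fails exactly when the graph contains a positive cycle.

```python
from typing import Dict, List, Optional, Tuple

# (i, j) -> (minimum delay, maximum delay) of the data path from R_i to R_j
Circuit = Dict[Tuple[int, int], Tuple[float, float]]
# (u, v, weight, kind) encodes the constraint T_u - T_v >= weight
Edge = Tuple[int, int, float, str]

EPS = 1e-9  # tolerance for floating-point comparisons


def constraint_edges(circuit: Circuit, period: float) -> List[Edge]:
    """Build the D-edges and Z-edges of the constraint graph for a given period."""
    edges: List[Edge] = []
    for (i, j), (d_min, d_max) in circuit.items():
        edges.append((i, j, -d_min, "D"))          # double clocking constraint
        edges.append((j, i, d_max - period, "Z"))  # zero clocking constraint
    return edges


def clock_schedule(n: int, circuit: Circuit, period: float) -> Optional[List[float]]:
    """Longest-path relaxation; returns None when a positive cycle exists."""
    edges = constraint_edges(circuit, period)
    t = [0.0] * n  # all registers start at arrival time 0
    for _ in range(n):
        changed = False
        for u, v, w, _kind in edges:
            if t[u] < t[v] + w - EPS:
                t[u] = t[v] + w
                changed = True
        if not changed:
            return t
    return None  # still changing after n passes: positive cycle


# Hypothetical two-register circuit (not the paper's circuit ex).
toy = {(0, 1): (1.0, 5.0), (1, 0): (1.0, 1.0)}
print(clock_schedule(2, toy, 4.0))  # [0.0, 1.0]
print(clock_schedule(2, toy, 3.5))  # None
```

At a period of 4 the D-edge from register 0 to register 1 and the Z-edge from register 1 to register 0 form a zero-weight cycle, so this hypothetical circuit cannot run faster under clock skew scheduling alone.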
A cycle is defined as a critical cycle if and only if the summation of its edge weights is zero. Given a circuit graph $G$, there exists at least one critical cycle in the constraint graph $G_{cg}(G,P_{OCSS})$, where $P_{OCSS}$ is the minimum clock period obtained by optimal clock skew scheduling (OCSS). For example, in Figure 3, a cycle through $R_4$ and $R_3$ is a critical cycle when $P = 5$. In a constraint graph, an edge from register $R_i$ to register $R_j$ is defined as a critical edge if and only if $T_{C_i} - T_{C_j}$ is equal to the weight of this edge. Note that all the directed edges in a critical cycle are critical edges.
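One simple way to compute $P_{OCSS}$ is to bisect on the clock period and test each candidate for a positive cycle in the constraint graph. The sketch below is an assumption-laden illustration and not necessarily the method of [1,2]: the names `feasible` and `ocss_period`, the dictionary representation, and the two-register circuit `toy` are hypothetical. Bisection is valid because a larger $P$ only relaxes the Z-edge constraints.

```python
from typing import Dict, Tuple

Circuit = Dict[Tuple[int, int], Tuple[float, float]]  # (i, j) -> (min delay, max delay)


def feasible(n: int, circuit: Circuit, period: float, eps: float = 1e-9) -> bool:
    """True if the constraint graph for this period has no positive cycle."""
    edges = []
    for (i, j), (d_min, d_max) in circuit.items():
        edges.append((i, j, -d_min))          # D-edge: T_i - T_j >= -d_min
        edges.append((j, i, d_max - period))  # Z-edge: T_j - T_i >= d_max - P
    t = [0.0] * n
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if t[u] < t[v] + w - eps:
                t[u], changed = t[v] + w, True
        if not changed:
            return True
    return False


def ocss_period(n: int, circuit: Circuit, tol: float = 1e-6) -> float:
    """Bisect on P between 0 and the longest path delay."""
    lo = 0.0
    # With zero skew, every constraint holds at P = longest maximum delay.
    hi = max(d_max for _d_min, d_max in circuit.values())
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(n, circuit, mid):
            hi = mid
        else:
            lo = mid
    return hi  # always a feasible period


# Hypothetical two-register circuit (not the paper's circuit ex).
toy = {(0, 1): (1.0, 5.0), (1, 0): (1.0, 1.0)}
print(round(ocss_period(2, toy), 3))  # 4.0
```

For this hypothetical circuit the longest path delay is 5, clock skew scheduling reaches 4, and the critical cycle at that period contains a D-edge.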
THE MOTIVATION
The delay-to-register ratio of a directed cycle $C$ in a circuit graph $G$ is defined as (the maximum delay of $C$) / (the number of registers in $C$). Clearly, the maximum delay-to-register ratio of the directed cycles in the circuit graph gives a lower bound of the clock period. We say that $p_{ij}$ is the increase of the minimum delay from register $R_i$ to register $R_j$.
Definition 2: A directed cycle in the constraint graph is a Z-cycle if and only if all the directed edges in this cycle are Z-edges.
Theorem 1: Suppose that $G$ is a circuit graph whose minimum clock period is $P$. If no critical cycle in the constraint graph $G_{cg}(G,P)$ is a Z-cycle, there exists a delay-inserted circuit graph $G'$ whose minimum clock period $P' < P$.
Proof: We prove the theorem by providing a method to construct the delay-inserted circuit graph $G'$. Suppose that $T_{PD_{ij}(\max)} = D_{ij}$ and $T_{PD_{ij}(\min)} = d_{ij}$ in the circuit graph $G$. In the delay-inserted graph $G'$, the increase of the minimum delay from register $R_i$ to register $R_j$ is denoted as $p_{ij}$. We let $p_{ij} = 0$ except for the following considerations. It is clear that $G_{cg}(G,P)$ contains no positive cycle. Each critical cycle $C$ in the constraint graph $G_{cg}(G,P)$ has at least one critical D-edge $e_d(R_i,R_j)$. An increase of the minimum delay $T_{PD_{ij}(\min)}$ can make the cycle $C$ become non-critical. To increase $T_{PD_{ij}(\min)}$, we consider the following two conditions.
(1) If $d_{ij} < D_{ij}$, we let $0 < p_{ij} < D_{ij} - d_{ij}$. The weight of the D-edge $e_d(R_i,R_j)$ in $G_{cg}(G',P)$ is $-(d_{ij} + p_{ij})$, which is less than that in $G_{cg}(G,P)$ by $p_{ij}$. As a result, the cycle $C$ in the constraint graph $G_{cg}(G',P)$ becomes non-critical.
(2) If $d_{ij} = D_{ij}$, we let $0 < p_{ij} < P$. The weight of the D-edge $e_d(R_i,R_j)$ in $G_{cg}(G',P)$ is $-(d_{ij} + p_{ij})$, which is less than that in $G_{cg}(G,P)$ by $p_{ij}$. As a result, the cycle $C$ in the constraint graph $G_{cg}(G',P)$ becomes non-critical. On the other hand, the weight of the Z-edge $e_z(R_j,R_i)$ in $G_{cg}(G',P)$ is $d_{ij} + p_{ij} - P$, which is larger than that in $G_{cg}(G,P)$ by $p_{ij}$. Since $e_d(R_i,R_j)$ is a critical D-edge in $G_{cg}(G,P)$, we have $T_{C_i} - T_{C_j} = -D_{ij} = -d_{ij}$. The Z-edge $e_z(R_j,R_i)$ in $G_{cg}(G',P)$ corresponds to $T_{C_j} - T_{C_i} \ge d_{ij} + p_{ij} - P$. Since $0 < p_{ij} < P$, we have $T_{C_j} - T_{C_i} = d_{ij} > d_{ij} + p_{ij} - P$, so in $G_{cg}(G',P)$ the cycles containing the Z-edge $e_z(R_j,R_i)$ are non-critical. Q.E.D.
Lemma 1: Let $G$ be the circuit graph of a sequential circuit ckt. Let $G'$ be a delay-inserted circuit graph of $G$. Assume that $G'$ works with clock period $P'$. We can implement a sequential circuit that works with clock period $P'$ by applying the padding method [5] to the sequential circuit ckt. Note that a feasible padding can be found in polynomial time.
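The lower bound defined at the start of this section, the maximum delay-to-register ratio over all directed cycles, can be checked on small graphs by brute force. The sketch below is only an illustration under stated assumptions: the function name `max_delay_to_register_ratio`, the dictionary representation, and the circuit `toy` are hypothetical, and the cycle enumeration is exponential, so it is suitable for tiny examples only. Since every vertex is a register, the number of registers in a cycle equals its number of edges.

```python
from typing import Dict, List, Set, Tuple

Circuit = Dict[Tuple[int, int], Tuple[float, float]]  # (i, j) -> (min delay, max delay)


def max_delay_to_register_ratio(n: int, circuit: Circuit) -> float:
    """Enumerate simple directed cycles and return the largest ratio."""
    succ: Dict[int, List[Tuple[int, float]]] = {v: [] for v in range(n)}
    for (i, j), (_d_min, d_max) in circuit.items():
        succ[i].append((j, d_max))
    best = 0.0

    def dfs(start: int, v: int, delay: float, length: int, on_path: Set[int]) -> None:
        nonlocal best
        for w, d_max in succ[v]:
            if w == start:
                # Closed a cycle with length + 1 edges, hence length + 1 registers.
                best = max(best, (delay + d_max) / (length + 1))
            elif w > start and w not in on_path:
                # Only visit larger vertices so each cycle is found from its smallest vertex.
                on_path.add(w)
                dfs(start, w, delay + d_max, length + 1, on_path)
                on_path.remove(w)

    for s in range(n):
        dfs(s, s, 0.0, 0, {s})
    return best


# Hypothetical two-register circuit (not the paper's circuit ex).
toy = {(0, 1): (1.0, 5.0), (1, 0): (1.0, 1.0)}
print(max_delay_to_register_ratio(2, toy))  # 3.0
```

For this hypothetical circuit the only cycle has a total maximum delay of 6 over 2 registers, so no sequential timing optimization can push the clock period below 3.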
THE APPROACH
Given a circuit graph $G_{in}$, our optimization goal is to derive a delay-inserted graph $G_{opt}$ whose minimum clock period $P_{opt}$ achieves the lower bound $P_B(G_{in})$. In addition to optimizing the clock period, the proposed algorithm also tries to minimize the required inserted delays under the clock period $P_B(G_{in})$. Figure 4 gives the pseudo code of the proposed algorithm, where $G_{in}$ is the initial circuit graph. The subroutine OCSS denotes the optimal clock skew scheduling. In the $k$th iteration, the notation $G_{INS(k)}$ denotes the circuit graph obtained by delay insertion, the notations $S_{INS(k)}$ and $P_{INS(k)}$ denote the optimal clock skew schedule and the minimum clock period with respect to $G_{INS(k)}$, and $G_{MIN(k+1)}$ and $S_{MIN(k+1)}$ denote the circuit graph and the optimal clock skew schedule obtained by delay minimization. Suppose that $T_{PD_{ij}(\max)} = D_{ij}$ and $T_{PD_{ij}(\min)} = d_{ij}$ in the circuit graph $G_{in}$.
Since the circuit graphs $G_{INS(k)}$ and $G_{MIN(k+1)}$ are delay-inserted graphs of $G_{in}$, in the following they are represented in the form $T_{PD_{ij}(\max)} = \max(D_{ij}, d_{ij} + p_{ij})$ and $T_{PD_{ij}(\min)} = d_{ij} + p_{ij}$, where $p_{ij}$ is the increase of the minimum delay and $p_{ij} \ge 0$.
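As a small worked instance of this representation, take the hypothetical values $D_{ij} = 5$ and $d_{ij} = 1$ (chosen for illustration only, not taken from the circuit ex):

```latex
\begin{align*}
p_{ij} = 2 &: \quad T_{PD_{ij}(\min)} = 1 + 2 = 3, \quad T_{PD_{ij}(\max)} = \max(5,\, 3) = 5,\\
p_{ij} = 6 &: \quad T_{PD_{ij}(\min)} = 1 + 6 = 7, \quad T_{PD_{ij}(\max)} = \max(5,\, 7) = 7.
\end{align*}
```

The maximum delay is therefore affected only once $p_{ij}$ exceeds $D_{ij} - d_{ij}$.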
Procedure Our_Approach(G_in)
  k = 1; G_MIN(k) = G_in;
  (S_MIN(k), P_MIN(k)) = OCSS(G_MIN(k));
  repeat
    G_INS(k) = Delay_Insertion(G_MIN(k), S_MIN(k));
    (S_INS(k), P_INS(k)) = OCSS(G_INS(k));
    (G_MIN(k+1), S_MIN(k+1), P_MIN(k+1)) = Delay_Minimization(G_INS(k), P_INS(k));
    k = k+1;
  until (P_MIN(k-1) == P_INS(k-1));
  G_opt = G_MIN(k-1); S_opt = S_MIN(k-1); P_opt = P_MIN(k-1);
Figure 4. Pseudo code of the proposed algorithm.
The subroutine Delay_Insertion obtains a circuit graph $G_{INS(k)}$ by inserting delays into the data paths in $G_{MIN(k)}$ whose minimum delays are critical with respect to $S_{MIN(k)}$. Figure 5 gives the pseudo code. We examine every data path in arbitrary sequence. For every critical minimum delay $T_{PD_{ij}(\min)}$, we consider the following two conditions. If $T_{PD_{ij}(\min)} < T_{PD_{ij}(\max)}$, then we let the increased delay $p_{ij}$ be $D_{ij} - d_{ij}$; if $T_{PD_{ij}(\min)} = T_{PD_{ij}(\max)}$, then we let the increased delay $p_{ij}$ be $p_{ij} + (P_{MIN(k)}/2)$. After every data path is examined, we obtain a circuit graph $G_{INS(k)}$, which is a delay-inserted graph of $G_{MIN(k)}$. All the D-edges in the constraint graph $G_{cg}(G_{INS(k)}, P_{MIN(k)})$ are non-critical. Thus, applying OCSS to $G_{INS(k)}$ may lead to further clock period minimization.
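The two insertion rules described for Delay_Insertion can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `delay_insertion`, the split of a delay-inserted graph into a `base` dictionary of $(d_{ij}, D_{ij})$ pairs plus a dictionary `p` of increases, the tolerance `tol`, and the sample values are all assumptions made here.

```python
from typing import Dict, List, Tuple

Path = Tuple[int, int]


def delay_insertion(
    base: Dict[Path, Tuple[float, float]],  # (i, j) -> (d_ij, D_ij) in G_in
    p: Dict[Path, float],                   # current increase p_ij of each minimum delay
    skew: List[float],                      # clock schedule S_MIN(k), indexed by register
    period: float,                          # P_MIN(k)
    tol: float = 1e-6,                      # tolerance for the equality test (assumption)
) -> Dict[Path, float]:
    """Return the increases p_ij of G_INS(k)."""
    new_p = dict(p)
    for (i, j), (d, D) in base.items():
        t_min = d + p[(i, j)]
        t_max = max(D, t_min)
        # A minimum delay is critical when its D-edge constraint
        # T_Ci - T_Cj >= -t_min holds with equality.
        if abs((skew[i] - skew[j]) + t_min) > tol:
            continue
        if t_min < t_max:
            new_p[(i, j)] = D - d                    # raise the minimum delay up to D_ij
        else:
            new_p[(i, j)] = p[(i, j)] + period / 2   # minimum equals maximum: add P/2
    return new_p


# Hypothetical two-register circuit with schedule (0, 1) at period 4.
base = {(0, 1): (1.0, 5.0), (1, 0): (1.0, 1.0)}
p0 = {(0, 1): 0.0, (1, 0): 0.0}
print(delay_insertion(base, p0, [0.0, 1.0], 4.0))  # {(0, 1): 4.0, (1, 0): 0.0}
```

Only the data path from register 0 to register 1 has a critical minimum delay in this hypothetical case, so only its increase changes.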
The subroutine Delay_Minimization obtains a circuit graph $G_{MIN(k+1)}$ from the circuit graph $G_{INS(k)}$ by minimizing the increased delay $p_{ij}$ of the minimum delay from register $R_i$ to register $R_j$ with respect to the clock period $P_{INS(k)}$. Figure 6 gives the pseudo code. We examine every delay-inserted data path in arbitrary sequence. When a delay-inserted data path is examined, we let $p_{ij}$ become a variable. Suppose that $p_{ij} = a_{ij}$ in the circuit graph $G_{INS(k)}$. Then, the variable $p_{ij}$ has the following property: if $p_{ij} = a$ works with clock period $P_{INS(k)}$, then $p_{ij} = a'$ also works with clock period $P_{INS(k)}$, where $0 \le a \le a' \le a_{ij}$; if $p_{ij} = a$ does not work with clock period $P_{INS(k)}$, then $p_{ij} = a'$ also does not work with clock period $P_{INS(k)}$, where $0 \le a' \le a \le a_{ij}$. Thus, we can use the binary search strategy to find the minimum value of the variable $p_{ij}$, where the interval of binary search is $[0, a_{ij}]$. After every delay-inserted data path is examined, we obtain the circuit graph $G_{MIN(k+1)}$. The schedule $S_{MIN(k+1)}$ can be found by solving the longest path problem on the constraint graph $G_{cg}(G_{MIN(k+1)}, P_{INS(k)})$. Note that $G_{INS(k)}$ also works with clock period $P_{INS(k)}$ under $S_{MIN(k+1)}$.
Procedure Delay_Minimization(G_INS(k), P_INS(k))
  G = G_INS(k);
  for each delay-inserted data path in G_INS(k) do
  { let p_ij be a variable in the constraint graph G_cg(G, P_INS(k)) and use the binary search strategy to find the minimum value of p_ij;
    let p_ij in G be the minimum value obtained by the binary search; }
Figure 6. Pseudo code of the subroutine Delay_Minimization.
[...] $G_{cg}(G_{MIN(k)}, P_{MIN(k)})$, then $P_{INS(k)} < P_{MIN(k)}$.
Theorem 4: The clock period $P_{opt}$ obtained by our approach achieves $P_B(G_{in})$.
Proof: The iteration in our approach terminates only when a critical Z-cycle is formed. Thus, $P_{opt} = P_B(G_{opt})$. Due to delay insertion, we have $P_B(G_{opt}) \ge P_B(G_{in})$. From Theorem 3, we know that $e_z(R_j,R_i)$ is not a critical Z-edge if $p_{ij} > D_{ij} - d_{ij}$. Therefore, our approach still terminates with the Z-cycle that determines $P_B(G_{in})$. Consequently, we have $P_B(G_{opt}) = P_B(G_{in})$. Q.E.D.
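The binary search used by Delay_Minimization can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `works` and `delay_minimization`, the `base`/`p` representation of a delay-inserted graph, the stopping tolerance `tol`, and the two-register example are hypothetical. The feasibility test is a longest-path relaxation on the constraint graph, and the search relies on the monotone property of $p_{ij}$ stated above.

```python
from typing import Dict, Tuple

Path = Tuple[int, int]
Base = Dict[Path, Tuple[float, float]]  # (i, j) -> (d_ij, D_ij) in G_in


def works(n: int, base: Base, p: Dict[Path, float], period: float) -> bool:
    """True if some clock schedule satisfies every constraint at this period."""
    edges = []
    for (i, j), (d, D) in base.items():
        t_min = d + p[(i, j)]
        t_max = max(D, t_min)
        edges.append((i, j, -t_min))          # D-edge
        edges.append((j, i, t_max - period))  # Z-edge
    t = [0.0] * n
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if t[u] < t[v] + w - 1e-9:
                t[u], changed = t[v] + w, True
        if not changed:
            return True
    return False  # positive cycle


def delay_minimization(n: int, base: Base, p: Dict[Path, float],
                       period: float, tol: float = 1e-6) -> Dict[Path, float]:
    """Shrink each inserted delay to (approximately) the smallest working value."""
    p = dict(p)
    for path in base:
        if p[path] <= 0:
            continue                  # not a delay-inserted data path
        hi = p[path]                  # a_ij, assumed to work with this period
        p[path] = 0.0
        if works(n, base, p, period):
            continue                  # the inserted delay is not needed at all
        lo = 0.0                      # known not to work
        while hi - lo > tol:
            mid = (lo + hi) / 2
            p[path] = mid
            if works(n, base, p, period):
                hi = mid
            else:
                lo = mid
        p[path] = hi                  # keep the smallest value known to work
    return p


# Hypothetical two-register circuit after delay insertion, at period 3.
base = {(0, 1): (1.0, 5.0), (1, 0): (1.0, 1.0)}
p_ins = {(0, 1): 4.0, (1, 0): 0.0}
print(delay_minimization(2, base, p_ins, 3.0))
```

In this hypothetical case the inserted delay of 4 shrinks to about 1, the smallest value for which a period of 3 remains feasible.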
Theorem 5: Our approach has polynomial time complexity.
Proof: $P_{INS(k)}$ can be smaller than $P_{MIN(k)}$ only when $G_{INS(k)}$ contains at least one data path $(R_i,R_j)$ whose $p_{ij}$ value is larger than that in $G_{MIN(k)}$. The number of data paths is polynomial. The maximum number of iterations to increase the minimum delay of a data path is also polynomial. Q.E.D.
In the following, we use the sequential circuit ex to illustrate the proposed algorithm. Next, our algorithm moves to the procedure Delay_Minimization. Both inserted delays are 1 tu in $G_{MIN(3)}$, $S_{MIN(3)} = (0,-2,-1,0,2)$, and $P_{MIN(3)} = 4$ tu. The constraint graph $G_{cg}(G_{MIN(3)}, P_{MIN(3)})$ is given in Figure 8. Then, we move to the procedure Delay_Insertion. The two inserted delays become 5 tu and $1 + (P_{MIN(3)}/2) = 3$ tu in $G_{INS(3)}$. However, we cannot further reduce the clock period. Finally, we move to the procedure Delay_Minimization, which again yields 1 tu for both inserted delays. Thus, $G_{opt} = G_{MIN(4)} = G_{MIN(3)}$, $S_{opt} = S_{MIN(3)}$, and $P_{opt} = P_{MIN(4)} = P_{MIN(3)}$.
We apply the padding method to the sequential circuit ex. The padded sequential circuit ex' is given in Figure 9. Two delay elements M and N are added. The minimum delays of the two delay-inserted data paths are increased from 1 tu to 2 tu. The maximum delay of one of these data paths is also increased from 1 tu to 2 tu. The sequential circuit ex' works with clock period $P_{opt} = 4$ tu under $S_{opt} = (0,-2,-1,0,2)$. Note that, due to delay minimization, the two inserted delays are minimized. In fact, however, the clock skew schedule $(0,-2,-1,0,2)$ works with clock period $P_{opt} = 4$ tu if one inserted delay is between 1 tu and 5 tu and the other is between 1 tu and 3 tu. Therefore, the permissible range of delay insertion is quite large. Due to this property, it is easy to find a feasible solution by using a cell library.
EXPERIMENTAL RESULTS
The ISCAS'89 benchmarks are targeted to a 0.35 μm cell library. Table 1 tabulates the characteristics of the test circuits. The columns #Cells, #Registers, #DataPaths, Longest, and $P_{OCSS}$ denote the number of cells, the number of registers, the number of data paths, the longest path delay, and the minimum clock period obtained by the OCSS, respectively.
The proposed algorithm has been implemented in the C programming language on a Sun Ultra-5 workstation. Table 2 tabulates the experimental results of our algorithm, including the number of delay-inserted data paths and the minimum clock period (i.e., $P_{opt}$). Compared with $P_{OCSS}$, the result obtained by our algorithm (i.e., $P_{opt}$) is a significant improvement. Note that, in every test circuit, our optimized result achieves the lower bound of the clock period and can be obtained within only a few seconds.
We also realize the delay insertion by using the buffers in the cell library, which provides four types of buffers. Table 3 tabulates the implementation results. The column Delay Insertion gives the number of buffers of each of the four types used for delay insertion. The column $P_{imp}$ is the minimum clock period obtained by the implementation. Since buffer insertion may affect other cells, e.g., through the wire capacitances and transition times of adjacent cells, $P_{imp}$ may be slightly larger than $P_{opt}$.
CONCLUSIONS
This paper incorporates optimal clock skew scheduling and delay insertion for the synthesis of non-zero clock skew circuits. The proposed algorithm not only optimizes the clock period, but also tries to minimize the required inserted delays. Benchmark data consistently show that our approach works well in practice.
ACKNOWLEDGEMENTS
This work was supported in part by the National Science Council of R.O.C. under contract number NSC 91-2215-E-033-005.
