A common technique in high-performance hardware design is to intersperse combinatorial logic freely between level-sensitive latch layers (wherein one layer is transparent during the "high" clock phase, and the next during the "low"). Such logic poses a challenge to verification: unless the two-phase netlist N may be abstracted to a full-cycle model N′ (wherein each memory element may sample every cycle), model checking of N requires at least twice as many state variables as would be necessary to obtain equivalent coverage for N′. We present an algorithm to automatically obtain such an abstraction by selectively eliminating latches from both layers. The abstraction is valid for model checking CTL* formulae which reason solely about latches of a single phase. This algorithm has been implemented in IBM's model checker, RuleBase, and has been used to enable model checking of IBM's Gigahertz Processor, which may not have been feasible otherwise. This abstraction has furthermore allowed verification engineers to write properties and environments more efficiently.
Introduction
A latch is a hardware memory element with two Boolean inputs (data and clock) and one Boolean output. A behavioral definition for latches is provided in [1]. High-performance netlists often must use level-sensitive latches [2]. For such a latch, when its clock input is a certain value (e.g., a logical "1"), the value at its data input will be propagated to its data output (i.e., transparent mode); otherwise, its last propagated value is held at its output.
The clock is modeled as a signal which alternates between 0 and 1 at every time-step. A latch which samples when the clock is a 1 will be denoted as an L1 latch; one which samples when the clock is a 0 will be denoted as an L2 latch. Hardware design rules, arising from timing constraints, require any logic path between two L1 latches to pass through an L2 latch, and vice-versa. An elementary design style requires each L1 latch to feed directly to an L2 latch (called a master-slave latch pair), and allows only L2 latches to drive combinatorial logic. However, a common high-performance hardware development technique involves utilizing combinatorial logic freely between L1 and L2 latches to better utilize each half-cycle. It should be noted that such designs are typically explicitly implemented in this manner; this topology is not the byproduct of a synthesis tool.
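The latch and clock conventions above can be illustrated with a minimal sketch. The `Latch` class and `master_slave_trace` helper are hypothetical names introduced only for illustration; the sketch assumes that each latch is a state variable updated once per time-step, with no combinatorial flow-through between adjacent latches.

```python
class Latch:
    """Level-sensitive latch modeled as a state variable (assumed one-step update)."""

    def __init__(self, kind, init=0):
        assert kind in ("L1", "L2")
        self.kind = kind    # "L1" samples when clock == 1, "L2" when clock == 0
        self.value = init   # last propagated value, visible at the output

    def transparent(self, clock):
        return clock == 1 if self.kind == "L1" else clock == 0

    def step(self, data, clock):
        """Advance one time-step: sample data if transparent, otherwise hold."""
        if self.transparent(clock):
            self.value = data


def master_slave_trace(inputs, init=0):
    """Drive an L1 -> L2 master-slave pair; record the L2 output at each time-step."""
    l1, l2 = Latch("L1", init), Latch("L2", init)
    trace = []
    for i, x in enumerate(inputs):
        clock = 1 if i % 2 == 0 else 0   # alternates every time-step, starting at 1
        trace.append(l2.value)
        l1_out = l1.value                # read outputs before updating (no flow-through)
        l1.step(x, clock)
        l2.step(l1_out, clock)
    return trace


print(master_slave_trace([1, 1, 0, 0, 1, 1]))  # [0, 0, 1, 1, 0, 0]
```

Under these assumptions the L2 output changes at most once every two time-steps, which is the behavior a single flip-flop clocked every full cycle would exhibit.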
There are two major problems with the verification of such netlists. First, because of the larger number of latches, the verification tool requires much more time and memory. Additionally, the manual modeling of environments and properties is more complicated in that they must be written in terms of the less abstract half-cycle model, and an oscillating clock must be explicitly introduced.
Most hardware compilers will allow automatic translations of a master-slave latch pair into a single flip-flop; retiming algorithms [3] may be used to retime the netlist such that L1-L2 layers become adjacent and one-to-one. However, retiming adds complexity in that the specification, the environment, and any witnesses / counterexamples (all of which may "observe" the netlist), may need to be retimed as well to match the retimed, full-cycle model.
We develop an efficient algorithm for abstracting a half-cycle netlist N to a full-cycle model N′, which may be utilized for enhanced verification in any FSM-based verification framework (e.g., simulation and model checking). We will achieve this by selectively eliminating some latches. We will use a notion of dual-phase-bisimulation equivalence between the abstracted and unabstracted models. This equivalence ensures that specification and environment written in terms of L2 latch outputs need not be modified other than a conversion to full-cycle format (as will be discussed in Sect. 3). Our algorithm performs the maximum such reduction, and thus provides an important model reduction step which may greatly augment existing techniques (such as retiming, cone-of-influence, etc.). As we show, this reduction alone reduces the number of state variables by at least one-half, and has greatly enhanced the model checking of IBM's Gigahertz Processor, which may not have been feasible otherwise (as demonstrated by our experimental evidence). This abstraction is now part of the model checker RuleBase [4]. Additionally, designers and verification engineers prefer to reason about the full-cycle models.
The optimality of the algorithm results from the identification of minimal dependent layers (MDL) of latches, and removing all L1s or all L2s per MDL.
Definition 1. A dependent layer is a set of L1 latches ℒ1 and L2 latches ℒ2, such that ℒ2 is a superset of all latches in the transitive fanout of ℒ1, and ℒ1 is a superset of all latches in the transitive fanin of ℒ2.
Definition 2. A dependent layer is termed minimal if and only if there does not exist a nonempty set of L1 and L2 latches L which may be removed from that layer and still result in a nonempty dependent layer.
Consider the netlist in Fig. 1 (the triangles denote combinatorial logic, and the rectangles denote latches). The L1 latches are shaded. The two unique MDLs are marked with dotted boxes. Merely removing all L1s or all L2s will not yield an optimum reduction in this case; the L1s of layer A, and the L2s of layer B should be removed to yield an optimum solution for this netlist, which removes four of the six latches.
In Sect. 2 we introduce a half-cycle netlist, and two different abstracted full-cycle models of this netlist. In Sect. 3 we study the state space of the netlist and its two abstracted models to demonstrate the validity of the abstraction for CTL* formulae which reason solely about latches of a single type (L1 or L2). In Sect. 4 we introduce the algorithm used to perform the netlist reduction, and demonstrate its optimality. In Sect. 5 we give some experimental results of the use of this algorithm as implemented in RuleBase [4] for application to IBM's Gigahertz Processor.
Half-Cycle versus Full-Cycle Models
Consider the half-cycle netlist, denoting an MDL, shown in Fig. 2 . All nets and primitives may be vectors. The first two rules are enforced by hardware timing constraints. Note that, at the periphery of a design, there may be some inputs which have L2 latches in their transitive fanout, and outputs which are in the transitive fanout of L1 latches (thus violating rules 3 and 4). While the analysis presented in this paper disallows such connectivity for simplicity, these cases are supported by our implementation; for ease of reasoning, we have found it beneficial to preserve all L2 latches which violate rule 3, and to remove all L1 latches which violate rule 4.
The notion of MDLs (Defn. 2) allows us to partition the design under test into a maximum number of partitions such that each is DP. Next, we propose two abstractions for DP netlists. For each DP partition of the original design, one of these two abstractions may be applied independently of the other partitions, thus yielding an overall abstraction which has a globally minimum number of latches (refer to Theorem 5). This minimum would, in general, be less than removing either all of the L1 or all of the L2 latches.
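The benefit of choosing per partition can be checked with simple arithmetic. The sketch below uses hypothetical latch counts (two layers of three latches each, chosen only to be consistent with a six-latch netlist in which four latches can be removed); `remaining_latches` is an illustrative helper, not part of the described tool.

```python
def remaining_latches(layers):
    """layers: list of (num_l1, num_l2) pairs, one pair per MDL.

    Returns the number of latches left after three strategies:
    remove all L1s, remove all L2s, or remove the larger side of each MDL.
    """
    all_l1_removed = sum(n2 for _, n2 in layers)        # every L2 is kept
    all_l2_removed = sum(n1 for n1, _ in layers)        # every L1 is kept
    per_layer = sum(min(n1, n2) for n1, n2 in layers)   # smaller side kept per MDL
    return all_l1_removed, all_l2_removed, per_layer


# Hypothetical counts: layer A has two L1s and one L2, layer B one L1 and two L2s.
layers = [(2, 1), (1, 2)]
print(remaining_latches(layers))  # (3, 3, 2)
```

With these counts, either uniform strategy leaves three latches, while the per-layer choice leaves two.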
In this paper, we assume that properties may only refer to the L2 nets (which we term L2-visible properties). In our actual implementation, we also handle the case where the properties refer only to L1 nets. Furthermore, by forcing our tool to remove only L1 or only L2 latches (i.e., restricting its freedom to choose which type to remove), each property may refer to both types of nets. However, we skip these generalizations in this paper.
The Abstracted Models
The values of the nets in Fig. 2 are specified for time-steps i ≥ 0. The prespecified initial values of the latches are B_0(0..m−1) and D_0(0..n−1). Let c denote the clock input, which initializes to 1, and alternates between 1 and 0 at every time-step, indicating whether the L1 or L2 latches (respectively) are presently transparent. The subscript i means "at time i".
Either layer of latches may be removed (and the remaining layer transformed to flip-flops which may be clocked every cycle, rather than by an alternating clock), and the resulting abstracted model will be shown to be bisimilar to the original netlist. Fig. 3 shows the first abstraction with layer L2 removed. We need a new variable, labeled f, whose initial value is 1, and thereafter is 0. This latch ensures that the initial value D_0 from the original netlist N (which need not be deterministic) is applied to the combinatorial nets D in N′. B_0(0..m−1) is still the initial value of the remaining latches.
Note that the f variable is unnecessary for the second abstraction, in which layer L1 is removed; the initial value of the removed latch does not propagate.
It is noteworthy that either one of the two abstractions may be chosen; since the layers may be of differing width (m ≠ n), the removal of one layer may result in a smaller state space than the other. We term both of the above reductions dual-phase (DP) reductions.
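The correspondence between the half-cycle netlist and the two reductions can be explored with a small simulation. This is a sketch under stated assumptions, not the construction of Figs. 2 and 3: the functions `g1` and `g2` are hypothetical stand-ins for the combinatorial logic feeding the L1 and L2 layers, latches are modeled as state variables updated once per time-step, and the full-cycle input at step j is taken to be the half-cycle input at time 2j.

```python
import random


def g1(x, d):
    """Hypothetical logic feeding the L1 layer from input x and L2 output d."""
    return (x ^ d) & 0xF


def g2(b):
    """Hypothetical logic feeding the L2 layer from L1 output b."""
    return (3 * b + 1) & 0xF


def half_cycle(xs, b0, d0):
    """Original netlist N: one input per time-step; returns D at every time-step."""
    b, d, trace = b0, d0, []
    for i, x in enumerate(xs):
        trace.append(d)
        if i % 2 == 0:      # clock = 1: L1 transparent, L2 holds
            b = g1(x, d)
        else:               # clock = 0: L2 transparent, L1 holds
            d = g2(b)
    return trace


def full_cycle_l2_removed(xs, b0, d0):
    """First reduction: L1s become flip-flops; a multiplexor on f injects D_0."""
    b, f, trace = b0, 1, []
    for x in xs:
        d = d0 if f else g2(b)   # multiplexor standing in for the removed L2 layer
        trace.append(d)
        b = g1(x, d)
        f = 0                    # f is 1 initially and 0 thereafter
    return trace


def full_cycle_l1_removed(xs, d0):
    """Second reduction: L1s become wires; L2s become flip-flops. No f is needed."""
    d, trace = d0, []
    for x in xs:
        trace.append(d)
        d = g2(g1(x, d))
    return trace


rng = random.Random(0)
xs = [rng.randrange(16) for _ in range(40)]
b0, d0 = rng.randrange(16), rng.randrange(16)
half = half_cycle(xs, b0, d0)
assert half[0::2] == half[1::2]                                 # D is stable across each cycle
assert half[0::2] == full_cycle_l2_removed(xs[0::2], b0, d0)    # D_{2j} matches step j
assert half[0::2] == full_cycle_l1_removed(xs[0::2], d0)
```

In this model the value of B_0 never reaches D, which is consistent with the observation that the initial value of a removed L1 latch does not propagate.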
Validity of Abstraction
We define a notion of dual-phase-bisimulation relation (inspired by Milner's bisimulation relations [5] ); this notion is preserved for composition of Moore machines. Further, if two structures are related by a dual-phase-bisimulation relation, we show that L2−visible CTL* properties are preserved (modulo a simple transformation). We show the existence of a dual-phase-bisimulation relation for both abstractions presented in the previous section.
We will relate our designs to Kripke structures, which are defined as follows.
A Kripke structure is a tuple ⟨S, S_0, A, L, R⟩, where S is a set of states, S_0 ⊆ S is the set of initial states, A is the set of atomic propositions, L : S → 2^A is the labeling function, and R ⊆ S × S is the transition relation.
Our designs are described as Moore machines (using Moore machines, instead of the more general Mealy machines [6] , simplifies the exposition for this paper, though our implementation is able to handle Mealy machines). We use the following definitions for a Moore machine and its associated structure (similar to Grumberg and Long [7] ). O is the output function.
and R((s, x), (t, y)) iff δ(s, x, t).
In the sequel we will use M to denote the Moore machine as well as the structure for the machine. We now define our notion of dual-phase-bisimilarity, which characterizes our proposed abstraction. A relation G ⊆ S × S′ relates an infinite path π = (s_0, s_1, s_2, ...) of M to an infinite path π′ = (s′_0, s′_1, s′_2, ...) of M′, denoted by G(π, π′), iff for every i, G(s_{2i}, s′_i).
Definition 7. Let M and M′ be two structures. A relation G ⊆ S × S′ is a dual-phase-bisimulation relation if G(s, s′) implies:
1. L(s) = L′(s′).
2. for every t, v ∈ S such that R(s, v) and R(v, t), we have L(s) = L(v), and there exists t′ ∈ S′ such that R′(s′, t′) and G(t, t′).
3. for every t′ ∈ S′ such that R′(s′, t′), there exist t, v ∈ S such that L(v) = L′(s′), R(s, v), R(v, t), and G(t, t′).
We say that a dual-phase-bisimulation exists from
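The three conditions of Defn. 7 can be checked mechanically on small explicit structures. The following is a minimal sketch; the dictionary encoding of a structure and the function name `is_dual_phase_bisimulation` are assumptions made for illustration, and initial states are not considered.

```python
def is_dual_phase_bisimulation(M, Mp, G):
    """M, Mp: dicts with 'label' (state -> frozenset) and 'R' (set of (s, t) pairs).
    G: set of (s, s') pairs relating states of M to states of Mp."""

    def succ(K, s):
        return {t for (u, t) in K["R"] if u == s}

    for s, sp in G:
        # Condition 1: related states carry the same labels.
        if M["label"][s] != Mp["label"][sp]:
            return False
        # Condition 2: every two-step move s -> v -> t of M keeps the label
        # across the first step and is matched by a single step of Mp.
        for v in succ(M, s):
            for t in succ(M, v):
                if M["label"][s] != M["label"][v]:
                    return False
                if not any((t, tp) in G for tp in succ(Mp, sp)):
                    return False
        # Condition 3: every step of Mp is matched by some two-step move of M.
        for tp in succ(Mp, sp):
            if not any(
                M["label"][v] == Mp["label"][sp] and (t, tp) in G
                for v in succ(M, s)
                for t in succ(M, v)
            ):
                return False
    return True


# Hypothetical half-cycle structure: a 4-state ring whose label changes every two steps.
M = {
    "label": {0: frozenset({"a"}), 1: frozenset({"a"}), 2: frozenset(), 3: frozenset()},
    "R": {(0, 1), (1, 2), (2, 3), (3, 0)},
}
# Hypothetical full-cycle structure: a 2-state ring with the same label sequence.
Mp = {"label": {0: frozenset({"a"}), 1: frozenset()}, "R": {(0, 1), (1, 0)}}

print(is_dual_phase_bisimulation(M, Mp, {(0, 0), (2, 1)}))  # True
print(is_dual_phase_bisimulation(M, Mp, {(1, 0)}))          # False
```

The second relation fails because state 1 sits in the middle of a cycle: its successor carries a different label, violating condition 2.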
The composition of Moore machines (M_1 ∥ M_2) is defined in the standard way [7], by allowing the outputs of one design to become inputs of the other. The following result is shown similarly to the proof that simulation precedence is preserved under composition [7].
The set of dual-phase-reducible CTL* formulae is a subset of CTL* formulae [8], and is a set of state and path formulae given by the following inductive definition. We also define the dual-phase reduction for such formulae: Definition 8. A dual-phase-reducible (DPR) CTL* formula φ, and its dual-phase reduction, denoted by Ω(φ), are defined inductively as:
Note that XX is transformed to X through the reduction; intuitively, this is due to the "doubling of the clock frequency", or the replacement of the oscillating clock with an "always active" clock, enabled by the abstraction. As an example, if φ = AG(rdy → (AXAX(req → AF(ack)))), then Ω(φ) = AG(rdy → (AX(req → AF(ack)))) (note that AXAXp is equivalent to AXXp). L2-visible properties may be readily expressed utilizing DPR CTL*, since latches of any given type may only toggle every second time-step; there is no need to express such a property with a single X, which is the only restriction we impose upon full CTL* expressibility.
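The effect of Ω on the example above can be sketched as a recursive rewrite over a formula tree. The tuple encoding and the function `omega` are hypothetical, and the clauses below are an assumption inferred from the example: paired next-time operators collapse to one, an unpaired next-time operator is rejected, and every other operator is mapped through unchanged.

```python
def omega(phi):
    """Dual-phase reduction on tuple-encoded formulae, e.g. ("AX", ("atom", "p")).

    Assumed clauses: XX p -> X p (likewise AXAX and EXEX); an unpaired
    next-time operator is treated as not dual-phase-reducible.
    """
    op = phi[0]
    if op == "atom":
        return phi
    if op in ("X", "AX", "EX"):
        inner = phi[1]
        if inner[0] != op:
            raise ValueError("single next-time operator: not dual-phase-reducible")
        return (op, omega(inner[1]))
    # Assumption: all remaining operators are preserved, with reduced operands.
    return (op,) + tuple(omega(arg) for arg in phi[1:])


body = ("->", ("atom", "req"), ("AF", ("atom", "ack")))
phi = ("AG", ("->", ("atom", "rdy"), ("AX", ("AX", body))))
print(omega(phi) == ("AG", ("->", ("atom", "rdy"), ("AX", body))))  # True
```

Rejecting an unpaired X mirrors the stated restriction that such a property need not, and here may not, be written with a single X.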
Let s and s′ be states of M and M′, and let π = (s_0, s_1, ...) and π′ = (s′_0, s′_1, ...) be infinite paths of M and M′, respectively. If G is a dual-phase-bisimulation relation such that G(s, s′) and G(π, π′), then
1. for every dual-phase-reducible CTL* state formula φ, s |= φ iff s′ |= Ω(φ).
2. for every dual-phase-reducible CTL* path formula φ, π |= φ iff π′ |= Ω(φ).
We describe the Moore specifications (Defn. 5) for the abstractions presented in Sect. As presented here, the properties cannot refer to inputs, since V_N does not contain inputs. This restriction is due to the requirement that visible labels of states s_{2i} and s_{2i+1} are identical (Defn. 7), and is not necessary if the inputs to the design do not change values between s_{2i} and s_{2i+1}. This assumption is typically sound; except for clock inputs (which no longer need to be modeled), synthesis timing constraints enforce this requirement (since the partition will ultimately be composed with other partitions, or occur at chip boundaries). After our abstraction, the environment is no longer constrained to toggle only once every two time-steps, but may toggle every time-step. This reflects a conversion of the environment from half-cycle to full-cycle, and (along with the synthesis requirements reflected in rules 1 and 2 of Defn. 3, and the synthesis requirement that the design be free from combinatorial loops) allows applicability of this abstraction to Mealy machines.
The first abstraction Proof. The following relation G between states of N and N′ is a dual-phase-bisimulation relation. G is defined so that it is 1 only for the following two cases:
Proof. The following relation G between states of N and N′ is a dual-phase-bisimulation relation. G is defined so that it is 1 only for the following two cases:
Algorithm for the Abstraction of DP Netlists
The algorithm picks a primary input at random; a while loop ensures that every primary input is chosen. It then finds the latches in the transitive fanout of this input; this set is called L1, and must consist solely of L1 latches (except for inputs connected to L2s, which are treated specially). It places these elements of L1 one-at-a-time into the set L1. For each latch in L1 not previously considered, it finds all L2 latches in the transitive fanout of L1; this set is denoted L2. It then looks for any latches in the transitive fanin of L2 (these must be L1s) and adds them to L1. It then iteratively ping-pongs between the L1s and L2s for this MDL until no new latches are found. These latches are now labeled with their type and layer identifier (which is then incremented), and a record is kept of the number of L1 and L2 latches in that layer. It then continues iteratively with the next element of the set L1.
The algorithm then looks for L1 latches in the transitive fanout of the L2s encountered in the previous layers. If it finds any, these new MDLs are explored iteratively as above until no new latches are encountered. The outer while loop then begins traversing from the next primary input.
If the algorithm encounters a previously-marked L1 latch while looking for an L2 latch (or vice-versa), it flags this violation. If no violation has been reported during the analysis, the netlist is DP, and reduction may proceed.
After the above analysis, either the L1 or the L2 set may be removed per layer; these layers are minimal by construction. A simple iteration over every MDL will yield optimum reductions; if the given layer has more L2s than L1s, the L2s of that layer should be replaced with multiplexors as discussed in Sect. 2.1. If not, the L1s of that layer should be replaced with wires.
If the type of all latches is provided (L1 versus L2), an alternate algorithm may simply iterate over each latch within the netlist, and calculate its MDL given its type. When this abstraction was initially deployed, no such type data was automatically available; the inputs provided a convenient point of reference.
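A minimal sketch of the alternate, type-driven variant follows. It is not the RuleBase implementation: the names `find_mdls` and `choose_removals` are hypothetical, the latch types are assumed to be given, and `fanout` is assumed to map each latch to the latches it reaches through combinatorial logic only. Peripheral inputs and outputs are ignored.

```python
from collections import deque


def find_mdls(kind, fanout):
    """kind: latch -> "L1" | "L2"; fanout: latch -> set of latches reached through
    combinatorial logic only. Returns a list of (l1_set, l2_set), one per MDL."""
    fanin = {latch: set() for latch in kind}
    for src, dsts in fanout.items():
        for dst in dsts:
            fanin[dst].add(src)

    layer_of, layers = {}, []
    for seed in kind:
        if kind[seed] != "L1" or seed in layer_of:
            continue
        l1s, l2s, queue = {seed}, set(), deque([seed])
        layer_of[seed] = len(layers)
        while queue:  # ping-pong between L1s (via fanout) and L2s (via fanin)
            latch = queue.popleft()
            if kind[latch] == "L1":
                neighbours, expected = fanout.get(latch, ()), "L2"
            else:
                neighbours, expected = fanin[latch], "L1"
            for n in neighbours:
                if kind[n] != expected:  # e.g. an L1 driving another L1
                    raise ValueError(f"L1-L2 connectivity violation at {n}")
                if n not in layer_of:
                    layer_of[n] = len(layers)
                    (l2s if expected == "L2" else l1s).add(n)
                    queue.append(n)
        layers.append((l1s, l2s))
    return layers


def choose_removals(layers):
    """Per MDL: remove the L2s if they outnumber the L1s, otherwise remove the L1s."""
    return [("L2", l2s) if len(l2s) > len(l1s) else ("L1", l1s) for l1s, l2s in layers]


# Hypothetical six-latch netlist with two MDLs.
kind = {"a1": "L1", "a2": "L1", "a3": "L2", "b1": "L1", "b2": "L2", "b3": "L2"}
fanout = {"a1": {"a3"}, "a2": {"a3"}, "a3": {"b1"}, "b1": {"b2", "b3"}}
layers = find_mdls(kind, fanout)
removed = choose_removals(layers)
print(sum(len(latches) for _, latches in removed))  # 4
```

In this hypothetical netlist the L1s of the first layer and the L2s of the second are removed, eliminating four of the six latches.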
Theorem 5. This algorithm performs optimum DP reductions.
Proof. By construction, each latch will be a member of exactly one MDL. Furthermore, the MDLs are of minimum size, resulting in a maximum number of dependent layers in the netlist. Since each MDL is independent of the others, the locally optimal solutions yield a globally optimum result.
Note that along any input-to-output path within a single MDL, exactly one flip-flop must exist after abstraction; if zero or two exist, the bisimulation is clearly broken. Take any latch from any MDL which was removed by the abstraction, and assume that it is an L1 latch. All L2s in the fanout of this L1 latch must remain. All L1s in the fanin of those L2s must have been removed, and so on until we are left with the case that (within this MDL) all of the L1s are removed, and all of the L2s remain, if this is a correct abstraction (similar reasoning applies to consideration of a removed L2 latch). This demonstrates optimality of reduction of each MDL.
The algorithm may be optimized to ensure that each combinatorial gate (or net) is only considered once in fanout traversal, and once in fanin traversal, to ensure that its complexity is O((netlist size)^2). However, in practice, we have found that the complexity of this algorithm grows roughly linearly with model size and takes a matter of seconds for even the largest designs we have considered for model checking (more than 8,000 latches). This near-linearity is not surprising for synthesizable high-performance netlists, since the depth of combinatorial logic between latches and the number of sinks of a net are restricted to ensure that the netlist meets timing constraints.
Experimental Results
The above algorithm was implemented in the model checker RuleBase [4], developed at the IBM Haifa Research Lab as an extension to SMV [9]. It is utilized as a first-pass netlist reduction technique; the reduced full-cycle model is saved and used as the basis for further optimizations before being passed to SMV for model checking.
This algorithm was deployed for use on many components of IBM's Gigahertz Processor. The reduction results obtained by this step are given in Table 1 below. These numbers do not reflect the results of any other reduction techniques. We recommend, due to the speed of this algorithm (O(n^2) in theory, but roughly O(n) in practice) and its global preservation of L2-visible properties, that it be used as a first-pass reduction technique upon design compilation. The resulting abstracted design may then be analyzed for formula-specific reductions (e.g., cone-of-influence, constant propagations, retiming), which are likely to proceed faster upon the abstracted design due to the smaller number of latches and simpler transition relation (the clock is no longer in the support of the transition relation). During the initial stages of model checking, this abstraction was not available. Once the abstraction became available, properties which previously took many hours to complete would finish in several minutes. More encompassing properties became feasible on the abstracted model which would not otherwise complete.
As a small example, a property run on the Load Serialization Logic which took 25.6 seconds, 36 MB of memory on the abstracted model (with 81 variables) took 450.2 seconds, 92 MB of memory for the unabstracted netlist (with 116 variables) on the same machine (an IBM RS/6000 Workstation Model 590 with 2 GB main memory), with no initial BDD order. This time includes that necessary to perform the netlist analysis and reduction. As a larger example, a property run on the Instruction Flushing Logic took 852 seconds of user time, 48 MB on the abstracted model (with 96 variables). This same property did not complete on the unabstracted netlist (with 162 variables) within 72 hours.
While it may seem surprising that the number of variables after abstraction is more than half that before abstraction, this is due to two phenomena. First, some of these variables are used for environment and specification; these are modeled directly as flip-flops (rather than L1-L2 latches). Second, in some cases, RuleBase was able to exploit some redundancy among these variables through other model reduction techniques (e.g., constant simulation).
The benefits obtained by this algorithm extend beyond a mere reduction in state depth, which reduces the time and memory consumed by reachability calculations. BDD variable reordering time is often greatly reduced (since the BDDs tend to be smaller, and since with fewer variables a "good ordering" tends to be faster to compute). The reduction to full-cycle models also reduces the number of image calculations necessary to reach a fixed-point or on-the-fly failure, since the diameter of the model is halved. Further, since fewer state variables require evaluation, it is possible that the above reduction may be exploited to "collapse" adjacent functions to a single function, which may be represented on the same BDD. However, this risks blowing up the BDD size; the functions may thus remain distinct and implicitly conjoined [10] to ensure proper evaluation.
With this abstraction available, as demonstrated above, model checking was enabled to verify much "larger" and more meaningful properties in less time. Users of our tool have found that writing specifications and environments for the full-cycle abstracted models is much less complex than for the corresponding half-cycle netlists (as is viewing traces). All RuleBase users quickly converted to running exclusively with this abstraction. There have been many hundreds of formulae written and model checked to date on this project, which collectively have exposed on the order of 200 bugs at various design stages. We have not encountered any properties we wished to specify which became impossible on the abstracted model. This algorithm thus provided an efficient and necessary means by which to free ourselves from the verification burdens imposed by the low level of the implementation.
It is noteworthy that roughly 70 HDL bugs were isolated due to violations of L1-L2 connectivity during this work. While algorithms for detecting such problems are simple (and other tools implementing such checks became available later in the design cycle), the many benefits resulting from this reduction provided strong motivation for quickly correcting these errors. Due to the nature of logic interpretation in simulation and model checking frameworks, the logic flawed in such a manner typically behaved "properly" for verification: these platforms assume zero combinatorial delay, but no combinatorial "flow-through" for two adjacent level-sensitive latches even if both are simultaneously in the transparent phase.
Conclusions
We have developed an efficient algorithm for identifying and abstracting dual-phase L1-L2 netlists. The algorithm performs netlist graph traversal, rather than FSM analysis, and hence is CPU-efficient: O(n^2) in theory, but roughly O(n) in practice due to timing constraints imposed upon synthesizable netlists. The benefits obtained by the abstraction include much smaller verification time and memory requirements (through "shallower" state depth, often less than one-half that necessary without the abstraction, which reduces complexity of the transition relation and simplifies BDD reordering, and a halving of the diameter of the model), as well as more abstract specification and environment definitions. A bisimulation relation is established between the unreduced and reduced models. This reduction is optimum, and is valid for model checking CTL* formulae which reason solely about latches of a given phase. Experimental results from the deployment of this algorithm (as implemented in the model checker RuleBase) upon IBM's Gigahertz Processor are provided, and illustrate its extreme practical benefit.
